Ldap module and base64 encoding

Mon May 27 01:15:01 EDT 2013

Hi Michael,

> Processing LDIF is one thing, doing LDAP operations another.
> 
> LDIF itself is meant to be ASCII-clean. But each attribute value can carry any
> byte sequence (e.g. attribute 'jpegPhoto'). There's no further processing by
> module LDIF - it simply returns byte sequences.
> 
> The access protocol LDAPv3 mandates UTF-8 encoding for Unicode strings on the
> wire if attribute syntax is DirectoryString, IA5String (mainly ASCII) or similar.
> 
> So if your LDIF input returns UTF-16 encoded attribute values for e.g.
> attribute 'cn' or 'o' or another attribute not being of OctetString or Binary
> syntax something's wrong with the producer of the LDIF data.

That could be. I am using Microsoft's ldifde.exe to dump a Domino and an AD
directory for comparative processing. The problem is that I don't have much
control over the data in the directory, and I do know that DNs have non-ASCII
characters unique to the

> I wonder what the string really is. At least the base64-encoding you provided
> before decodes as UTF-8 but I'm not sure whether it's the right sequence of
> Unicode code points you're expecting.
> 
> >>> 'ZGV0XDMzMTB3YmJccGc='.decode('base64').decode('utf-8')
> u'det\\3310wbb\\pg'
> 
> I still can't figure out what you're really doing though. I'd recommend to
> strip down your operations to a very simple test code snippet illustrating the
> issue and post that here.

So I have removed all my likely broken attempts at working with this data and will
soon have some simple code but at this point I may have an indication of what is
awry with my data.

After parsing the data for a user, I simply take a value from the LDIF file and
write it back out to another file, which fails. The value parsed is:

officestreetaddress:: T3R0by1NZcOfbWVyLVN0cmHDn2UgMQ==

  File "C:\Python27\lib\site-packages\ldif.py", line 202, in unparse
    self._unparseChangeRecord(record)
  File "C:\Python27\lib\site-packages\ldif.py", line 181, in _unparseChangeRecord
    self._unparseAttrTypeandValue(mod_type,mod_val)
  File "C:\Python27\lib\site-packages\ldif.py", line 142, in _unparseAttrTypeandValue
    self._unfoldLDIFLine(':: '.join([attr_type,base64.encodestring(attr_value).replace('\n','')]))
  File "C:\Python27\lib\base64.py", line 315, in encodestring
    pieces.append(binascii.b2a_base64(chunk))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xdf' in position 7: ordinal not in range(128)

> c:\python27\lib\base64.py(315)encodestring()
-> pieces.append(binascii.b2a_base64(chunk))
(Pdb) l
310     def encodestring(s):
311         """Encode a string into multiple lines of base-64 data."""
312         pieces = []
313         for i in range(0, len(s), MAXBINSIZE):
314             chunk = s[i : i + MAXBINSIZE]
315  ->         pieces.append(binascii.b2a_base64(chunk))
316         return "".join(pieces)
317
318
319     def decodestring(s):
320         """Decode a string."""
(Pdb) args
s = Otto-Meßmer-Straße 1

So moving up a frame or two and looking at the entry dict, I see a modlist entry of:
('streetAddress', [u'Otto-Me\xdfmer-Stra\xdfe 1']) which is correct:

In [2]: 'T3R0by1NZcOfbWVyLVN0cmHDn2UgMQ=='.decode('base64').decode('utf-8')
Out[2]: u'Otto-Me\xdfmer-Stra\xdfe 1'

Looking at the stack trace, I think I see the issue:
(Pdb) import base64
(Pdb) base64.encodestring(u'Otto-Me\xdfmer-Stra\xdfe 1'.encode('utf-8')).replace('\n','')
'T3R0by1NZcOfbWVyLVN0cmHDn2UgMQ=='

I now have exactly the value I started with. Two changes have eliminated all the
errors: wherever I handle the original values, I make sure I return UTF-8-decoded
objects for use in a modlist that is written out later, and I subclass LDIFWriter,
overriding _unparseAttrTypeandValue so that it does the encoding.
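
For the archive, the encoding step boils down to something like the sketch
below. It is not my actual subclass: to_utf8_bytes and encode_modlist are
hypothetical helper names, and it assumes modlist entries are
(attr_type, [values]) pairs like the one shown above. It also uses
base64.b64encode rather than encodestring, so there are no newlines to strip.

```python
import base64


def to_utf8_bytes(value):
    """Return value as UTF-8 bytes; byte strings pass through unchanged."""
    if isinstance(value, bytes):
        return value
    return value.encode('utf-8')


def encode_modlist(modlist):
    """Encode every value of (attr_type, [values]) pairs to UTF-8 bytes.

    Hypothetical helper: assumes the modlist shape seen in the debugger.
    """
    return [(attr_type, [to_utf8_bytes(v) for v in values])
            for attr_type, values in modlist]


# The entry seen in the debugger session.
modlist = [('streetAddress', [u'Otto-Me\xdfmer-Stra\xdfe 1'])]
encoded = encode_modlist(modlist)

# Base64 of the UTF-8 bytes, as it would appear after 'attr:: ' in LDIF.
b64_value = base64.b64encode(encoded[0][1][0]).decode('ascii')
print(b64_value)
```

With the value from the traceback this should give back the same base64 string
that was in the original LDIF file.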

The one thing that remains is that ldifde.exe outputs what looks like U+00BF (an
inverted question mark) for some values. Otherwise this issue looks solved.
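
To see which values are affected, something like the following could be run
over the parsed entries. This is only a sketch: find_suspect_values is a
hypothetical helper, the sample entry is made up, and it assumes each entry is
a dict mapping attribute names to lists of already-decoded (unicode) values.
U+00BF is also a legitimate character, so any hits are only candidates for
closer inspection.

```python
def find_suspect_values(entry, marker=u'\xbf'):
    """Return (attr, value) pairs whose value contains the marker character.

    Hypothetical helper: entry is assumed to map attribute names to lists
    of decoded text values.
    """
    return [(attr, value)
            for attr in sorted(entry)
            for value in entry[attr]
            if marker in value]


# Made-up sample entry, for illustration only.
sample = {'cn': [u'J\xbfrg'], 'o': [u'Example']}
suspects = find_suspect_values(sample)
print(suspects)
```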

Thanks for everything,
jlc