[Python-3000] BOM handling

Wed Sep 13 22:33:22 CEST 2006

On Wednesday, 13 September 2006 at 09:41 -0700, Josiah Carlson wrote:
> And is generally ignored, as per unicode spec; it's a "zero width
> non-breaking space" - an invisible character with no effect on wrapping
> or otherwise.

Well, it would be better if Py3K (with all strings being Unicode) made things
easy for the programmer and abstracted away those "invisible characters
with no textual meaning". Currently that is not the case:

>>> import codecs
>>> a = "hello".decode("utf-8")
>>> b = (codecs.BOM_UTF8 + "hello").decode("utf-8")
>>> len(a)
5
>>> len(b)
6
>>> a == b
False
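
For comparison, here is a minimal sketch in Python 3 syntax (not the 2.4 session above). It assumes the standard-library "utf-8-sig" codec is available; that codec consumes a leading BOM on decode, whereas plain "utf-8" keeps it as U+FEFF. The variable names are only illustrative.

```python
import codecs

# Bytes as they might come from a file saved "with BOM".
raw = codecs.BOM_UTF8 + "hello".encode("utf-8")

kept = raw.decode("utf-8")          # BOM survives as a leading U+FEFF
stripped = raw.decode("utf-8-sig")  # BOM is consumed by the codec

print(len(kept), len(stripped))
print(stripped == "hello")
```

This only moves the decision into the choice of codec; the programmer still has to know in advance that the input may carry a BOM.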

>>> a = "hello".encode("utf-16le").decode("utf-16le")
>>> b = (codecs.BOM_UTF16_LE + "hello".encode("utf-16le")).decode("utf-16le")
>>> len(a)
5
>>> len(b)
6
>>> a == b
False
>>> a
u'hello'
>>> b
u'\ufeffhello'
>>> print a
hello
>>> print b
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/encodings/iso8859_15.py", line 18, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>
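
The UTF-16 case can be sketched the same way, again in Python 3 syntax and under the assumption that the generic "utf-16" codec is acceptable for the input: it reads the BOM to pick the byte order and then drops it, while "utf-16-le" treats the BOM as an ordinary character. strip_bom is a hypothetical helper written for this example, not an existing library function.

```python
import codecs


def strip_bom(text):
    """Hypothetical helper: drop one leading U+FEFF, if present."""
    return text[1:] if text.startswith("\ufeff") else text


raw = codecs.BOM_UTF16_LE + "hello".encode("utf-16-le")

explicit = raw.decode("utf-16-le")  # byte order given, BOM kept as U+FEFF
generic = raw.decode("utf-16")      # BOM selects byte order, then is dropped

print(repr(explicit), repr(generic))
print(strip_bom(explicit) == generic)
```

Stripping after the fact, as strip_bom does, is the workaround when the encoding with explicit byte order has already been applied.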

Regards

Antoine.