[Python-3000] BOM handling

Wed Sep 13 22:45:27 CEST 2006

Antoine Pitrou wrote:
> Le mercredi 13 septembre 2006 à 09:41 -0700, Josiah Carlson a écrit :
>> And is generally ignored, as per unicode spec; it's a "zero width
>> non-breaking space" - an invisible character with no effect on wrapping
>> or otherwise.
> 
> Well it would be better if Py3K (with all strings unicode) makes things
> easy for the programmer and abstracts away those "invisible characters
> with no textual meaning". Currently it's not the case:
> 
>>>> a = "hello".decode("utf-8")
>>>> b = (codecs.BOM_UTF8 + "hello").decode("utf-8")
>>>> len(a)
> 5
>>>> len(b)
> 6
>>>> a == b
> False

This behavior is questionable...

>>>> a = "hello".encode("utf-16le").decode("utf-16le")
>>>> b = (codecs.BOM_UTF16_LE + "hello".encode("utf-16le")).decode("utf-16le")
>>>> len(a)
> 5
>>>> len(b)
> 6

... while this is IMHO not. UTF-16LE does not use a BOM, as the byte order is already
specified by the encoding name, so a leading U+FEFF is decoded as an ordinary
character. The correct example is

b = (codecs.BOM_UTF16_LE + "hello".encode("utf-16le")).decode("utf-16")

b then equals u"hello", as it should.
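
For reference, here is a rough sketch of the same comparison in Python 3 syntax, where text and bytes are separate types. This is an assumption on my part, since the examples above use Python 2 string methods. The variable names are made up for illustration, and the "utf-8-sig" codec is not mentioned above; it is shown only as one possible way to get BOM stripping for UTF-8.

```python
import codecs

# Encoded data without and with a leading little-endian BOM.
plain = "hello".encode("utf-16le")
with_bom = codecs.BOM_UTF16_LE + plain

# "utf-16le" fixes the byte order up front, so a leading U+FEFF
# is decoded as an ordinary character and stays in the result.
kept = with_bom.decode("utf-16le")

# "utf-16" reads the BOM to pick the byte order and then drops it.
stripped = with_bom.decode("utf-16")

# The UTF-8 case from the first example: plain "utf-8" keeps the BOM
# as a character, while "utf-8-sig" removes an optional leading BOM.
utf8_kept = (codecs.BOM_UTF8 + b"hello").decode("utf-8")
utf8_stripped = (codecs.BOM_UTF8 + b"hello").decode("utf-8-sig")

print(len(kept), len(stripped), len(utf8_kept), len(utf8_stripped))
```

The point of the sketch is only that the codec name decides whether the BOM is treated as a signature or as text.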

"hello".encode("utf-16") prepends a BOM itself.

Georg