[Python-3000] BOM handling
Georg Brandl
g.brandl at gmx.net
Wed Sep 13 22:45:27 CEST 2006
Antoine Pitrou wrote:
> On Wednesday 13 September 2006 at 09:41 -0700, Josiah Carlson wrote:
>> And is generally ignored, as per unicode spec; it's a "zero width
>> non-breaking space" - an invisible character with no effect on wrapping
>> or otherwise.
>
> Well it would be better if Py3K (with all strings unicode) makes things
> easy for the programmer and abstracts away those "invisible characters
> with no textual meaning". Currently it's not the case:
>
>>>> a = "hello".decode("utf-8")
>>>> b = (codecs.BOM_UTF8 + "hello").decode("utf-8")
>>>> len(a)
> 5
>>>> len(b)
> 6
>>>> a == b
> False
This behavior is questionable...
>>>> a = "hello".encode("utf-16le").decode("utf-16le")
>>>> b = (codecs.BOM_UTF16_LE + "hello".encode("utf-16le")).decode("utf-16le")
>>>> len(a)
> 5
>>>> len(b)
> 6
... while this, IMHO, is not. UTF-16LE does not have a BOM, as the byte order is
already specified by the encoding name, so a leading U+FEFF is decoded as an
ordinary character. The correct example is
b = (codecs.BOM_UTF16_LE + "hello".encode("utf-16le")).decode("utf-16")
b then equals u"hello", as it should.
Note that "hello".encode("utf-16") prepends a BOM itself.
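
For reference, here is a minimal sketch of the same comparison. It assumes
Python 3 syntax (str is Unicode, encode() returns bytes), which is not what the
quoted session above uses. The variable names are only illustrative.

```python
import codecs

# Little-endian UTF-16 data without a BOM (2 bytes per character here).
payload = "hello".encode("utf-16-le")
# The same data with a little-endian BOM prepended by hand.
with_bom = codecs.BOM_UTF16_LE + payload

# "utf-16-le" already fixes the byte order, so the leading U+FEFF
# is decoded as an ordinary character and stays in the result.
kept = with_bom.decode("utf-16-le")

# "utf-16" uses the BOM to pick the byte order and then discards it.
stripped = with_bom.decode("utf-16")

# Encoding with "utf-16" writes a BOM itself (in native byte order).
encoded = "hello".encode("utf-16")
starts_with_bom = encoded[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

print(len(kept), repr(stripped), starts_with_bom)
```

So the extra character only appears when the endian-specific codec is asked
to decode data that carries a BOM it has no reason to expect.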
Georg