[Python-3000] BOM handling

Thu Sep 14 09:12:21 CEST 2006

Josiah Carlson wrote:
> Antoine Pitrou <solipsis at pitrou.net> wrote:
>>
>> On Wednesday 13 September 2006 at 09:41 -0700, Josiah Carlson wrote:
>>> And is generally ignored, as per unicode spec; it's a "zero width
>>> non-breaking space" - an invisible character with no effect on wrapping
>>> or otherwise.
>> Well it would be better if Py3K (with all strings unicode) makes things
>> easy for the programmer and abstracts away those "invisible characters
>> with no textual meaning". Currently it's not the case:
> 
>> >>> import codecs
>> >>> a = "hello".decode("utf-8")
>> >>> b = (codecs.BOM_UTF8 + "hello").decode("utf-8")
>> >>> len(a)
>> 5
>> >>> len(b)
>> 6
>> >>> a == b
>> False
> 
> I had also had this particular discussion with another individual
> previously (but I can't seem to find it in my archive), and one point
> brought up was that apparently Python 2.5 was supposed to have a variant
> codec for utf-8 that automatically stripped at most one \ufeff character
> from the beginning of decoded output and added it during encoding,
> similar to how the generic 'utf-16' and 'utf-32' codecs add and strip:
> 
> >>> u'hello'.encode('utf-16')
> '\xff\xfeh\x00e\x00l\x00l\x00o\x00'
> >>> len(u'hello'.encode('utf-16').decode('utf-16'))
> 5
> 
> I'm unable to find that particular utf-8 codec in the version of Python
> 2.5 I have installed, but I may not be looking in the right places, or
> spelling it the right way.

It's called "utf-8-sig".
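
A minimal sketch of the difference between the two spellings. It is written
with Python 3 syntax (an assumption; the quoted session is Python 2), and the
variable names are only illustrative:

```python
import codecs

# UTF-8 encoded bytes with a leading BOM (EF BB BF).
raw = codecs.BOM_UTF8 + b"hello"

# Plain 'utf-8' keeps the BOM as a U+FEFF character in the result.
plain = raw.decode("utf-8")

# 'utf-8-sig' strips one leading BOM on decoding, if there is one.
stripped = raw.decode("utf-8-sig")

# 'utf-8-sig' prepends the BOM on encoding.
encoded = "hello".encode("utf-8-sig")

# Only the first BOM is removed; a second one stays in the text.
doubled = (codecs.BOM_UTF8 * 2 + b"hello").decode("utf-8-sig")

print(len(plain), len(stripped), len(doubled))  # prints: 6 5 6
print(encoded == raw)                           # prints: True
```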

> In any case, I believe that the above behavior is correct for the
> context.  Why?  Because utf-8 has no endianness, its 'generic' decoding
> spelling of 'utf-8' is analogous to all three 'utf-16', 'utf-16-be', and
> 'utf-16-le' decoding spellings; two of which don't strip.
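
The stripping behaviour of the three UTF-16 spellings can be checked directly.
A short sketch, assuming Python 3 syntax rather than the Python 2 of the
quoted session; the variable names are only illustrative:

```python
import codecs

# Little-endian and big-endian UTF-16 data, each with its BOM in front.
data_le = codecs.BOM_UTF16_LE + "hello".encode("utf-16-le")
data_be = codecs.BOM_UTF16_BE + "hello".encode("utf-16-be")

# The generic 'utf-16' codec reads the BOM to pick the byte order
# and removes it from the decoded text.
generic_le = data_le.decode("utf-16")
generic_be = data_be.decode("utf-16")

# The endian-specific codecs keep the BOM as a U+FEFF character.
explicit_le = data_le.decode("utf-16-le")
explicit_be = data_be.decode("utf-16-be")

print(len(generic_le), len(generic_be))    # prints: 5 5
print(len(explicit_le), len(explicit_be))  # prints: 6 6
```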

Servus,
    Walter