[Python-Dev] Unicode byte order mark decoding

Fri Apr 1 22:19:40 CEST 2005

Evan Jones wrote:
> I recently rediscovered this strange behaviour in Python's Unicode
> handling. I *think* it is a bug, but before I go and try to hack
> together a patch, I figure I should run it by the experts here on
> Python-Dev. If you understand Unicode, please let me know if there are
> problems with making these minor changes.
> 
> 
>>>> import codecs
>>>> codecs.BOM_UTF8.decode( "utf8" )
> u'\ufeff'
>>>> codecs.BOM_UTF16.decode( "utf16" )
> u''
> 
> Why does the UTF-16 decoder discard the BOM, while the UTF-8 decoder
> turns it into a character? 

The BOM (byte order mark) was a non-standard Microsoft invention
to detect Unicode text data as such (MS always uses UTF-16-LE for
Unicode text files).

It is not needed for UTF-8 because that format doesn't depend on
byte order, and a BOM character at the beginning of a stream is
a legitimate ZWNBSP (zero width no-break space) code point.

The "utf-16" codec detects and removes the mark, while the
two others, "utf-16-le" (little-endian byte order) and "utf-16-be"
(big-endian byte order), don't.
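
To make the difference concrete, here is a small sketch. It is written in
Python 3 syntax (an assumption on my part; the quoted session is Python 2),
uses only the codecs module, and the variable names are illustrative only:

```python
import codecs

# "utf-16" consumes a leading BOM and uses it to choose the byte order.
via_utf16_le_bom = codecs.BOM_UTF16_LE.decode("utf-16")
via_utf16_be_bom = (codecs.BOM_UTF16_BE + b"\x00A").decode("utf-16")

# The endian-specific codecs keep the BOM as an ordinary U+FEFF character.
via_utf16_le = codecs.BOM_UTF16_LE.decode("utf-16-le")
via_utf16_be = codecs.BOM_UTF16_BE.decode("utf-16-be")

# "utf-8" keeps it as U+FEFF as well.
via_utf8 = codecs.BOM_UTF8.decode("utf-8")

for name, value in [
    ("utf-16 (LE BOM only)", via_utf16_le_bom),
    ("utf-16 (BE BOM + 'A')", via_utf16_be_bom),
    ("utf-16-le", via_utf16_le),
    ("utf-16-be", via_utf16_be),
    ("utf-8", via_utf8),
]:
    # ascii() escapes the invisible U+FEFF so it shows up in the output.
    print(name, ascii(value))
```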

> The UTF-16 decoder contains logic to
> correctly handle the BOM. It even handles byte swapping, if necessary. I
> propose that the UTF-8 decoder should have the same logic: it should
> remove the BOM if it is detected at the beginning of a string. 

-1; there's no standard for UTF-8 BOMs - adding it to the
codecs module was probably a mistake to begin with. You usually
only get UTF-8 files with a BOM as the result of recoding
UTF-16 files into UTF-8.

> This will
> remove a bit of manual work for Python programs that deal with UTF-8
> files created on Windows, which frequently have the BOM at the
> beginning. The Unicode standard is unclear about how it should be
> handled (version 4, section 15.9):
> 
>> Although there are never any questions of byte order with UTF-8 text,
>> this sequence can serve as signature for UTF-8 encoded text where the
>> character set is unmarked. [...] Systems that use the byte order mark
>> must recognize when an initial U+FEFF signals the byte order. In those
>> cases, it is not part of the textual content and should be removed
>> before processing, because otherwise it may be mistaken for a
>> legitimate zero width no-break space.
> 
> 
> At the very least, it would be nice to add a note about this to the
> documentation, and possibly add this example function that implements
> the "UTF-8 or ASCII?" logic:
> 
> import codecs
>
> def autodecode( s ):
>     if s.startswith( codecs.BOM_UTF8 ):
>         # The byte string s is UTF-8; decode it and drop the leading BOM
>         out = s.decode( "utf8" )
>         return out[1:]
>     else:
>         return s.decode( "ascii" )

Well, I'd say that's a very English way of dealing with encoded
text ;-)

BTW, how do you know that s came from the start of a file
and not from slicing some already loaded file somewhere
in the middle?
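
One way around that question is to make the caller say where the bytes came
from. A rough sketch, in Python 3 syntax; decode_utf8 and its
at_start_of_stream flag are hypothetical names, not anything in the codecs
module:

```python
import codecs

def decode_utf8(data, at_start_of_stream):
    """Decode UTF-8 bytes, dropping a leading BOM only at the stream start.

    Hypothetical helper: the caller states whether `data` begins the
    stream, because the bytes alone cannot tell a signature from a
    legitimate ZWNBSP in the middle of the text.
    """
    if at_start_of_stream and data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")

sample = codecs.BOM_UTF8 + "h\u00e9llo".encode("utf-8")

# First chunk of a file: the signature is stripped.
print(ascii(decode_utf8(sample, True)))
# Same bytes taken from the middle of a stream: U+FEFF is kept.
print(ascii(decode_utf8(sample, False)))
```

Unlike the autodecode example, this does not fall back to ASCII, so
non-English text without a BOM still decodes.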

> As a second issue, the UTF-16LE and UTF-16BE decoders almost do the
> right thing: They turn the BOM into a character, just like the Unicode
> specification says they should.
> 
>>>> codecs.BOM_UTF16_LE.decode( "utf-16le" )
> u'\ufeff'
>>>> codecs.BOM_UTF16_BE.decode( "utf-16be" )
> u'\ufeff'
> 
> However, they also *incorrectly* handle the reversed byte order mark:
> 
>>>> codecs.BOM_UTF16_BE.decode( "utf-16le" )
> u'\ufffe'
> 
> This is *not* a valid Unicode character. The Unicode specification
> (version 4, section 15.8) says the following about non-characters:
> 
>> Applications are free to use any of these noncharacter code points
>> internally but should never attempt to exchange them. If a
>> noncharacter is received in open interchange, an application is not
>> required to interpret it in any way. It is good practice, however, to
>> recognize it as a noncharacter and to take appropriate action, such as
>> removing it from the text. Note that Unicode conformance freely allows
>> the removal of these characters. (See C10 in Section 3.2, Conformance
>> Requirements.)
> 
> 
> My interpretation of the specification means that Python should silently
> remove the character, resulting in a zero length Unicode string.
> Similarly, both of the following lines should also result in a zero
> length Unicode string:
> 
>>>> '\xff\xfe\xfe\xff'.decode( "utf16" )
> u'\ufffe'
>>>> '\xff\xfe\xff\xff'.decode( "utf16" )
> u'\uffff'

Hmm, wouldn't it be better to raise an error? After all,
a reversed BOM in the stream looks a lot like you're
trying to decode a UTF-16 stream assuming the wrong
byte order.

Other than that: +1 on fixing this case.
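
For illustration only, the error-raising variant could look roughly like
this. It is a sketch in Python 3 syntax; decode_utf16_checked is a
hypothetical wrapper, not a proposal for the actual codec implementation,
and it only checks the first code unit:

```python
import codecs

def decode_utf16_checked(data, encoding):
    """Decode with "utf-16-le" or "utf-16-be"; reject a reversed BOM.

    Hypothetical helper: a leading U+FFFE after decoding means the input
    started with a BOM for the opposite byte order.
    """
    text = data.decode(encoding)
    if text.startswith("\ufffe"):
        raise UnicodeDecodeError(
            encoding, data, 0, 2,
            "reversed byte order mark; wrong byte order assumed?")
    return text

# Matching byte order: the BOM decodes to U+FEFF and is left alone.
ok = decode_utf16_checked(codecs.BOM_UTF16_LE + "A".encode("utf-16-le"),
                          "utf-16-le")
print(ascii(ok))

# Wrong byte order: the reversed BOM is reported instead of returned.
try:
    decode_utf16_checked(codecs.BOM_UTF16_BE, "utf-16-le")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```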

-- 
Marc-Andre Lemburg
eGenix.com
