Re: [Python-Dev] Unicode byte order mark decoding

1 Apr 2005

      Evan Jones wrote:
...
I recently rediscovered this strange behaviour in Python's Unicode
handling. I *think* it is a bug, but before I go and try to hack
together a patch, I figure I should run it by the experts here on
Python-Dev. If you understand Unicode, please let me know if there are
problems with making these minor changes.
>>> import codecs
>>> codecs.BOM_UTF8.decode( "utf8" )
u'\ufeff'
>>> codecs.BOM_UTF16.decode( "utf16" )
u''
Why does the UTF-16 decoder discard the BOM, while the UTF-8 decoder
turns it into a character?
The BOM (byte order mark) was a non-standard Microsoft invention
to detect Unicode text data as such (MS always uses UTF-16-LE for
Unicode text files).

It is not needed for UTF-8 because that format doesn't rely on
the byte order, and the BOM character at the beginning of a stream is
a legitimate ZWNBSP (zero width no-break space) code point.

The "utf-16" codec detects and removes the mark, while the
two others "utf-16-le" (little endian byte order) and "utf-16-be"
(big endian byte order) don't.
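
The difference between the three codecs can be checked directly. The following is a minimal sketch in Python 3 syntax (the sessions quoted in this thread are Python 2); the sample text "hi" and the variable names are illustrative assumptions.

```python
import codecs

# Little-endian UTF-16 bytes for "hi", preceded by the little-endian BOM.
data = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# The generic codec reads the BOM to pick the byte order, then drops it.
generic = data.decode("utf-16")

# The byte-order-specific codec keeps the mark as the character U+FEFF.
explicit = data.decode("utf-16-le")

print(repr(generic))
print(repr(explicit))
```

With the generic codec the mark is consumed as a signature; with the explicit codec it stays in the decoded text as an ordinary code point.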
...
The UTF-16 decoder contains logic to
correctly handle the BOM. It even handles byte swapping, if necessary. I
propose that the UTF-8 decoder should have the same logic: it should
remove the BOM if it is detected at the beginning of a string.
-1; there's no standard for UTF-8 BOMs - adding it to the
codecs module was probably a mistake to begin with. You usually
only get UTF-8 files with BOM marks as the result of recoding
UTF-16 files into UTF-8.
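
For comparison, Python releases from 2.5 onwards ship a separate "utf-8-sig" codec that strips a leading UTF-8 signature while leaving the plain "utf-8" codec unchanged. A minimal sketch in Python 3 syntax; the sample bytes and variable names are illustrative assumptions.

```python
import codecs

# UTF-8 bytes for "café" with the three-byte signature in front.
raw = codecs.BOM_UTF8 + "caf\u00e9".encode("utf-8")

# The plain codec keeps the signature as the character U+FEFF.
plain = raw.decode("utf-8")

# The signature-aware codec removes it when it is the first thing in the input.
stripped = raw.decode("utf-8-sig")

# Input without a signature decodes the same way under both codecs.
no_bom = "caf\u00e9".encode("utf-8").decode("utf-8-sig")
```

Keeping the two behaviours in separate codecs means existing callers of "utf-8" see no change, and only code that opts in has the mark removed.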
...
This will
remove a bit of manual work for Python programs that deal with UTF-8
files created on Windows, which frequently have the BOM at the
beginning. The Unicode standard is unclear about how it should be
handled (version 4, section 15.9):
...
Although there are never any questions of byte order with UTF-8 text,
this sequence can serve as signature for UTF-8 encoded text where the
character set is unmarked. [...] Systems that use the byte order mark
must recognize when an initial U+FEFF signals the byte order. In those
cases, it is not part of the textual content and should be removed
before processing, because otherwise it may be mistaken for a
legitimate zero width no-break space.
At the very least, it would be nice to add a note about this to the
documentation, and possibly add this example function that implements
the "UTF-8 or ASCII?" logic:
import codecs

def autodecode( s ):
    if s.startswith( codecs.BOM_UTF8 ):
        # The byte string s is UTF-8; decode it and drop the leading BOM
        out = s.decode( "utf8" )
        return out[1:]
    else:
        # No signature found: assume plain ASCII
        return s.decode( "ascii" )
Well, I'd say that's a very English way of dealing with encoded
text ;-)

BTW, how do you know that s came from the start of a file
and not from slicing some already loaded file somewhere
in the middle ?
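
Both objections can be addressed in a helper that lets the caller say whether the bytes begin a stream and which encoding to fall back on. This is only a sketch in Python 3 syntax: the function name, the `at_start` flag and the `default` parameter are hypothetical, and only UTF-8 and UTF-16 signatures are checked (UTF-32 is deliberately left out).

```python
import codecs

# (signature bytes, codec name) pairs; only UTF-8 and UTF-16 are covered here.
_SIGNATURES = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

def decode_from_start(data, default="utf-8", at_start=True):
    """Decode bytes, honouring a BOM only when data begins the stream."""
    if at_start:
        for bom, encoding in _SIGNATURES:
            if data.startswith(bom):
                # Skip the signature bytes, then decode the remainder.
                return data[len(bom):].decode(encoding)
    # No signature (or a mid-stream slice): use the caller's default encoding.
    return data.decode(default)
```

A caller that slices a buffer somewhere in the middle would pass `at_start=False`, so a U+FEFF found there is kept as a zero width no-break space rather than being treated as a signature.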
...
As a second issue, the UTF-16LE and UTF-16BE decoders almost do the
right thing: They turn the BOM into a character, just like the Unicode
specification says they should.
>>> codecs.BOM_UTF16_LE.decode( "utf-16le" )
u'\ufeff'
>>> codecs.BOM_UTF16_BE.decode( "utf-16be" )
u'\ufeff'
However, they also *incorrectly* handle the reversed byte order mark:
>>> codecs.BOM_UTF16_BE.decode( "utf-16le" )
u'\ufffe'
This is *not* a valid Unicode character. The Unicode specification
(version 4, section 15.8) says the following about non-characters:
...
Applications are free to use any of these noncharacter code points
internally but should never attempt to exchange them. If a
noncharacter is received in open interchange, an application is not
required to interpret it in any way. It is good practice, however, to
recognize it as a noncharacter and to take appropriate action, such as
removing it from the text. Note that Unicode conformance freely allows
the removal of these characters. (See C10 in Section 3.2, Conformance
Requirements.)
My interpretation of the specification means that Python should silently
remove the character, resulting in a zero length Unicode string.
Similarly, both of the following lines should also result in a zero
length Unicode string:
>>> '\xff\xfe\xfe\xff'.decode( "utf16" )
u'\ufffe'
>>> '\xff\xfe\xff\xff'.decode( "utf16" )
u'\uffff'
Hmm, wouldn't it be better to raise an error ? After all,
a reversed BOM mark in the stream looks a lot like you're
trying to decode a UTF-16 stream assuming the wrong
byte order ?!

Other than that: +1 on fixing this case.
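
The "raise an error" variant could look roughly like the wrapper below. It is a sketch in Python 3 syntax, not a patch to the codecs themselves: the function name and the choice of ValueError are assumptions made for illustration.

```python
# U+FEFF read with the wrong byte order shows up as the noncharacter U+FFFE.
REVERSED_BOM = "\ufffe"

def decode_utf16_checked(data, encoding):
    """Decode with 'utf-16-le' or 'utf-16-be'; reject a byte-swapped BOM."""
    text = data.decode(encoding)
    if text.startswith(REVERSED_BOM):
        # A leading U+FFFE almost certainly means the caller picked the
        # wrong byte order, so fail loudly instead of returning bad text.
        raise ValueError(
            "reversed BOM at start of input; wrong byte order for %s?" % encoding
        )
    return text
```

Raising here surfaces the likely byte-order mistake immediately, whereas silently dropping the character would hide it and leave the rest of the text decoded with swapped bytes.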

-- 
Marc-Andre Lemburg
eGenix.com

