Evan Jones wrote:
I recently rediscovered this strange behaviour in Python's Unicode handling. I *think* it is a bug, but before I go and try to hack together a patch, I figure I should run it by the experts here on Python-Dev. If you understand Unicode, please let me know if there are problems with making these minor changes.
import codecs
codecs.BOM_UTF8.decode( "utf8" )
codecs.BOM_UTF16.decode( "utf16" )
Why does the UTF-16 decoder discard the BOM, while the UTF-8 decoder turns it into a character?
The BOM (byte order mark) was a non-standard Microsoft invention to detect Unicode text data as such (MS always uses UTF-16-LE for Unicode text files).
It is not needed for UTF-8 because that format doesn't rely on byte order, and the BOM character at the beginning of a stream is a legitimate ZWNBSP (zero width no-break space) code point.
The "utf-16" codec detects and removes the mark, while the two others "utf-16-le" (little endian byte order) and "utf-16-be" (big endian byte order) don't.
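The difference between these codecs can be checked directly. This is a minimal sketch, assuming Python 3 (where the BOM constants are bytes objects and decoding yields str); the variable names are only for illustration.

```python
import codecs

# "utf-16" consumes the BOM to pick a byte order, so nothing is left over.
utf16_result = codecs.BOM_UTF16.decode("utf-16")            # ''

# The explicit-endianness codecs keep the BOM as the character U+FEFF.
utf16le_result = codecs.BOM_UTF16_LE.decode("utf-16-le")    # '\ufeff'
utf16be_result = codecs.BOM_UTF16_BE.decode("utf-16-be")    # '\ufeff'

# The plain "utf-8" codec also keeps it as U+FEFF.
utf8_result = codecs.BOM_UTF8.decode("utf-8")               # '\ufeff'
```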
The UTF-16 decoder contains logic to correctly handle the BOM. It even handles byte swapping, if necessary. I propose that the UTF-8 decoder should have the same logic: it should remove the BOM if it is detected at the beginning of a string.
-1; there's no standard for UTF-8 BOMs - adding it to the codecs module was probably a mistake to begin with. You usually only get UTF-8 files with a BOM as the result of recoding UTF-16 files into UTF-8.
This will remove a bit of manual work for Python programs that deal with UTF-8 files created on Windows, which frequently have the BOM at the beginning. The Unicode standard is unclear about how it should be handled (version 4, section 15.9):
Although there are never any questions of byte order with UTF-8 text, this sequence can serve as signature for UTF-8 encoded text where the character set is unmarked. [...] Systems that use the byte order mark must recognize when an initial U+FEFF signals the byte order. In those cases, it is not part of the textual content and should be removed before processing, because otherwise it may be mistaken for a legitimate zero width no-break space.
At the very least, it would be nice to add a note about this to the documentation, and possibly add this example function that implements the "UTF-8 or ASCII?" logic:
def autodecode( s ):
    if s.startswith( codecs.BOM_UTF8 ):
        # The byte string s is UTF-8
        out = s.decode( "utf8" )
        # Drop the leading BOM character
        return out[1:]
    else:
        return s.decode( "ascii" )
Well, I'd say that's a very English way of dealing with encoded text ;-)
BTW, how do you know that s came from the start of a file and not from slicing some already loaded file somewhere in the middle ?
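One way to address both objections, sketched under the assumption of Python 3: always decode as UTF-8 rather than falling back to ASCII, and let the caller say whether the bytes come from the start of a stream. The function name `decode_utf8_text` and its `at_start` parameter are hypothetical, chosen only for illustration.

```python
import codecs


def decode_utf8_text(data: bytes, at_start: bool = True) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM only at the start of a stream."""
    if at_start and data.startswith(codecs.BOM_UTF8):
        # Skip the three signature bytes before decoding.
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")
```

With `at_start=False`, a U+FEFF at the front of a slice taken from the middle of a file is kept as a zero width no-break space. Python 3's standard library also provides a "utf-8-sig" codec that skips a leading signature when decoding, which may be enough when the input is known to be a whole file.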
As a second issue, the UTF-16LE and UTF-16BE decoders almost do the right thing: They turn the BOM into a character, just like the Unicode specification says they should.
codecs.BOM_UTF16_LE.decode( "utf-16le" )
codecs.BOM_UTF16_BE.decode( "utf-16be" )
However, they also *incorrectly* handle the reversed byte order mark:
codecs.BOM_UTF16_BE.decode( "utf-16le" )
This is *not* a valid Unicode character. The Unicode specification (version 4, section 15.8) says the following about non-characters:
Applications are free to use any of these noncharacter code points internally but should never attempt to exchange them. If a noncharacter is received in open interchange, an application is not required to interpret it in any way. It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as removing it from the text. Note that Unicode conformance freely allows the removal of these characters. (See C10 in Section 3.2, Conformance Requirements.)
My interpretation of the specification means that Python should silently remove the character, resulting in a zero length Unicode string. Similarly, both of the following lines should also result in a zero length Unicode string:
'\xff\xfe\xfe\xff'.decode( "utf16" )
'\xff\xfe\xff\xff'.decode( "utf16" )
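For reference, here is what these cases produce when run as a small sketch under Python 3 (an assumption; the byte strings need a `b` prefix there). The decoders pass the noncharacters through rather than removing them; the variable names are only for illustration.

```python
import codecs

# Big-endian BOM bytes read as little-endian give the noncharacter U+FFFE.
reversed_bom = codecs.BOM_UTF16_BE.decode("utf-16-le")      # '\ufffe'

# A little-endian BOM followed by a noncharacter: the "utf-16" codec
# strips only the leading BOM and keeps the code point that follows.
after_bom_fffe = b"\xff\xfe\xfe\xff".decode("utf-16")       # '\ufffe'
after_bom_ffff = b"\xff\xfe\xff\xff".decode("utf-16")       # '\uffff'
```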
Hmm, wouldn't it be better to raise an error ? After all, a reversed BOM mark in the stream looks a lot like you're trying to decode a UTF-16 stream assuming the wrong byte order ?!
Other than that: +1 on fixing this case.
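The "raise an error" alternative could look roughly like the following. It is only a sketch, assuming Python 3, and it wraps the existing decoder instead of changing it; the helper name `decode_utf16_checked` is hypothetical.

```python
import codecs


def decode_utf16_checked(data: bytes, encoding: str = "utf-16") -> str:
    """Decode UTF-16 data, rejecting U+FFFE (a byte-swapped BOM)."""
    text = data.decode(encoding)
    if "\ufffe" in text:
        # U+FFFE is a noncharacter; seeing it usually means the bytes
        # were read with the wrong byte order.
        raise UnicodeError("U+FFFE in decoded text: wrong byte order?")
    return text
```

Calling `decode_utf16_checked(codecs.BOM_UTF16_BE, "utf-16-le")` raises `UnicodeError`, while correctly ordered input decodes as usual.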