I recently rediscovered this strange behaviour in Python's Unicode handling. I *think* it is a bug, but before I go and try to hack together a patch, I figure I should run it by the experts here on Python-Dev. If you understand Unicode, please let me know if there are problems with making these minor changes.
>>> import codecs
>>> codecs.BOM_UTF8.decode( "utf8" )
u'\ufeff'
>>> codecs.BOM_UTF16.decode( "utf16" )
u''
Why does the UTF-16 decoder discard the BOM, while the UTF-8 decoder turns it into a character? The UTF-16 decoder contains logic to correctly handle the BOM. It even handles byte swapping, if necessary. I propose that the UTF-8 decoder should have the same logic: it should remove the BOM if it is detected at the beginning of a string. This will remove a bit of manual work for Python programs that deal with UTF-8 files created on Windows, which frequently have the BOM at the beginning. The Unicode standard is unclear about how it should be handled (version 4, section 15.9):
Although there are never any questions of byte order with UTF-8 text, this sequence can serve as signature for UTF-8 encoded text where the character set is unmarked. [...] Systems that use the byte order mark must recognize when an initial U+FEFF signals the byte order. In those cases, it is not part of the textual content and should be removed before processing, because otherwise it may be mistaken for a legitimate zero width no-break space.
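The stripping behaviour I am proposing can be sketched as a pure-Python wrapper (the function name is mine, for illustration; the real change would go in the C decoder):

```python
import codecs

def decode_utf8_stripping_bom(data):
    # Mirror what the "utf16" codec already does: if the byte string
    # starts with the UTF-8 signature EF BB BF, drop it before decoding
    # so it never shows up as a spurious U+FEFF character.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")
```

With this wrapper, decoding `codecs.BOM_UTF8 + "hello"` yields u'hello', matching the BOM handling of the utf16 codec.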
At the very least, it would be nice to add a note about this to the documentation, and possibly add this example function that implements the "UTF-8 or ASCII?" logic:

def autodecode( s ):
    if s.startswith( codecs.BOM_UTF8 ):
        # The byte string s is UTF-8
        out = s.decode( "utf8" )
        return out[1:]
    else:
        return s.decode( "ascii" )

As a second issue, the UTF-16LE and UTF-16BE decoders almost do the right thing: they turn the BOM into a character, just like the Unicode specification says they should.
>>> codecs.BOM_UTF16_LE.decode( "utf-16le" )
u'\ufeff'
>>> codecs.BOM_UTF16_BE.decode( "utf-16be" )
u'\ufeff'
However, they also *incorrectly* handle the reversed byte order mark:
>>> codecs.BOM_UTF16_BE.decode( "utf-16le" )
u'\ufffe'
This is *not* a valid Unicode character. The Unicode specification (version 4, section 15.8) says the following about noncharacters:
Applications are free to use any of these noncharacter code points internally but should never attempt to exchange them. If a noncharacter is received in open interchange, an application is not required to interpret it in any way. It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as removing it from the text. Note that Unicode conformance freely allows the removal of these characters. (See C10 in Section 3.2, Conformance Requirements.)
My interpretation of the specification is that Python should silently remove the character, resulting in a zero-length Unicode string. Similarly, both of the following lines should also result in a zero-length Unicode string:
>>> '\xff\xfe\xfe\xff'.decode( "utf16" )
u'\ufffe'
>>> '\xff\xfe\xff\xff'.decode( "utf16" )
u'\uffff'
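Under that interpretation, the removal could be sketched as a post-processing step (again, the helper name is hypothetical; the real fix belongs in the C decoder):

```python
NONCHARACTERS = (u"\ufffe", u"\uffff")

def decode_utf16_dropping_noncharacters(data):
    # Decode as UTF-16 (consuming the BOM), then remove the
    # noncharacters U+FFFE and U+FFFF, which conformance clause C10
    # permits an application to strip from received text.
    text = data.decode("utf-16")
    return u"".join(ch for ch in text if ch not in NONCHARACTERS)
```

Both byte strings above would then decode to the empty Unicode string.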
Thanks for your feedback, Evan Jones