[Python-Dev] Unicode byte order mark decoding

Evan Jones ejones at uwaterloo.ca
Fri Apr 1 21:36:07 CEST 2005


I recently rediscovered this strange behaviour in Python's Unicode 
handling. I *think* it is a bug, but before I go and try to hack 
together a patch, I figure I should run it by the experts here on 
Python-Dev. If you understand Unicode, please let me know if there are 
problems with making these minor changes.


 >>> import codecs
 >>> codecs.BOM_UTF8.decode( "utf8" )
u'\ufeff'
 >>> codecs.BOM_UTF16.decode( "utf16" )
u''

Why does the UTF-16 decoder discard the BOM, while the UTF-8 decoder 
turns it into a character? The UTF-16 decoder contains logic to 
correctly handle the BOM. It even handles byte swapping, if necessary. 
I propose that the UTF-8 decoder should have the same logic: it should 
remove the BOM if it is detected at the beginning of a string. This 
will remove a bit of manual work for Python programs that deal with 
UTF-8 files created on Windows, which frequently have the BOM at the 
beginning. The Unicode standard is unclear about how it should be 
handled (version 4, section 15.9):

> Although there are never any questions of byte order with UTF-8 text, 
> this sequence can serve as signature for UTF-8 encoded text where the 
> character set is unmarked. [...] Systems that use the byte order mark 
> must recognize when an initial U+FEFF signals the byte order. In those 
> cases, it is not part of the textual content and should be removed 
> before processing, because otherwise it may be mistaken for a 
> legitimate zero width no-break space.
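
To make the proposal concrete, here is a minimal sketch of the behaviour 
I have in mind, written as a standalone helper rather than as a patch to 
the decoder itself. The name decode_utf8_strip_bom is hypothetical and 
is not part of the codecs module. It removes at most one BOM, and only 
at the very start of the input:

```python
import codecs

def decode_utf8_strip_bom(data):
    """Decode UTF-8 bytes, dropping one leading BOM if present.

    Hypothetical helper for illustration; not part of the codecs module.
    """
    if data.startswith(codecs.BOM_UTF8):
        # Skip the three signature bytes before decoding
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf8")
```

A U+FEFF anywhere else in the input is left alone, since there it may 
be a legitimate zero width no-break space.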

At the very least, it would be nice to add a note about this to the 
documentation, and possibly add this example function that implements 
the "UTF-8 or ASCII?" logic:

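import codecs

def autodecode( s ):
	# A leading UTF-8 BOM marks the byte string s as UTF-8
	if s.startswith( codecs.BOM_UTF8 ):
		out = s.decode( "utf8" )
		# Drop the decoded BOM (U+FEFF) from the front
		return out[1:]
	else:
		# No signature: assume plain ASCII
		return s.decode( "ascii" )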


As a second issue, the UTF-16LE and UTF-16BE decoders almost do the 
right thing: They turn the BOM into a character, just like the Unicode 
specification says they should.

 >>> codecs.BOM_UTF16_LE.decode( "utf-16le" )
u'\ufeff'
 >>> codecs.BOM_UTF16_BE.decode( "utf-16be" )
u'\ufeff'

However, they also *incorrectly* handle the reversed byte order mark:

 >>> codecs.BOM_UTF16_BE.decode( "utf-16le" )
u'\ufffe'

This is *not* a valid Unicode character. The Unicode specification 
(version 4, section 15.8) says the following about non-characters:

> Applications are free to use any of these noncharacter code points 
> internally but should never attempt to exchange them. If a 
> noncharacter is received in open interchange, an application is not 
> required to interpret it in any way. It is good practice, however, to 
> recognize it as a noncharacter and to take appropriate action, such as 
> removing it from the text. Note that Unicode conformance freely allows 
> the removal of these characters. (See C10 in Section 3.2, Conformance 
> Requirements.)

By my reading of the specification, Python should silently remove the 
noncharacter, resulting in a zero length Unicode string. Similarly, both 
of the following lines should also result in a zero length Unicode 
string:

 >>> '\xff\xfe\xfe\xff'.decode( "utf16" )
u'\ufffe'
 >>> '\xff\xfe\xff\xff'.decode( "utf16" )
u'\uffff'
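
Until the decoders do this themselves, the removal can be done by hand 
after decoding. This is only a sketch of the idea: strip_noncharacters 
is a hypothetical name, and it handles just the two code points shown 
above (U+FFFE and U+FFFF), not the other noncharacters that the 
standard defines.

```python
def strip_noncharacters(text):
    """Remove U+FFFE and U+FFFF from an already decoded string.

    Hypothetical helper; it deliberately ignores every other
    noncharacter code point.
    """
    return text.replace(u"\ufffe", u"").replace(u"\uffff", u"")
```

Applied to the result of either decode call above, it gives the zero 
length string I would expect.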


Thanks for your feedback,

Evan Jones


