[Python-Dev] Unicode byte order mark decoding

Evan Jones ejones at uwaterloo.ca
Sat Apr 2 05:04:11 CEST 2005


On Apr 1, 2005, at 15:19, M.-A. Lemburg wrote:
> The BOM (byte order mark) was a non-standard Microsoft invention
> to detect Unicode text data as such (MS always uses UTF-16-LE for
> Unicode text files).

Well, its origins do not really matter, since at this point the BOM is 
firmly embedded in the Unicode standard. It seems to me that it is in 
everyone's best interest to support it.

> It is not needed for the UTF-8 because that format doesn't rely on
> the byte order and the BOM character at the beginning of a stream is
> a legitimate ZWNBSP (zero width non breakable space) code point.

You are correct: it is a legitimate character. However, its use as a 
ZWNBSP character has been deprecated:

> The overloading of semantics for this code point has caused problems 
> for programs and protocols. The new character U+2060 WORD JOINER has 
> the same semantics in all cases as U+FEFF, except that it cannot be 
> used as a signature. Implementers are strongly encouraged to use word 
> joiner in those circumstances whenever word joining semantics is 
> intended.

Also, the Unicode specification is ambiguous about what an 
implementation should do with a leading ZWNBSP in UTF-8 text. As I 
mentioned, if you look at the Unicode standard, version 4, section 
15.9, it says:

> 2. Unmarked Character Set. In some circumstances, the character set 
> information for a stream of coded characters (such as a file) is not 
> available. The only information available is that the stream contains 
> text, but the precise character set is not known.

This seems to indicate that it is permitted to strip the BOM from the 
beginning of UTF-8 text.

> -1; there's no standard for UTF-8 BOMs - adding it to the
> codecs module was probably a mistake to begin with. You usually
> only get UTF-8 files with BOM marks as the result of recoding
> UTF-16 files into UTF-8.

This is clearly incorrect. The UTF-8 BOM is specified in the Unicode 
standard, version 4, section 15.9:

> In UTF-8, the BOM corresponds to the byte sequence <EF BB BF>.

I regularly see files with UTF-8 BOMs produced by Windows applications 
when a text file is saved as UTF-8. I think that Notepad or WordPad 
does this, for example, and I believe UltraEdit does the same thing. I 
know that Scintilla definitely does.
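
For what it's worth, that byte sequence is already exposed as a 
constant in the codecs module, so checking for it is trivial:

>>> import codecs
>>> codecs.BOM_UTF8
'\xef\xbb\xbf'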

>> At the very least, it would be nice to add a note about this to the
>> documentation, and possibly add this example function that implements
>> the "UTF-8 or ASCII?" logic.
> Well, I'd say that's a very English way of dealing with encoded
> text ;-)

Please note that I am only saying that something like this may want to 
be considered for addition to the documentation, not to the Python 
standard library. This example function more closely replicates the 
logic that those Windows applications use when opening ".txt" files. 
It falls back to the default encoding if there is no BOM:

import codecs

def autodecode(s):
    if s.startswith(codecs.BOM_UTF8):
        # The byte string s is UTF-8; drop the decoded BOM character.
        out = s.decode("utf-8")
        return out[1:]
    else:
        # No BOM: fall back to the default encoding.
        return s.decode()
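
For example (assuming the default encoding is ASCII, as it is out of 
the box):

>>> autodecode(codecs.BOM_UTF8 + "hello")
u'hello'
>>> autodecode("hello")
u'hello'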

> BTW, how do you know that s came from the start of a file
> and not from slicing some already loaded file somewhere
> in the middle ?

Well, the same argument could be applied to the UTF-16 decoder: how 
does it know that the string came from the start of a file, and not 
from slicing some already loaded file? The standard states that:

> In the UTF-16 encoding scheme, U+FEFF at the very beginning of a file 
> or stream explicitly signals the byte order.

So it is perfectly permissible to perform this type of processing if 
you consider a string to be equivalent to a stream.
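
This is, in fact, exactly what the current UTF-16 codec already does: 
it consumes a leading BOM and uses it to select the byte order. A 
quick demonstration with hand-built byte strings:

>>> '\xff\xfea\x00'.decode('utf-16')   # little-endian BOM, then u'a'
u'a'
>>> '\xfe\xff\x00a'.decode('utf-16')   # big-endian BOM, then u'a'
u'a'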

>> My interpretation of the specification means that Python should 
>> silently
>> remove the character, resulting in a zero length Unicode string.
> Hmm, wouldn't it be better to raise an error ? After all,
> a reversed BOM mark in the stream looks a lot like you're
> trying to decode a UTF-16 stream assuming the wrong
> byte order ?!

Well, either one is possible; however, the Unicode standard suggests, 
but does not require, silently removing them:

> It is good practice, however, to recognize it as a noncharacter and to 
> take appropriate action, such as removing it from the text. Note that 
> Unicode conformance freely allows the removal of these characters.

I would prefer that str.decode() silently remove them, since I believe 
in "be strict in what you emit, but liberal in what you accept." This 
should apply only to str.decode(). Any other attempt to create 
non-characters, such as unichr(0xffff), *should* raise an exception, 
because clearly the programmer is making a mistake.
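
To illustrate the reversed-BOM case: decoding with an explicit but 
wrong byte order leaves the noncharacter U+FFFE at the front of the 
result, which is a strong hint that the wrong codec was chosen 
(hand-built example again):

>>> '\xff\xfea\x00'.decode('utf-16-be')   # LE data decoded as BE
u'\ufffe\u6100'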

> Other than that: +1 on fixing this case.

Cool!

Evan Jones


