[Python-Dev] Unicode byte order mark decoding
Stephen J. Turnbull
stephen at xemacs.org
Tue Apr 5 08:25:09 CEST 2005
>>>>> "MAL" == M <mal at egenix.com> writes:
MAL> The BOM (byte order mark) was a non-standard Microsoft
MAL> invention to detect Unicode text data as such (MS always uses
MAL> UTF-16-LE for Unicode text files).
The Japanese "memopado" (Notepad) uses UTF-8 signatures; it even adds
them to existing UTF-8 files lacking them.
MAL> -1; there's no standard for UTF-8 BOMs - adding it to the
MAL> codecs module was probably a mistake to begin with. You
MAL> usually only get UTF-8 files with BOM marks as the result of
MAL> recoding UTF-16 files into UTF-8.
There is a standard for UTF-8 _signatures_, however. I don't have the
most recent version of the ISO-10646 standard, but Amendment 2 (which
defined UTF-8 for ISO-10646) specifically added the UTF-8 signature to
Annex F of that standard. Evan quotes Version 4 of the Unicode
standard, which explicitly defines the UTF-8 signature.
So there is a standard for the UTF-8 signature, and I know of
applications which produce it. While I agree with you that Python's
codecs shouldn't produce it (by default), providing an option to strip
is a good idea.
However, this option should be part of the initialization of an I/O
stream which produces Unicode strings, _not_ an operation on arbitrary
internal strings (whether raw or Unicode).
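As a rough sketch of what "strip at stream initialization" can look like, the following uses the `utf-8-sig` codec, which entered the standard library in Python 2.5 and so postdates this message. It is written in Python 3 syntax. The helper name `read_text` and its `strip_signature` parameter are made up for illustration. The decision is taken once, when the stream is set up, and a U+FEFF later in the data is left alone.

```python
import io

# UTF-8 signature (EF BB BF) followed by text that contains a second
# U+FEFF in the middle of the data.
raw = b"\xef\xbb\xbfhello\xef\xbb\xbfworld"

def read_text(binary_stream, strip_signature=True):
    """Hypothetical helper: signature handling is chosen when the stream is opened."""
    encoding = "utf-8-sig" if strip_signature else "utf-8"
    return io.TextIOWrapper(binary_stream, encoding=encoding).read()

# Only the leading signature is dropped; the later U+FEFF survives.
stripped = read_text(io.BytesIO(raw))

# With the plain codec nothing is dropped at all.
kept = read_text(io.BytesIO(raw), strip_signature=False)
```

Nothing here operates on an already-decoded string; the only place the signature is considered is the start of the stream.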
MAL> BTW, how do you know that s came from the start of a file and
MAL> not from slicing some already loaded file somewhere in the
MAL> middle ?
The programmer or the application might, but Python's codecs don't.
The point is that this is also true of raw strings that happen to
contain UTF-16 or UTF-32 data. The UTF-16 ("auto-endian") codec
shouldn't strip leading BOMs either, unless it has been told it has
the beginning of the string.
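To illustrate the slicing problem, here is a small sketch in Python 3 syntax (the variable names are arbitrary). It shows the auto-endian codec consuming a U+FEFF that happens to sit at the start of a slice, and then an incremental decoder, which does know where the stream starts and so treats only the first FF FE as a BOM.

```python
import codecs

# "A", U+FEFF, "B" encoded as UTF-16-LE, with no leading BOM.
data = "A\ufeffB".encode("utf-16-le")

# A slice that begins in the middle of the data, right at the U+FEFF.
middle = data[2:]

# The auto-endian codec assumes it is at the start of a stream and
# consumes the leading FF FE as a byte order mark.
auto = middle.decode("utf-16")

# The explicit codec keeps every code point, including U+FEFF.
explicit = middle.decode("utf-16-le")

# An incremental decoder carries stream state: only the BOM at the
# true start of the stream is consumed.
decoder = codecs.getincrementaldecoder("utf-16")()
first = decoder.decode(b"\xff\xfeA\x00")   # BOM at stream start
second = decoder.decode(b"\xff\xfeB\x00")  # same bytes, mid-stream
```

The one-shot `decode` call has no way to know whether `middle` is a beginning or a fragment; the stateful decoder does, because the caller feeds it from the start.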
MAL> Evan Jones wrote:
>> This is *not* a valid Unicode character. The Unicode
>> specification (version 4, section 15.8) says the following
>> about non-characters:
>>> Applications are free to use any of these noncharacter code
>>> points internally but should never attempt to exchange
>>> them. If a noncharacter is received in open interchange, an
>>> application is not required to interpret it in any way. It is
>>> good practice, however, to recognize it as a noncharacter and
>>> to take appropriate action, such as removing it from the
>>> text. Note that Unicode conformance freely allows the removal
>>> of these characters. (See C10 in Section 3.2, Conformance
>> My interpretation of the specification means that Python should
The specification _permits_ silent removal; it does not recommend it.
>> silently remove the character, resulting in a zero length
>> Unicode string. Similarly, both of the following lines should
>> also result in a zero length Unicode string:
>>>> '\xff\xfe\xfe\xff'.decode( "utf16" )
>>>> '\xff\xfe\xff\xff'.decode( "utf16" )
I strongly disagree; these decisions should be left to a higher layer.
In the case of specified UTFs, the codecs should simply invert the UTF
to Python's internal encoding.
MAL> Hmm, wouldn't it be better to raise an error ? After all, a
MAL> reversed BOM mark in the stream looks a lot like you're
MAL> trying to decode a UTF-16 stream assuming the wrong byte
MAL> order ?!
+1 on (optionally) raising an error. -1 on removing it or anything
like that, unless under control of the application (i.e., the program
written in Python, not Python itself). It's far too easy for software
to generate broken Unicode streams, and the choice of how to deal
with those should be with the application, not with the implementation.
An egregious example was the Outlook Express distributed with
early Win2k betas, which produced MIME bodies with apparent
Content-Type: text/html; charset=utf-16, but the HTML tags and
newlines were 7-bit ASCII!
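One way to keep that choice in the application is a thin wrapper over the explicit codecs. This is only a sketch in Python 3 syntax: the function `decode_utf16`, its `byteorder` and `on_suspect` parameters, and the policy names are all hypothetical, not anything the codecs module provides. The codec call itself just inverts the UTF; raising, stripping, or keeping U+FFFE (a reversed BOM) and U+FFFF happens afterwards, under the caller's control.

```python
# Code points the application may want to treat specially:
# U+FFFE (what a byte-swapped BOM decodes to) and U+FFFF.
SUSPECTS = {"\ufffe", "\uffff"}

def decode_utf16(data, byteorder="little", on_suspect="error"):
    """Hypothetical application-level wrapper around the explicit UTF-16 codecs.

    on_suspect: "error" raises ValueError, "strip" removes the code
    points, "keep" returns the decoded text untouched.
    """
    codec = "utf-16-le" if byteorder == "little" else "utf-16-be"
    text = data.decode(codec)  # plain inversion of the UTF, nothing removed

    if on_suspect == "keep":
        return text
    if on_suspect == "strip":
        return "".join(ch for ch in text if ch not in SUSPECTS)
    if on_suspect == "error":
        for index, ch in enumerate(text):
            if ch in SUSPECTS:
                raise ValueError(
                    "U+%04X at position %d; wrong byte order?" % (ord(ch), index)
                )
        return text
    raise ValueError("unknown policy: %r" % (on_suspect,))

# FE FF read as little-endian is U+FFFE, i.e. a reversed BOM.
swapped = b"\xfe\xffA\x00"
kept = decode_utf16(swapped, on_suspect="keep")
cleaned = decode_utf16(swapped, on_suspect="strip")
```

Calling `decode_utf16(swapped)` with the default policy raises `ValueError`, which matches the "optionally raise an error" position; the other two policies are there only because the application asked for them.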
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.