[Python-Dev] Unicode byte order mark decoding
mal at egenix.com
Tue Apr 5 12:34:53 CEST 2005
Stephen J. Turnbull wrote:
>>>>>>"MAL" == M <mal at egenix.com> writes:
> MAL> The BOM (byte order mark) was a non-standard Microsoft
> MAL> invention to detect Unicode text data as such (MS always uses
> MAL> UTF-16-LE for Unicode text files).
> The Japanese "memopado" (Notepad) uses UTF-8 signatures; it even adds
> them to existing UTF-8 files lacking them.
Is that an MS application? AFAIK, Notepad, WordPad and MS Office
always use UTF-16-LE + BOM when saving text as "Unicode text".
> MAL> -1; there's no standard for UTF-8 BOMs - adding it to the
> MAL> codecs module was probably a mistake to begin with. You
> MAL> usually only get UTF-8 files with BOM marks as the result of
> MAL> recoding UTF-16 files into UTF-8.
> There is a standard for UTF-8 _signatures_, however. I don't have the
> most recent version of the ISO-10646 standard, but Amendment 2 (which
> defined UTF-8 for ISO-10646) specifically added the UTF-8 signature to
> Annex F of that standard. Evan quotes Version 4 of the Unicode
> standard, which explicitly defines the UTF-8 signature.
OK, as a signature the BOM does make some sense. Whether
stripping signatures from a document is a good idea
is a different matter, though.
Here's the Unicode Cons. FAQ on the subject:
They also explicitly warn about adding BOMs to UTF-8 data
since it can break applications and protocols that do not
expect such a signature.
> So there is a standard for the UTF-8 signature, and I know of
> applications which produce it. While I agree with you that Python's
> codecs shouldn't produce it (by default), providing an option to strip
> is a good idea.
> However, this option should be part of the initialization of an IO
> stream which produces Unicodes, _not_ an operation on arbitrary
> internal strings (whether raw or Unicode).
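A minimal sketch of that distinction, written for present-day Python 3 rather than the Python of this thread. It assumes the `utf-8-sig` codec that later Python versions ship; `open_text` and its `strip_signature` parameter are hypothetical names used only for illustration. The signature is dropped only when a stream is initialized that way, never when decoding an arbitrary in-memory string.

```python
import codecs
import io


def open_text(raw, strip_signature=True):
    """Hypothetical helper: wrap a binary stream as UTF-8 text.

    The decision to strip the signature is made once, when the stream
    is set up, because only then do we know we are at the start.
    """
    encoding = "utf-8-sig" if strip_signature else "utf-8"
    return io.TextIOWrapper(raw, encoding=encoding)


# Three signature bytes (EF BB BF) followed by ordinary UTF-8 data.
payload = codecs.BOM_UTF8 + "abc".encode("utf-8")

# Stream opened with stripping requested: the signature is consumed.
stripped = open_text(io.BytesIO(payload)).read()

# Stream opened without stripping: the signature survives as U+FEFF.
kept = open_text(io.BytesIO(payload), strip_signature=False).read()

# Plain decode of an arbitrary byte string: nothing is stripped.
in_memory = payload.decode("utf-8")
```

Here `stripped` holds only the text, while `kept` and `in_memory` both begin with U+FEFF, so a caller that sliced the bytes out of the middle of a file loses nothing.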
> MAL> BTW, how do you know that s came from the start of a file and
> MAL> not from slicing some already loaded file somewhere in the
> MAL> middle ?
> The programmer or the application might, but Python's codecs don't.
> The point is that this is also true of rawstrings that happen to
> contain UTF-16 or UTF-32 data. The UTF-16 ("auto-endian") codec
> shouldn't strip leading BOMs either, unless it has been told it has
> the beginning of the string.
The UTF-16 stream codecs implement this logic.
The UTF-16 encode and decode functions will, however, always strip
the BOM from the beginning of a string.
If the application doesn't want this stripping to happen,
it should use the UTF-16-LE or UTF-16-BE codec instead.
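To illustrate the difference, here is a small sketch in present-day Python 3 syntax (byte strings and incremental decoders look different from the Python of this thread, so treat the exact API as an assumption about a modern interpreter, not a description of the 2005 codecs):

```python
import codecs

# A little-endian BOM followed by "hi" encoded as UTF-16-LE.
data = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# The generic codec consumes the leading BOM to choose a byte order.
auto = data.decode("utf-16")

# The explicit codec does no BOM handling; U+FEFF stays in the result.
explicit = data.decode("utf-16-le")

# A stream-style (incremental) decoder treats only the very first
# bytes as a BOM. Feeding the same bytes again, now mid-stream,
# leaves U+FEFF in the output.
decoder = codecs.getincrementaldecoder("utf-16")()
first = decoder.decode(data)
second = decoder.decode(data)
```

After this runs, `auto` and `first` contain just the text, whereas `explicit` and `second` start with U+FEFF.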
> MAL> Evan Jones wrote:
> >> This is *not* a valid Unicode character. The Unicode
> >> specification (version 4, section 15.8) says the following
> >> about non-characters:
> >>> Applications are free to use any of these noncharacter code
> >>> points internally but should never attempt to exchange
> >>> them. If a noncharacter is received in open interchange, an
> >>> application is not required to interpret it in any way. It is
> >>> good practice, however, to recognize it as a noncharacter and
> >>> to take appropriate action, such as removing it from the
> >>> text. Note that Unicode conformance freely allows the removal
> >>> of these characters. (See C10 in Section 3.2, Conformance
> >>> Requirements.)
> >> My interpretation of the specification means that Python should
> The specification _permits_ silent removal; it does not recommend.
> >> silently remove the character, resulting in a zero length
> >> Unicode string. Similarly, both of the following lines should
> >> also result in a zero length Unicode string:
> >>>> '\xff\xfe\xfe\xff'.decode( "utf16" )
> > u'\ufffe'
> >>>> '\xff\xfe\xff\xff'.decode( "utf16" )
> > u'\uffff'
> I strongly disagree; these decisions should be left to a higher layer.
> In the case of specified UTFs, the codecs should simply invert the UTF
> to Python's internal encoding.
> MAL> Hmm, wouldn't it be better to raise an error ? After all, a
> MAL> reversed BOM mark in the stream looks a lot like you're
> MAL> trying to decode a UTF-16 stream assuming the wrong byte
> MAL> order ?!
> +1 on (optionally) raising an error.
The advantage of raising an error is that the application
can deal with the situation in whatever way seems fit (by
registering a special error handler or by simply using
"ignore" or "replace").
I agree that much of this lies outside the scope of codecs
and should be handled at an application or protocol level.
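A sketch of the error-handler mechanism, again in present-day Python 3 syntax. Note the assumptions: the codecs do not actually raise on a reversed BOM, so truncated UTF-16 input stands in as the error condition, and the handler name "mark-bad" and the function `mark_bad` are hypothetical.

```python
import codecs

# Truncated UTF-16-LE: "h" followed by half of a code unit.
broken = b"h\x00i"

# Default ("strict") handling raises, so the application decides.
try:
    broken.decode("utf-16-le")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Built-in handlers: drop the bad bytes, or substitute U+FFFD.
ignored = broken.decode("utf-16-le", "ignore")
replaced = broken.decode("utf-16-le", "replace")


def mark_bad(exc):
    """Hypothetical handler: insert a visible marker and resume
    decoding just after the offending bytes."""
    if isinstance(exc, UnicodeDecodeError):
        return ("<?>", exc.end)
    raise exc


# Register the handler under an application-chosen name.
codecs.register_error("mark-bad", mark_bad)
marked = broken.decode("utf-16-le", "mark-bad")
```

The codec only reports the problem; whether to drop, replace, mark, or reject the data stays with the application.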
> -1 on removing it or anything
> like that, unless under control of the application (ie, the program
> written in Python, not Python itself). It's far too easy for software
> to generate broken Unicode streams, and the choice of how to deal
> with those should be with the application, not with the implementation.
> An egregious example was the Outlook Express distributed with
> early Win2k betas, which produced MIME bodies with apparent
> Content-Type: text/html; charset=utf-16, but the HTML tags and
> newlines were 7-bit ASCII!