[Python-Dev] Unicode byte order mark decoding

Tue Apr 5 12:34:53 CEST 2005

Stephen J. Turnbull wrote:
>>>>>>"MAL" == M  <mal at egenix.com> writes:
> 
> 
>     MAL> The BOM (byte order mark) was a non-standard Microsoft
>     MAL> invention to detect Unicode text data as such (MS always uses
>     MAL> UTF-16-LE for Unicode text files).
> 
> The Japanese "memopado" (Notepad) uses UTF-8 signatures; it even adds
> them to existing UTF-8 files lacking them.

Is that an MS application? AFAIK, Notepad, WordPad and MS Office
always use UTF-16-LE + BOM when saving text as "Unicode text".

>     MAL> -1; there's no standard for UTF-8 BOMs - adding it to the
>     MAL> codecs module was probably a mistake to begin with. You
>     MAL> usually only get UTF-8 files with BOM marks as the result of
>     MAL> recoding UTF-16 files into UTF-8.
> 
> There is a standard for UTF-8 _signatures_, however.  I don't have the
> most recent version of the ISO-10646 standard, but Amendment 2 (which
> defined UTF-8 for ISO-10646) specifically added the UTF-8 signature to
> Annex F of that standard.  Evan quotes Version 4 of the Unicode
> standard, which explicitly defines the UTF-8 signature.

OK, as a signature the BOM does make some sense. Whether
stripping signatures from a document is a good idea
is a different matter, though.

Here's the Unicode Cons. FAQ on the subject:

	http://www.unicode.org/faq/utf_bom.html#22

They also explicitly warn about adding BOMs to UTF-8 data
since it can break applications and protocols that do not
expect such a signature.
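
To illustrate the kind of breakage the FAQ warns about, here is a
minimal sketch. It assumes a current Python 3 with the standard json
module and the "utf-8-sig" codec; the payload is made up for the
example.

```python
import codecs
import json

# Hypothetical payload: a UTF-8 signature followed by a small JSON document.
payload = codecs.BOM_UTF8 + b'{"answer": 42}'

# A plain UTF-8 decode keeps the signature as a leading U+FEFF character.
text = payload.decode("utf-8")
has_signature = text.startswith("\ufeff")

# json.loads() refuses a str that starts with U+FEFF.
try:
    json.loads(text)
    rejected = False
except json.JSONDecodeError:
    rejected = True

# Decoding with "utf-8-sig" drops the signature, so parsing succeeds.
parsed = json.loads(payload.decode("utf-8-sig"))
```

The consumer here is strict about its input, which is the usual way
a stray signature shows up as a bug.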

> So there is a standard for the UTF-8 signature, and I know of
> applications which produce it.  While I agree with you that Python's
> codecs shouldn't produce it (by default), providing an option to strip
> is a good idea.
> 
> However, this option should be part of the initialization of an IO
> stream which produces Unicodes, _not_ an operation on arbitrary
> internal strings (whether raw or Unicode).

Right.
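
A sketch of what "strip as part of stream initialization" could look
like, assuming a current Python 3 where io.TextIOWrapper and a
"utf-8-sig" codec are available (neither is something this thread
could rely on). The sample data is invented for the example.

```python
import codecs
import io

# Hypothetical file contents: a signature at the start, plus a second
# U+FEFF in the middle of the data that must be left alone.
raw = codecs.BOM_UTF8 + "first line\n\ufeffsecond line\n".encode("utf-8")

# The decision to strip is made once, when the text stream is created.
stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8-sig")
lines = stream.read().splitlines()

# Only the signature at the very start of the stream has been removed.
first_line, second_line = lines
```

The U+FEFF inside the data survives because the stream knows where
its own beginning is, which an arbitrary string does not.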

>     MAL> BTW, how do you know that s came from the start of a file and
>     MAL> not from slicing some already loaded file somewhere in the
>     MAL> middle ?
> 
> The programmer or the application might, but Python's codecs don't.
> The point is that this is also true of rawstrings that happen to
> contain UTF-16 or UTF-32 data.  The UTF-16 ("auto-endian") codec
> shouldn't strip leading BOMs either, unless it has been told it has
> the beginning of the string.

The UTF-16 stream codecs implement this logic.

The UTF-16 decode function will, however, always strip a BOM from
the beginning of a string (and the encode function always writes one).

If the application doesn't want this stripping to happen,
it should use the UTF-16-LE or UTF-16-BE codec instead.
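
A short sketch of the difference, written against a current Python 3
(bytes literals and codecs.getincrementaldecoder are assumptions
relative to the Python of this thread):

```python
import codecs

data = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# The auto-endian codec consumes the BOM to pick the byte order.
auto = data.decode("utf-16")

# The fixed-endian codec treats the same bytes as ordinary data,
# so the BOM comes back as a leading U+FEFF.
explicit = data.decode("utf-16-le")

# Encoding with the auto-endian codec writes a BOM in native byte order.
encoded = "hi".encode("utf-16")
writes_bom = encoded.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

# A stateful decoder only treats the first two bytes as a BOM; the same
# bytes fed again mid-stream decode to U+FEFF and are kept.
decoder = codecs.getincrementaldecoder("utf-16")()
first = decoder.decode(data)
second = decoder.decode(data)
```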

>     MAL> Evan Jones wrote:
> 
>     >> This is *not* a valid Unicode character. The Unicode
>     >> specification (version 4, section 15.8) says the following
>     >> about non-characters:
>     >> 
>     >>> Applications are free to use any of these noncharacter code
>     >>> points internally but should never attempt to exchange
>     >>> them. If a noncharacter is received in open interchange, an
>     >>> application is not required to interpret it in any way. It is
>     >>> good practice, however, to recognize it as a noncharacter and
>     >>> to take appropriate action, such as removing it from the
>     >>> text. Note that Unicode conformance freely allows the removal
>     >>> of these characters. (See C10 in Section 3.2, Conformance
>     >>> Requirements.)
>     >> 
>     >> My interpretation of the specification means that Python should
> 
> The specification _permits_ silent removal; it does not recommend.
> 
>     >> silently remove the character, resulting in a zero length
>     >> Unicode string.  Similarly, both of the following lines should
>     >> also result in a zero length Unicode string:
> 
>     >>>> '\xff\xfe\xfe\xff'.decode( "utf16" )
>     > u'\ufffe'
>     >>>> '\xff\xfe\xff\xff'.decode( "utf16" )
>     > u'\uffff'
> 
> I strongly disagree; these decisions should be left to a higher layer.
> In the case of specified UTFs, the codecs should simply invert the UTF
> to Python's internal encoding.
> 
>     MAL> Hmm, wouldn't it be better to raise an error ? After all, a
>     MAL> reversed BOM mark in the stream looks a lot like you're
>     MAL> trying to decode a UTF-16 stream assuming the wrong byte
>     MAL> order ?!
> 
> +1 on (optionally) raising an error. 

The advantage of raising an error is that the application
can deal with the situation in whatever way seems fit (by
registering a special error handler or by simply using
"ignore" or "replace").
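
For reference, a sketch of the error handler mechanism itself. It
uses invalid UTF-8 input rather than a reversed BOM, purely to show
how a decoding error can be handled; the handler name "example.mark"
and the function mark_error are hypothetical, and the syntax assumes
a current Python 3.

```python
import codecs

def mark_error(exc):
    """Hypothetical handler: substitute '?' for undecodable bytes."""
    if isinstance(exc, UnicodeDecodeError):
        # Return the replacement text and the position to resume at.
        return ("?", exc.end)
    raise exc

codecs.register_error("example.mark", mark_error)

bad = b"abc\xffdef"  # 0xFF can never appear in valid UTF-8

# The default ("strict") handler raises, leaving the choice to the caller.
try:
    bad.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = bad.decode("utf-8", "ignore")        # drop the bad byte
replaced = bad.decode("utf-8", "replace")      # substitute U+FFFD
marked = bad.decode("utf-8", "example.mark")   # use the custom handler
```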

I agree that much of this lies outside the scope of codecs
and should be handled at an application or protocol level.
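
As a sketch of such an application-level check, assuming a current
Python 3 where the codec itself returns U+FFFE without complaint:
the helper name decode_utf16_checked is hypothetical, and raising
ValueError is just one possible policy.

```python
def decode_utf16_checked(data):
    """Hypothetical helper: decode UTF-16, rejecting a leading U+FFFE.

    A U+FFFE right after the BOM looks like a byte-swapped BOM, which
    suggests the data was produced or labelled with the wrong byte order.
    """
    text = data.decode("utf-16")
    if text.startswith("\ufffe"):
        raise ValueError("leading U+FFFE: possibly wrong byte order")
    return text

# Well-formed input decodes normally.
ok = decode_utf16_checked(b"\xff\xfeh\x00i\x00")

# The first example from the quoted message is rejected by the application.
try:
    decode_utf16_checked(b"\xff\xfe\xfe\xff")
    raised = False
except ValueError:
    raised = True
```

The codec stays a plain inverse of the UTF, and the policy lives in
code the application controls.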

> -1 on removing it or anything
> like that, unless under control of the application (ie, the program
> written in Python, not Python itself).  It's far too easy for software
> to generate broken Unicode streams[1], and the choice of how to deal
> with those should be with the application, not with the implementation
> language.
> 
> Footnotes: 
> [1]  An egregious example was the Outlook Express distributed with
> early Win2k betas, which produced MIME bodies with apparent
> Content-Type: text/html; charset=utf-16, but the HTML tags and
> newlines were 7-bit ASCII!
> 

-- 
Marc-Andre Lemburg
eGenix.com
