Re: [Python-Dev] Unicode byte order mark decoding

5 Apr 2005

      Stephen J. Turnbull wrote:
...
...
...
...
...
...
"MAL" == M   writes:
MAL> The BOM (byte order mark) was a non-standard Microsoft
MAL> invention to detect Unicode text data as such (MS always uses
MAL> UTF-16-LE for Unicode text files).
The Japanese "memopado" (Notepad) uses UTF-8 signatures; it even adds
them to existing UTF-8 files lacking them.
Is that an MS application? AFAIK, Notepad, WordPad and MS Office
always use UTF-16-LE + BOM when saving text as "Unicode text".
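
For readers who want to see the byte sequences under discussion, here is a minimal sketch using the BOM constants in the standard `codecs` module. It assumes Python 3; the helper name `sniff_signature` is hypothetical and not part of any library.

```python
import codecs

# Longest signatures first: the UTF-32-LE BOM (FF FE 00 00) begins with
# the UTF-16-LE BOM (FF FE), so it has to be tested before it.
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_signature(data):
    """Return (encoding, bom_length) if data starts with a known BOM,
    otherwise (None, 0). Hypothetical helper for illustration only."""
    for bom, name in _SIGNATURES:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

# What a "Unicode text" file saved as UTF-16-LE + BOM would start with.
utf16_file = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
# What a UTF-8 file carrying a signature would start with.
utf8_file = codecs.BOM_UTF8 + "hi".encode("utf-8")

print(sniff_signature(utf16_file))
print(sniff_signature(utf8_file))
```

A signature only suggests an encoding; data without one yields `(None, 0)` and has to be identified some other way.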
...
MAL> -1; there's no standard for UTF-8 BOMs - adding it to the
MAL> codecs module was probably a mistake to begin with. You
MAL> usually only get UTF-8 files with BOM marks as the result of
MAL> recoding UTF-16 files into UTF-8.
There is a standard for UTF-8 _signatures_, however.  I don't have the
most recent version of the ISO-10646 standard, but Amendment 2 (which
defined UTF-8 for ISO-10646) specifically added the UTF-8 signature to
Annex F of that standard.  Evan quotes Version 4 of the Unicode
standard, which explicitly defines the UTF-8 signature.
OK, as a signature the BOM does make some sense. Whether
stripping signatures from a document is a good idea
is a different matter, though.

Here's the Unicode Consortium's FAQ on the subject:

	http://www.unicode.org/faq/utf_bom.html#22

They also explicitly warn about adding BOMs to UTF-8 data
since it can break applications and protocols that do not
expect such a signature.
...
So there is a standard for the UTF-8 signature, and I know of
applications which produce it.  While I agree with you that Python's
codecs shouldn't produce it (by default), providing an option to strip
is a good idea.
However, this option should be part of the initialization of an IO
stream which produces Unicodes, _not_ an operation on arbitrary
internal strings (whether raw or Unicode).
Right.
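
As an illustration of signature stripping chosen when the stream is set up, rather than applied to arbitrary strings: later Python releases ship a `utf-8-sig` codec that behaves this way. The sketch below assumes Python 3 and is not what was available at the time of this thread.

```python
import io

raw = b"\xef\xbb\xbfhello"

# Plain UTF-8 keeps the signature as a leading U+FEFF character.
plain = raw.decode("utf-8")

# "utf-8-sig" drops one leading signature if it is present.
stripped = raw.decode("utf-8-sig")

# The choice is made when the text stream is opened over the byte stream.
stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8-sig")
from_stream = stream.read()

# Only a signature at the very start is removed; a later U+FEFF is data.
inner = b"a\xef\xbb\xbfb".decode("utf-8-sig")

print(repr(plain), repr(stripped), repr(from_stream), repr(inner))
```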
...
MAL> BTW, how do you know that s came from the start of a file and
MAL> not from slicing some already loaded file somewhere in the
MAL> middle ?
The programmer or the application might, but Python's codecs don't.
The point is that this is also true of raw strings that happen to
contain UTF-16 or UTF-32 data.  The UTF-16 ("auto-endian") codec
shouldn't strip leading BOMs either, unless it has been told it has
the beginning of the string.
The UTF-16 stream codecs implement this logic.
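
A small sketch of that start-of-stream behaviour, assuming Python 3 and using the incremental decoder from the `codecs` module as a stand-in for the stream codecs mentioned here:

```python
import codecs

decoder = codecs.getincrementaldecoder("utf-16")()

# The first chunk is the start of the stream: the BOM selects
# little-endian byte order and is consumed.
first = decoder.decode(b"\xff\xfea\x00")

# A later chunk that happens to begin with FF FE is ordinary data and
# comes through as U+FEFF followed by "b".
second = decoder.decode(b"\xff\xfeb\x00")

print(repr(first), repr(second))
```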

The UTF-16 decode function will, however, always strip a BOM
from the beginning of a string (and the encode function always writes one).

If the application doesn't want this stripping to happen,
it should use the UTF-16-LE or UTF-16-BE codec instead.
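
The difference between the auto-endian codec and the explicit-endian ones can be seen in a short sketch (Python 3 assumed; the byte strings are made-up sample data):

```python
import codecs

data = b"\xff\xfea\x00"

# The auto-endian codec uses the BOM to pick the byte order and drops it.
auto = data.decode("utf-16")

# The explicit-endian codec treats the same two bytes as a character.
explicit = data.decode("utf-16-le")

# When encoding, "utf-16" writes a BOM in native byte order,
# while "utf-16-le" writes none.
with_bom = "a".encode("utf-16")
without_bom = "a".encode("utf-16-le")

print(repr(auto), repr(explicit), with_bom, without_bom)
```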
...
MAL> Evan Jones wrote:
>> This is *not* a valid Unicode character. The Unicode
>> specification (version 4, section 15.8) says the following
>> about non-characters:
>>
>>> Applications are free to use any of these noncharacter code
>>> points internally but should never attempt to exchange
>>> them. If a noncharacter is received in open interchange, an
>>> application is not required to interpret it in any way. It is
>>> good practice, however, to recognize it as a noncharacter and
>>> to take appropriate action, such as removing it from the
>>> text. Note that Unicode conformance freely allows the removal
>>> of these characters. (See C10 in Section 3.2, Conformance
>>> Requirements.)
>>
>> My interpretation of the specification means that Python should
The specification _permits_ silent removal; it does not recommend it.
>> silently remove the character, resulting in a zero length
>> Unicode string.  Similarly, both of the following lines should
>> also result in a zero length Unicode string:
>>>> '\xff\xfe\xfe\xff'.decode( "utf16" )
> u'\ufffe'
>>>> '\xff\xfe\xff\xff'.decode( "utf16" )
> u'\uffff'
I strongly disagree; these decisions should be left to a higher layer.
In the case of specified UTFs, the codecs should simply invert the UTF
to Python's internal encoding.
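
To make the layering concrete, here is a hedged sketch in Python 3 syntax: the codec just decodes, and a separate application-level function decides what to do with noncharacters. The helper names `is_noncharacter` and `strip_noncharacters` are hypothetical.

```python
def is_noncharacter(ch):
    """True for the 66 Unicode noncharacters: U+FDD0..U+FDEF and the
    last two code points of every plane (U+xxFFFE, U+xxFFFF)."""
    cp = ord(ch)
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def strip_noncharacters(text):
    """Application-level policy: drop noncharacters from decoded text."""
    return "".join(ch for ch in text if not is_noncharacter(ch))

# The codec inverts the UTF and leaves the noncharacter in place.
decoded = b"\xff\xfe\xfe\xff".decode("utf-16")

# Removal happens only because this application asked for it.
cleaned = strip_noncharacters(decoded)

print(repr(decoded), repr(cleaned))
```

An application with a different policy could just as well raise or log here; the codec itself stays neutral.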
MAL> Hmm, wouldn't it be better to raise an error ? After all, a
MAL> reversed BOM mark in the stream looks a lot like you're
MAL> trying to decode a UTF-16 stream assuming the wrong byte
MAL> order ?!
+1 on (optionally) raising an error.
The advantage of raising an error is that the application
can deal with the situation in whatever way seems fit (by
registering a special error handler or by simply using
"ignore" or "replace").

I agree that much of this lies outside the scope of codecs
and should be handled at an application or protocol level.
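
One way an application could act on a reversed BOM is sketched below. This is an assumption-laden illustration in Python 3, not codec behaviour: the function `decode_utf16_checked`, the exception `ReversedBOMError` and the `on_reversed_bom` parameter are all hypothetical, and the policy names merely echo the standard "strict", "ignore" and "replace" error handlers.

```python
class ReversedBOMError(ValueError):
    """Raised when decoded text starts with U+FFFE (a byte-swapped BOM)."""

def decode_utf16_checked(data, encoding, on_reversed_bom="strict"):
    """Decode with an explicit-endian codec, then apply an application
    policy if the result starts with U+FFFE."""
    text = data.decode(encoding)
    if text.startswith("\ufffe"):
        if on_reversed_bom == "strict":
            raise ReversedBOMError(
                "leading U+FFFE: data was probably decoded as %s with "
                "the wrong byte order" % encoding)
        if on_reversed_bom == "ignore":
            return text[1:]
        if on_reversed_bom == "replace":
            return "\ufffd" + text[1:]
        raise ValueError("unknown policy: %r" % on_reversed_bom)
    return text

big_endian = b"\xfe\xff\x00a"  # big-endian BOM followed by "a"

try:
    decode_utf16_checked(big_endian, "utf-16-le")
except ReversedBOMError as exc:
    print("rejected:", exc)
```

With "strict" the caller finds out about the likely byte-order mistake and can retry with the other byte order; "ignore" and "replace" only hide the marker and leave the rest of the mis-decoded text as it is.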
...
-1 on removing it or anything
like that, unless under control of the application (ie, the program
written in Python, not Python itself).  It's far too easy for software
to generate broken Unicode streams[1], and the choice of how to deal
with those should be with the application, not with the implementation
language.
Footnotes: 
[1]  An egregious example was the Outlook Express distributed with
early Win2k betas, which produced MIME bodies with apparent
Content-Type: text/html; charset=utf-16, but the HTML tags and
newlines were 7-bit ASCII!
-- 
Marc-Andre Lemburg
eGenix.com


M.-A. Lemburg