[Python-Dev] Unicode byte order mark decoding

M.-A. Lemburg mal at egenix.com
Fri Apr 8 00:12:53 CEST 2005


Martin v. Löwis wrote:
> Nicholas Bastin wrote:
> 
>>It would be nice if you could optionally specify that the codec would
>>assume UTF-16BE if no BOM was present, and not raise UnicodeError in
>>that case, which would preserve the current behaviour as well as allow
>>users to ask for behaviour which conforms to the standard.
> 
> 
> Alternatively, the UTF-16BE codec could support the BOM, and do
> UTF-16LE if the "other" BOM is found.

That would violate the Unicode standard - the BOM character
for UTF-16-LE and -BE must be interpreted as ZWNBSP.
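
As a small illustration of that rule, here is how present-day Python 3
behaves (a sketch, not the 2005 interpreter this thread is about): the
endianness-specific codec keeps a leading U+FEFF as an ordinary character,
while the generic UTF-16 codec consumes it as a byte order signature.

```python
import codecs

# "A" preceded by the big-endian BOM bytes FE FF.
data = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")

# The endianness-specific codec keeps U+FEFF (ZWNBSP) in the result.
as_be = data.decode("utf-16-be")

# The generic codec treats the same two bytes as a byte order
# signature and drops them from the decoded text.
as_generic = data.decode("utf-16")

print(len(as_be), len(as_generic))
```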

> This would also support your usecase, and in a better way. The
> Unicode assertion that UTF-16 is BE by default is void these
> days - there is *always* a higher layer protocol, and it more
> often than not specifies (perhaps not in English words, but
> only in the source code of the generator) that the default should
> be LE.

I've checked the various versions of the Unicode standard
docs: it seems that the quote you have was silently introduced
between 3.0 and 4.0.

Python currently uses version 3.2.0 of the standard, and I don't
think enough people are aware of the change in the standard to make
a case for dropping the exception that is raised when the UTF-16
codec finds a stream without a BOM.

Once we switch to 4.1 or later, we can make the change
in the native UTF-16 codec as you requested.

Personally, I think that the Unicode consortium should not
have introduced a default for the UTF-16 encoding byte
order. Using big endian as default in a world where most
Unicode data is created on little endian machines is not
very realistic either.

Note that the UTF-16 codec starts reading data in
the machine's native byte order and then learns a possibly
different byte order by looking for BOMs.
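
The BOM-driven switch can be observed from Python code. The sketch below
uses present-day Python 3 and only the stateless bytes.decode() call. The
stream and incremental decoders handle BOM-less input differently, so the
last step should be read as an assumption about the stateless decoder only.
The variable names are made up for the example.

```python
import codecs
import sys

text = "BOM"

# The same text in both byte orders, each prefixed with its BOM.
big = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
little = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# The generic codec picks the byte order from the BOM, whatever the host is.
from_big = big.decode("utf-16")
from_little = little.decode("utf-16")

# Without a BOM the stateless decoder stays in the native byte order.
native_codec = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
bare = text.encode(native_codec)
from_bare = bare.decode("utf-16")

print(from_big, from_little, from_bare)
```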

Writing a codec that implements the 4.0 behavior
is easy, though.
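
One possible shape for such a decoder, written for present-day Python 3
rather than the interpreter of 2005. The function name
decode_utf16_default_be is hypothetical, and this is a plain helper
function, not a registered codec: it honours a BOM if one is present and
otherwise assumes big endian instead of raising.

```python
import codecs


def decode_utf16_default_be(data: bytes, errors: str = "strict") -> str:
    """Decode UTF-16, assuming big endian when no BOM is present.

    Hypothetical helper for illustration only.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        # Big-endian BOM: strip it and decode the rest as UTF-16BE.
        return data[2:].decode("utf-16-be", errors)
    if data.startswith(codecs.BOM_UTF16_LE):
        # Little-endian BOM: strip it and decode the rest as UTF-16LE.
        return data[2:].decode("utf-16-le", errors)
    # No BOM: fall back to big endian instead of raising.
    return data.decode("utf-16-be", errors)


print(decode_utf16_default_be(b"\x00A\x00B"))
```

Only a BOM at the very start of the input is treated as a signature. A
U+FEFF later in the data is left in the result as a regular character.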

-- 
Marc-Andre Lemburg
eGenix.com


