Re: [Python-Dev] Unicode byte order mark decoding

7 Apr 2005

Martin v. Löwis wrote:
...
Nicholas Bastin wrote:
...
It would be nice if you could optionally specify that the codec would
assume UTF-16BE if no BOM was present, and not raise UnicodeError in
that case, which would preserve the current behaviour as well as allow
users to ask for behaviour which conforms to the standard.
Alternatively, the UTF-16BE codec could support the BOM, and do
UTF-16LE if the "other" BOM is found.
That would violate the Unicode standard - the BOM character
for UTF-16-LE and -BE must be interpreted as ZWNBSP.
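
To illustrate the point about the explicit-endian codecs, here is a small
sketch using the codec names found in current Python releases ("utf-16-be"
and "utf-16"). The sample bytes are made up for the example; the behaviour of
the codecs in the 2005 code base under discussion may differ in detail.

```python
import codecs

# Big-endian bytes for "a", preceded by the two bytes FE FF.
data = codecs.BOM_UTF16_BE + "a".encode("utf-16-be")

# The explicit-endian codec keeps U+FEFF as an ordinary character (ZWNBSP).
explicit = data.decode("utf-16-be")

# The generic codec consumes the same two bytes as a byte order mark.
generic = data.decode("utf-16")

print(repr(explicit))
print(repr(generic))
```

With "utf-16-be" the leading U+FEFF survives in the decoded text, whereas
"utf-16" treats it as a signature and drops it.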
...
This would also support your use case, and in a better way. The
Unicode assertion that UTF-16 is BE by default is void these
days - there is *always* a higher layer protocol, and it more
often than not specifies (perhaps not in English words, but
only in the source code of the generator) that the default should
be LE.
I've checked the various versions of the Unicode standard
docs: it seems that the quote you have was silently introduced
between 3.0 and 4.0.

Python currently uses version 3.2.0 of the standard, and I don't
think enough people are aware of the change in the standard to make
a case for dropping the exception that is raised when the UTF-16
codec finds a stream without a BOM.

Once we switch to 4.1 or later, we can make the change
in the native UTF-16 codec as you requested.

Personally, I think that the Unicode consortium should not
have introduced a default for the UTF-16 encoding byte
order. Using big endian as default in a world where most
Unicode data is created on little endian machines is not
very realistic either.

Note that the UTF-16 codec starts reading data in
the machine's native byte order and then learns a possibly
different byte order by looking for BOMs.
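
A rough way to observe this from Python code is sketched below. It uses the
stateless "utf-16" codec as found in current CPython releases; the variable
names are only illustrative, and the stream reader discussed in this thread
may treat a missing BOM differently.

```python
import codecs
import sys

# Pick the BOM and explicit codec that match this machine's byte order.
little = sys.byteorder == "little"
native_bom = codecs.BOM_UTF16_LE if little else codecs.BOM_UTF16_BE
native_codec = "utf-16-le" if little else "utf-16-be"

# Encoding with "utf-16" writes a BOM in native byte order first.
encoded = "hi".encode("utf-16")
starts_with_native_bom = encoded.startswith(native_bom)

# Decoding with "utf-16" follows whichever BOM it finds.
from_le = (codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")).decode("utf-16")
from_be = (codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")).decode("utf-16")

# Without a BOM, the stateless decoder stays in native byte order.
no_bom = "hi".encode(native_codec).decode("utf-16")

print(starts_with_native_bom, from_le, from_be, no_bom)
```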

Implementing a codec which implements the 4.0 behavior
is easy, though.
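
As a minimal sketch of what such a codec's decoding step could look like
(not the actual implementation; the function name
decode_utf16_default_be is hypothetical and only covers stateless
decoding of a complete byte string):

```python
import codecs


def decode_utf16_default_be(data: bytes) -> str:
    """Decode UTF-16, assuming big endian when no BOM is present.

    Hypothetical helper for illustration; not part of the codecs module.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        # A big-endian BOM: strip it and decode the rest as big endian.
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        # A little-endian BOM: strip it and decode the rest as little endian.
        return data[2:].decode("utf-16-le")
    # No BOM: fall back to big endian instead of raising an error.
    return data.decode("utf-16-be")


print(decode_utf16_default_be(b"\x00h\x00i"))          # no BOM
print(decode_utf16_default_be(b"\xff\xfeh\x00i\x00"))  # little-endian BOM
```

A real codec would also need an incremental or stream variant that remembers
the detected byte order across calls; that part is left out here.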

-- 
Marc-Andre Lemburg
eGenix.com

