[Python-Dev] Unicode byte order mark decoding
mal at egenix.com
Fri Apr 8 00:12:53 CEST 2005
Martin v. Löwis wrote:
> Nicholas Bastin wrote:
>>It would be nice if you could optionally specify that the codec would
>>assume UTF-16BE if no BOM was present, and not raise UnicodeError in
>>that case, which would preserve the current behaviour as well as allow
>>users' to ask for behaviour which conforms to the standard.
> Alternatively, the UTF-16BE codec could support the BOM, and do
> UTF-16LE if the "other" BOM is found.
That would violate the Unicode standard - the BOM character
for UTF-16-LE and -BE must be interpreted as ZWNBSP.
> This would also support your usecase, and in a better way. The
> Unicode assertion that UTF-16 is BE by default is void these
> days - there is *always* a higher layer protocol, and it more
> often than not specifies (perhaps not in English words, but
> only in the source code of the generator) that the default should
> by LE.
I've checked the various versions of the Unicode standard
docs: it seems that the quote you have was silently introduced
between 3.0 and 4.0.
Python currently uses version 3.2.0 of the standard and I don't
think enough people are aware of the change in the standard to make
a case for dropping the exception raising in the case of a UTF-16
finding a stream without a BOM mark.
By the time we switch to 4.1 or later, we can then
make the change in the native UTF-16 codec as you
Personally, I think that the Unicode consortium should not
have introduced a default for the UTF-16 encoding byte
order. Using big endian as default in a world where most
Unicode data is created on little endian machines is not
very realistic either.
Note that the UTF-16 codec starts reading data in
the machines native byte order and then learns a possibly
different byte order by looking for BOMs.
Implementing a codec which implements the 4.0 behavior
is easy, though.
Professional Python Services directly from the Source (#1, Apr 07 2005)
>>> Python/Zope Consulting and Support ... http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::
More information about the Python-Dev