[Python-Dev] Unicode byte order mark decoding

M.-A. Lemburg mal at egenix.com
Thu Apr 7 11:07:58 CEST 2005


Nicholas Bastin wrote:
> 
> On Apr 5, 2005, at 6:19 AM, M.-A. Lemburg wrote:
> 
>> Note that the UTF-16 codec is strict w/r to the presence
>> of the BOM mark: you get a UnicodeError if a stream does
>> not start with a BOM mark. For the UTF-8-SIG codec, this
>> should probably be relaxed to not require the BOM.
> 
> 
> I've actually been confused about this point for quite some time now,
> but never had a chance to bring it up.  I do not understand why
> UnicodeError should be raised if there is no BOM.  I know that PEP-100
> says:
> 
> 'utf-16':             16-bit variable length encoding (little/big endian)
> 
> and:
> 
> Note: 'utf-16' should be implemented by using and requiring byte order
> marks (BOM) for file input/output.
> 
> But this appears to be in error, at least in the current unicode
> standard.  'utf-16', as defined by the unicode standard, is big-endian
> in the absence of a BOM:
> 
> ---
> 3.10.D42:  UTF-16 encoding scheme:
> ...
> * The UTF-16 encoding scheme may or may not begin with a BOM.  However,
> when there is no BOM, and in the absence of a higher-level protocol, the
> byte order of the UTF-16 encoding scheme is big-endian.
> ---

The problem is "in the absence of a higher-level protocol": the
codec doesn't know anything about a protocol - it's the application
using the codec that knows which protocol gets used. It's a lot
safer to require the BOM for UTF-16 streams and raise an exception
when it is missing, so that the application decides whether to use
UTF-16-BE or the far more common UTF-16-LE.
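
A minimal sketch of that decision on the application side, written
against present-day Python 3 rather than the 2005 codec. The function
decode_utf16 and its fallback parameter are hypothetical names used
for illustration only; they are not part of the codecs module:

```python
import codecs

def decode_utf16(data, fallback=None):
    """Decode UTF-16 bytes, honouring a leading BOM if there is one.

    `fallback` is a hypothetical parameter: the codec name the
    application chooses (from its own protocol knowledge) for data
    that carries no BOM, e.g. "utf-16-le" or "utf-16-be".
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    if fallback is None:
        # No BOM and no protocol information: refuse to guess.
        raise UnicodeError("UTF-16 data does not start with a BOM")
    return data.decode(fallback)
```

An application that knows its protocol defaults to big-endian would
call decode_utf16(data, fallback="utf-16-be"); one that knows nothing
gets the exception and has to decide.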

Unlike for the UTF-8 codec, the BOM for UTF-16 is a configuration
parameter (it selects the byte order), not merely a signature.
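
To illustrate the difference, here is a small sketch in present-day
Python 3 (the sample string is arbitrary, and utf-8-sig is used as it
exists in current Python): the same UTF-16 payload reads as different
text depending on the byte order chosen, while a UTF-8 payload reads
the same with or without its signature.

```python
import codecs

payload = "A".encode("utf-16-be")      # the two bytes 0x00 0x41

as_be = payload.decode("utf-16-be")    # read as big-endian: 'A'
as_le = payload.decode("utf-16-le")    # read as little-endian: U+4100

utf8_plain = "A".encode("utf-8")
utf8_signed = codecs.BOM_UTF8 + utf8_plain

# utf-8-sig strips the signature if present and accepts its absence.
plain_text = utf8_plain.decode("utf-8-sig")
signed_text = utf8_signed.decode("utf-8-sig")
```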

In terms of history, I don't recall whether your quote was
already in the standard at the time I wrote the PEP. You are the
first to have reported a problem with the current implementation
(which has been around since 2000), so I believe that application
writers are more comfortable with the way the UTF-16 codec
is currently implemented. Explicit is better than implicit :-)

> The current implementation of the utf-16 codecs makes for some
> irritating gymnastics to write the BOM into the file before reading it
> if it contains no BOM, which seems quite like a bug in the codec. 

The codec writes a BOM in the first call to .write() - it
doesn't write a BOM before reading from the file.
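
A quick way to observe this, sketched against present-day Python 3
with an in-memory stream standing in for the file (the variable names
are arbitrary):

```python
import codecs
import io

buffer = io.BytesIO()
writer = codecs.getwriter("utf-16")(buffer)

writer.write("ab")   # first call: BOM plus data, in native byte order
writer.write("cd")   # later calls: data only, no second BOM

data = buffer.getvalue()
```

Here data holds one 2-byte BOM followed by four 2-byte code units.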

> I allow for the possibility that this was ambiguous in the standard when
> the PEP was written, but it is certainly not ambiguous now.

See above.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com


