[Python-Dev] Unicode byte order mark decoding
M.-A. Lemburg
mal at egenix.com
Thu Apr 7 11:07:58 CEST 2005
Nicholas Bastin wrote:
>
> On Apr 5, 2005, at 6:19 AM, M.-A. Lemburg wrote:
>
>> Note that the UTF-16 codec is strict w/r to the presence
>> of the BOM mark: you get a UnicodeError if a stream does
>> not start with a BOM mark. For the UTF-8-SIG codec, this
>> should probably be relaxed to not require the BOM.
>
>
> I've actually been confused about this point for quite some time now,
> but never had a chance to bring it up. I do not understand why
> UnicodeError should be raised if there is no BOM. I know that PEP-100
> says:
>
> 'utf-16': 16-bit variable length encoding (little/big endian)
>
> and:
>
> Note: 'utf-16' should be implemented by using and requiring byte order
> marks (BOM) for file input/output.
>
> But this appears to be in error, at least in the current unicode
> standard. 'utf-16', as defined by the unicode standard, is big-endian
> in the absence of a BOM:
>
> ---
> 3.10.D42: UTF-16 encoding scheme:
> ...
> * The UTF-16 encoding scheme may or may not begin with a BOM. However,
> when there is no BOM, and in the absence of a higher-level protocol, the
> byte order of the UTF-16 encoding scheme is big-endian.
> ---
The problem is "in the absence of a higher-level protocol": the
codec doesn't know anything about a protocol - it's the application
using the codec that knows which protocol gets used. It's a lot
safer to require the BOM for UTF-16 streams and raise an exception
when it is missing, so that the application decides whether to use
UTF-16-BE or the by far more common UTF-16-LE.
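To illustrate the division of labour described above, here is a minimal
sketch of an application-level helper. The name decode_utf16 and its
fallback parameter are hypothetical and not part of the codecs module;
the helper inspects the BOM itself rather than relying on any particular
codec behaviour.

```python
import codecs


def decode_utf16(data, fallback=None):
    """Decode UTF-16 bytes, honouring a BOM when one is present.

    `fallback` is a hypothetical parameter: the byte-order-specific codec
    name ("utf-16-le" or "utf-16-be") that the application's own protocol
    prescribes for BOM-less data. With no fallback, missing BOMs are an error.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    if fallback is None:
        # Explicit is better than implicit: refuse to guess the byte order.
        raise UnicodeError("UTF-16 data does not start with a BOM")
    return data.decode(fallback)


bom_less = "abc".encode("utf-16-be")
print(decode_utf16(bom_less, fallback="utf-16-be"))
```

An application that follows the Unicode standard's default would pass
fallback="utf-16-be"; one that reads BOM-less files produced on
little-endian systems would pass "utf-16-le".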
Unlike for the UTF-8 codec, the BOM for UTF-16 is a configuration
parameter, not merely a signature.
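A short sketch of that distinction, using only byte strings built by hand
(the variable names are illustrative, and utf-8-sig is shown with the
relaxed, BOM-optional behaviour proposed earlier in this thread):

```python
import codecs

text = "hi"

# UTF-16: the BOM selects the byte order the decoder must use.
le_bytes = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
be_bytes = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
same_le = le_bytes.decode("utf-16")
same_be = be_bytes.decode("utf-16")

# Without a BOM, the same payload read under the wrong byte order
# decodes "successfully" into different characters.
payload = text.encode("utf-16-be")
misread = payload.decode("utf-16-le")

# UTF-8: the BOM is only a signature; the bytes mean the same either way.
with_sig = (codecs.BOM_UTF8 + text.encode("utf-8")).decode("utf-8-sig")
without_sig = text.encode("utf-8").decode("utf-8-sig")

print(same_le, same_be, repr(misread), with_sig, without_sig)
```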
In terms of history, I don't recall whether your quote was
already in the standard at the time I wrote the PEP. You are the
first to have reported a problem with the current implementation
(which has been around since 2000), so I believe that application
writers are more comfortable with the way the UTF-16 codec
is currently implemented. Explicit is better than implicit :-)
> The current implementation of the utf-16 codecs makes for some
> irritating gymnastics to write the BOM into the file before reading it
> if it contains no BOM, which seems quite like a bug in the codec.
The codec writes a BOM in the first call to .write() - it
doesn't write a BOM before reading from the file.
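The write-side behaviour can be checked with a small sketch like the
following, which assumes an in-memory io.BytesIO buffer standing in for
a file:

```python
import codecs
import io

buf = io.BytesIO()
writer = codecs.getwriter("utf-16")(buf)

writer.write("a")  # first call: BOM (native byte order) plus the data
writer.write("b")  # later calls: data only, no second BOM

data = buf.getvalue()
print(data.startswith(codecs.BOM_UTF16), len(data))
```

The buffer holds one two-byte BOM followed by two two-byte code units, and
it round-trips through the plain utf-16 decoder.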
> I allow for the possibility that this was ambiguous in the standard when
> the PEP was written, but it is certainly not ambiguous now.
See above.
Thanks,
--
Marc-Andre Lemburg
eGenix.com