[Python-Dev] Unicode byte order mark decoding

Nicholas Bastin nbastin at opnet.com
Thu Apr 7 05:09:24 CEST 2005


On Apr 5, 2005, at 6:19 AM, M.-A. Lemburg wrote:

> Note that the UTF-16 codec is strict w/r to the presence
> of the BOM mark: you get a UnicodeError if a stream does
> not start with a BOM mark. For the UTF-8-SIG codec, this
> should probably be relaxed to not require the BOM.

I've actually been confused about this point for quite some time now, 
but never had a chance to bring it up.  I do not understand why 
UnicodeError should be raised if there is no BOM.  I know that PEP-100 
says:

'utf-16':             16-bit variable length encoding (little/big endian)

and:

Note: 'utf-16' should be implemented by using and requiring byte order 
marks (BOM) for file input/output.

But this appears to be in error, at least according to the current 
Unicode standard.  'utf-16', as defined by the Unicode standard, is 
big-endian in the absence of a BOM:

---
3.10.D42:  UTF-16 encoding scheme:
...
* The UTF-16 encoding scheme may or may not begin with a BOM.  However, 
when there is no BOM, and in the absence of a higher-level protocol, 
the byte order of the UTF-16 encoding scheme is big-endian.
---
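
To make that rule concrete, here is a minimal sketch of a decoder that 
follows it: honour a BOM if one is present, otherwise fall back to 
big-endian.  The name decode_utf16_default_be is hypothetical (it is not 
part of the codecs module), and the sketch assumes no higher-level 
protocol specifies the byte order.

```python
import codecs


def decode_utf16_default_be(data: bytes) -> str:
    """Decode UTF-16 bytes, treating BOM-less input as big-endian.

    Hypothetical helper for illustration; not an existing codecs API.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        # Big-endian BOM present: strip it and decode as big-endian.
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        # Little-endian BOM present: strip it and decode as little-endian.
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    # No BOM: assume big-endian, as the quoted definition says.
    return data.decode("utf-16-be")
```

For example, decode_utf16_default_be(b"\x00h\x00i") returns "hi" even 
though the input carries no BOM.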

The current implementation of the utf-16 codec forces some irritating 
gymnastics: if a file contains no BOM, you have to write one into it 
before you can read it, which looks very much like a bug in the codec.  
I allow for the possibility that this was ambiguous in the standard 
when the PEP was written, but it is certainly not ambiguous now.
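
The workaround amounts to something like the following sketch, which 
prepends a big-endian BOM in memory instead of rewriting the file.  The 
function name read_utf16_stream is hypothetical, and choosing big-endian 
for BOM-less data is an assumption taken from the definition quoted 
above.

```python
import codecs
import io


def read_utf16_stream(raw: bytes) -> str:
    """Read UTF-16 bytes through the 'utf-16' stream reader.

    Hypothetical helper: adds a big-endian BOM when none is present,
    so the reader always sees a BOM at the start of the stream.
    """
    if not raw.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        # Assumption: BOM-less data is big-endian.
        raw = codecs.BOM_UTF16_BE + raw
    reader = codecs.getreader("utf-16")(io.BytesIO(raw))
    # The reader consumes the BOM and returns only the text.
    return reader.read()
```

This is exactly the kind of extra step that would be unnecessary if the 
codec defaulted to big-endian on its own.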

--
Nick


