[Python-Dev] Unicode byte order mark decoding

M.-A. Lemburg mal at egenix.com
Thu Apr 7 17:35:38 CEST 2005


Nicholas Bastin wrote:
> 
> On Apr 7, 2005, at 5:07 AM, M.-A. Lemburg wrote:
> 
>>> The current implementation of the utf-16 codecs makes for some
>>> irritating gymnastics to write the BOM into the file before reading it
>>> if it contains no BOM, which seems quite like a bug in the codec.
>>
>>
>> The codec writes a BOM in the first call to .write() - it
>> doesn't write a BOM before reading from the file.
> 
> 
> Yes, see, I read a *lot* of UTF-16 that comes from other sources.  It's
> not a matter of writing with python and reading with python.

Ok, but I don't really follow you here: you are suggesting to
relax the current UTF-16 behavior and to start defaulting to
UTF-16-BE if no BOM is present - that's most likely going to
cause more problems that it seems to solve: namely complete
garbage if the data turns out to be UTF-16-LE encoded and,
what's worse, enters the application undetected.

If you do have UTF-16 without a BOM mark it's much better
to let a short function analyze the text by reading for first
few bytes of the file and then make an educated guess based
on the findings. You can then process the file using one
of the other codecs UTF-16-LE or -BE.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Apr 07 2005)
>>> Python/Zope Consulting and Support ...        http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::


More information about the Python-Dev mailing list