[Python-Dev] Unicode byte order mark decoding

Walter Dörwald walter at livinglogic.de
Thu Apr 7 23:32:28 CEST 2005


Nicholas Bastin sagte:

> On Apr 7, 2005, at 11:35 AM, M.-A. Lemburg wrote:
>
> [...]
>> If you do have UTF-16 without a BOM mark it's much better
>> to let a short function analyze the text by reading for first
>> few bytes of the file and then make an educated guess based
>> on the findings. You can then process the file using one
>> of the other codecs UTF-16-LE or -BE.
>
> This is about what we do now - we catch UnicodeError and
> then add a BOM  to the file, and read it again.  We know
> our files are UTF-16BE if they  don't have a BOM, as the
> files are written by code which observes the  spec.
> We can't use UTF-16BE all the time, because sometimes
> they're UTF-16LE, and in those cases the BOM is set.
>
> It would be nice if you could optionally specify that the
> codec would assume UTF-16BE if no BOM was present,
> and not raise UnicodeError in  that case, which would
> preserve the current behaviour as well as allow users'
> to ask for behaviour which conforms to the standard.

It should be feasible to implement your own codec for that
based on Lib/encodings/utf_16.py. Simply replace the line
in StreamReader.decode():
   raise UnicodeError,"UTF-16 stream does not start with BOM"
with:
   self.decode = codecs.utf_16_be_decode
and you should be done.

> [...]

Bye,
   Walter Dörwald





More information about the Python-Dev mailing list