Re: [Python-Dev] Unicode byte order mark decoding

7 Apr 2005


      Nicholas Bastin wrote:
...
On Apr 7, 2005, at 11:35 AM, M.-A. Lemburg wrote:
[...]
...
If you do have UTF-16 without a BOM it's much better
to let a short function analyze the text by reading the first
few bytes of the file and then make an educated guess based
on the findings. You can then process the file using one
of the other codecs UTF-16-LE or -BE.
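
A minimal sketch of such a guessing function, written for current
Python 3. The name guess_utf16_codec and the zero-byte heuristic are
illustrative assumptions, not something from the standard library; the
heuristic only works for text that is mostly ASCII/Latin-1, where the
high byte of each code unit is zero.

```python
import codecs

def guess_utf16_codec(sample):
    """Guess a codec name for a UTF-16 byte sample (hypothetical helper).

    Returns "utf-16" if a BOM is present, "utf-16-be" or "utf-16-le"
    based on where the zero bytes fall, or None if undecided.
    """
    # A BOM settles the question; the plain utf-16 codec handles it.
    if sample.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Assumption: mostly ASCII text, so one byte of each pair is zero.
    even_zeros = sample[0::2].count(0)  # high bytes if big-endian
    odd_zeros = sample[1::2].count(0)   # high bytes if little-endian
    if even_zeros > odd_zeros:
        return "utf-16-be"
    if odd_zeros > even_zeros:
        return "utf-16-le"
    return None  # no clear signal; the caller has to decide
```

The caller would read the first few bytes of the file, pass them in,
and open the file with the returned codec name.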
This is about what we do now - we catch UnicodeError and
then add a BOM to the file, and read it again. We know
our files are UTF-16BE if they don't have a BOM, as the
files are written by code which observes the spec.
We can't use UTF-16BE all the time, because sometimes
they're UTF-16LE, and in those cases the BOM is set.
It would be nice if you could optionally specify that the
codec would assume UTF-16BE if no BOM was present,
and not raise UnicodeError in that case, which would
preserve the current behaviour as well as allow users
to ask for behaviour which conforms to the standard.
It should be feasible to implement your own codec for that
based on Lib/encodings/utf_16.py. Simply replace the line
in StreamReader.decode():
   raise UnicodeError,"UTF-16 stream does not start with BOM"
with:
   self.decode = codecs.utf_16_be_decode
and you should be done.
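
For data that is already in memory, the same "assume big-endian when
there is no BOM" rule can be sketched without touching the codec
module. This is an illustration for current Python 3, not the
StreamReader change itself; the function name decode_utf16_default_be
is made up for the example.

```python
import codecs

def decode_utf16_default_be(data, errors="strict"):
    """Decode UTF-16 bytes, assuming big-endian if no BOM is present.

    Hypothetical helper: honours a leading BOM and strips it, otherwise
    falls back to UTF-16BE instead of raising.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le", errors)
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be", errors)
    # No BOM: assume big-endian.
    return data.decode("utf-16-be", errors)
```

This avoids the catch-UnicodeError-and-retry round trip, at the cost
of needing the whole byte string up front rather than a stream.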
...
[...]
Bye,
   Walter Dörwald