[Python-Dev] Unicode byte order mark decoding
Walter Dörwald
walter at livinglogic.de
Thu Apr 7 23:32:28 CEST 2005
Nicholas Bastin said:
> On Apr 7, 2005, at 11:35 AM, M.-A. Lemburg wrote:
>
> [...]
>> If you do have UTF-16 without a BOM it's much better
>> to let a short function analyze the text by reading the first
>> few bytes of the file and then make an educated guess based
>> on the findings. You can then process the file using one
>> of the other codecs UTF-16-LE or -BE.
>
> This is about what we do now - we catch UnicodeError and
> then add a BOM to the file, and read it again. We know
> our files are UTF-16BE if they don't have a BOM, as the
> files are written by code which observes the spec.
> We can't use UTF-16BE all the time, because sometimes
> they're UTF-16LE, and in those cases the BOM is set.
>
> It would be nice if you could optionally specify that the
> codec would assume UTF-16BE if no BOM was present,
> and not raise UnicodeError in that case, which would
> preserve the current behaviour as well as allow users
> to ask for behaviour which conforms to the standard.
It should be feasible to implement your own codec for that
based on Lib/encodings/utf_16.py. Simply replace the line
in StreamReader.decode():

    raise UnicodeError,"UTF-16 stream does not start with BOM"

with:

    self.decode = codecs.utf_16_be_decode

and you should be done.
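
For illustration, here is a rough sketch of the same idea as a
standalone reader, written against the Python 3 codecs API rather
than the utf_16.py source discussed above. The names
Utf16DefaultBEStreamReader and read_utf16 are made up for this
example, and the BOM is checked by hand instead of patching the
stdlib method, so treat it as one possible approach under those
assumptions, not as the actual stdlib change.

```python
import codecs
import io


class Utf16DefaultBEStreamReader(codecs.StreamReader):
    """Hypothetical reader: honour a BOM if present, else assume UTF-16-BE."""

    def reset(self):
        super().reset()
        # Forget the byte order chosen earlier so the BOM is checked again.
        self.__dict__.pop("decode", None)

    def decode(self, input, errors="strict"):
        data = bytes(input)
        if len(data) < 2:
            # Not enough bytes to look for a BOM yet; consume nothing.
            return "", 0
        if data.startswith(codecs.BOM_UTF16_LE):
            decoder, skip = codecs.utf_16_le_decode, 2
        elif data.startswith(codecs.BOM_UTF16_BE):
            decoder, skip = codecs.utf_16_be_decode, 2
        else:
            # No BOM: fall back to big-endian instead of raising UnicodeError.
            decoder, skip = codecs.utf_16_be_decode, 0
        # Later chunks go straight to the chosen decoder.
        self.decode = decoder
        text, consumed = decoder(data[skip:], errors)
        return text, consumed + skip


def read_utf16(raw):
    """Decode a bytes object through the reader above (helper for the example)."""
    return Utf16DefaultBEStreamReader(io.BytesIO(raw)).read()
```

With this, read_utf16() accepts BOM-less big-endian data as well as
data that starts with either BOM, which is the behaviour Nicholas
asked for.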
> [...]
Bye,
Walter Dörwald