[Python-Dev] Unicode byte order mark decoding

Thu Apr 7 22:27:03 CEST 2005

On Apr 7, 2005, at 11:35 AM, M.-A. Lemburg wrote:

> Ok, but I don't really follow you here: you are suggesting to
> relax the current UTF-16 behavior and to start defaulting to
> UTF-16-BE if no BOM is present - that's most likely going to
> cause more problems than it seems to solve: namely complete
> garbage if the data turns out to be UTF-16-LE encoded and,
> what's worse, enters the application undetected.

The crux of my argument is that the spec declares that UTF-16 without a 
BOM is BE.  If the file is encoded in UTF-16LE and it doesn't have a 
BOM, it doesn't deserve to be processed correctly.  That being said, 
treating it as UTF-16BE if it's LE will result in a lot of invalid code
points, so it should be obvious that something has gone wrong.

> If you do have UTF-16 without a BOM it's much better
> to let a short function analyze the text by reading the first
> few bytes of the file and then make an educated guess based
> on the findings. You can then process the file using one
> of the other codecs UTF-16-LE or -BE.
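
For reference, a minimal sketch of the kind of guessing function
described above, assuming the input is already known to lack a BOM and
is mostly Latin-range text (so one byte of each 16-bit unit is zero).
The name guess_utf16_codec is hypothetical, not an existing API:

```python
def guess_utf16_codec(sample):
    """Guess the byte order of BOM-less UTF-16 data (hypothetical helper).

    Assumes mostly Latin-range text: the high byte of each code unit is
    then zero, and it comes first in big-endian, second in little-endian.
    """
    even_zeros = sample[0::2].count(0)  # zero bytes at offsets 0, 2, 4, ...
    odd_zeros = sample[1::2].count(0)   # zero bytes at offsets 1, 3, 5, ...
    if even_zeros > odd_zeros:
        return "utf-16-be"
    if odd_zeros > even_zeros:
        return "utf-16-le"
    return None  # no evidence either way; the caller has to decide
```

The caller would pass the first few bytes of the file and then decode
the whole file with the returned codec name. It is only a heuristic and
can guess wrong for text outside the Latin range.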

This is about what we do now - we catch UnicodeError and then add a BOM 
to the file, and read it again.  We know our files are UTF-16BE if they 
don't have a BOM, as the files are written by code which observes the 
spec.  We can't use UTF-16BE all the time, because sometimes they're 
UTF-16LE, and in those cases the BOM is set.
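
Roughly, the workaround looks like the sketch below. It is not our
actual code: the function name is made up, it works on bytes in memory
rather than on a file, and it assumes a Python 3 interpreter, where the
incremental "utf-16" decoder is the variant that raises UnicodeError
when the input does not start with a BOM.

```python
import codecs

def read_with_bom_retry(data):
    """Decode UTF-16 bytes, retrying with a big-endian BOM prepended.

    Hypothetical helper illustrating the workaround, not a library API.
    """
    decoder = codecs.getincrementaldecoder("utf-16")()
    try:
        # Succeeds when the data starts with a BOM (either byte order).
        return decoder.decode(data, final=True)
    except UnicodeError:
        # No BOM: the files are known to be big-endian in that case, so
        # add the BE BOM and decode again.
        return (codecs.BOM_UTF16_BE + data).decode("utf-16")
```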

It would be nice if you could optionally specify that the codec should
assume UTF-16BE if no BOM is present, and not raise UnicodeError in
that case. That would preserve the current behaviour as well as allow
users to ask for behaviour which conforms to the standard.
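
To make the request concrete, here is a sketch of the behaviour as a
standalone function. The name decode_utf16 and its "default" parameter
are hypothetical; no such option exists in the codec today.

```python
import codecs

def decode_utf16(data, default="utf-16-be"):
    """Decode UTF-16 bytes, honouring a BOM and otherwise using `default`.

    Hypothetical helper: pass default=None to get a UnicodeError for
    BOM-less input instead of the big-endian fallback.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")  # strip the 2-byte BOM
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if default is None:
        raise UnicodeError("UTF-16 data does not start with a BOM")
    return data.decode(default)
```

With default=None the caller still gets an error for BOM-less input,
while the default of "utf-16-be" gives the behaviour the spec describes.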

I'm not saying that you can't work around the issue now; what I'm
saying is that you shouldn't *have* to. I think there is a reasonable
expectation that the UTF-16 codec conforms to the spec, and users who
want it to do something else are the ones who should be forced to come
up with a workaround.

--
Nick