Nicholas Bastin wrote:
On Apr 7, 2005, at 5:07 AM, M.-A. Lemburg wrote:
The current implementation of the utf-16 codecs makes for some irritating gymnastics: you have to write the BOM into the file yourself before reading it if the file contains no BOM, which seems quite like a bug in the codec.
The codec writes a BOM in the first call to .write() - it doesn't write a BOM before reading from the file.
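For what it's worth, the write-side behavior is easy to observe with the stdlib `codecs` module (a minimal sketch; the byte checks assume a mostly-ASCII payload):

```python
import codecs
import os
import tempfile

# Create a scratch file and write to it through the "utf-16" codec.
fd, path = tempfile.mkstemp()
os.close(fd)

with codecs.open(path, "w", encoding="utf-16") as f:
    f.write(u"abc")  # first .write() -> BOM is emitted, then the data
    f.write(u"def")  # subsequent writes -> data only, no second BOM

with open(path, "rb") as f:
    raw = f.read()

# Exactly one BOM at the start (either LE b'\xff\xfe' or BE b'\xfe\xff'),
# and no further BOM sequence anywhere in the rest of the file.
assert raw[:2] in (b"\xff\xfe", b"\xfe\xff")
assert b"\xff\xfe" not in raw[2:] and b"\xfe\xff" not in raw[2:]

os.remove(path)
```

On reading, the same codec consumes a leading BOM if present and uses it to pick the byte order; it does not add anything to the file.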
Yes, see, I read a *lot* of UTF-16 that comes from other sources. It's not a matter of writing with python and reading with python.
Ok, but I don't really follow you here: you are suggesting to relax the current UTF-16 behavior and start defaulting to UTF-16-BE if no BOM is present - that's most likely going to cause more problems than it seems to solve: namely, complete garbage if the data turns out to be UTF-16-LE encoded and, what's worse, data that enters the application undetected.
If you do have UTF-16 without a BOM, it's much better to let a short function analyze the text by reading the first few bytes of the file and then make an educated guess based on the findings. You can then process the file using one of the other codecs, UTF-16-LE or UTF-16-BE.
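Such a guessing function might look like the sketch below (a hypothetical helper, not part of the codecs module; the heuristic assumes the text is mostly ASCII, so one byte of each code unit is NUL):

```python
def guess_utf16_codec(data):
    """Guess which UTF-16 codec to use for a byte string (hypothetical helper).

    If a BOM is present, trust it.  Otherwise use a simple heuristic:
    for mostly-ASCII text, UTF-16-LE puts the zero byte at odd offsets
    (b"A\\x00") while UTF-16-BE puts it at even offsets (b"\\x00A").
    """
    if data[:2] == b"\xff\xfe":
        return "utf-16-le"
    if data[:2] == b"\xfe\xff":
        return "utf-16-be"
    sample = data[:4096]
    # Count NUL bytes at even vs. odd offsets in the sample.
    even_zeros = sample[0::2].count(0)
    odd_zeros = sample[1::2].count(0)
    return "utf-16-le" if odd_zeros > even_zeros else "utf-16-be"
```

You'd then decode with e.g. `data.decode(guess_utf16_codec(data))`, or pass the returned codec name to `codecs.open()` when reprocessing the file.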