fileinput not Unicode compatible? / UTF16 codec problems
Martin von Loewis
loewis at informatik.hu-berlin.de
Fri Mar 1 14:50:12 CET 2002
jhorneman at pobox.com (Jurie Horneman) writes:
> Is it possible that the fileinput module is not Unicode compatible?
That is certainly possible. You'd need to tell it the encoding for
opening files; that is currently not supported.
> Because I have a little endian 16-bit Unicode file and have trouble
> reading it in. Decoding it with the UTF16 LE decoder gives me a
> 'truncated data' error.
I assume you first split the input into lines, then try the decoding?
That does not work with UTF-16: you first need to decode the bytes,
then split the resulting text into lines.
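A minimal sketch of the decode-first approach (the file name and
contents are made up for illustration): read the raw bytes, decode the
whole stream, then split.

```python
import tempfile, os

# Create a hypothetical little-endian UTF-16 file for demonstration.
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "wb") as f:
    f.write(u"first line\nsecond line\n".encode("utf-16-le"))

# Decode the full byte stream first, then split into lines.
with open(path, "rb") as f:
    text = f.read().decode("utf-16-le")
lines = text.splitlines()
print(lines)  # ['first line', 'second line']
```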
> Oddly, this problem doesn't occur for every line.
No, but it will occur for every second line. The UTF-16 decoder
complains if you don't give it an even number of bytes. After the
first line is read, the second will (incorrectly) start with a NUL
byte, which pads that line to an even number of bytes again. Decoding
it as UTF-16 will succeed, but will give you garbage: the wrong bytes
get grouped together to form each character.
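The failure mode can be reproduced directly. In UTF-16-LE, '\n' is
the two bytes 0A 00; a byte-oriented readline stops after the 0A, so
the first line has an odd length and the stray 00 leaks into the
next line. A sketch with made-up data:

```python
import io

data = u"first\nsecond\n".encode("utf-16-le")
buf = io.BytesIO(data)

# readline() splits after the 0x0A byte, leaving an odd-length chunk.
line1 = buf.readline()   # 11 bytes
try:
    line1.decode("utf-16-le")
except UnicodeDecodeError as e:
    print("line 1:", e.reason)   # "truncated data"

# The next line starts with the leftover 0x00, so it is even-length
# again -- it decodes without error, but the byte pairs are shifted
# by one and the result is garbage, not "second\n".
line2 = buf.readline()   # 14 bytes
print("line 2:", repr(line2.decode("utf-16-le")))
```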
> Is there a solution for this, apart from rewriting a number of modules
As long as it is only fileinput, I recommend rewriting your code not
to use that module; this is probably simpler than rewriting the
module to support encodings in full generality.
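For the common case, the part of fileinput you need is small. A
hedged sketch of a replacement (the name utf16_input and the sample
files are inventions for this example, not an existing API):

```python
import codecs, os, tempfile

# Hypothetical stand-in for fileinput.input(): iterate over the named
# files, letting codecs.open handle UTF-16 decoding file by file.
def utf16_input(filenames, encoding="utf-16"):
    for name in filenames:
        with codecs.open(name, "r", encoding=encoding) as f:
            for line in f:
                yield line

# Demonstration with two throwaway UTF-16 files.
d = tempfile.mkdtemp()
paths = []
for i, text in enumerate([u"alpha\n", u"beta\n"]):
    p = os.path.join(d, "f%d.txt" % i)
    with codecs.open(p, "w", encoding=encoding_ if False else "utf-16") as f:
        f.write(text)
    paths.append(p)

result = list(utf16_input(paths))
print(result)
```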
Of course, patches will be welcome; if you do change fileinput, please
submit a patch to sf.net/projects/python.
> Is there any documentation on which Python modules are Unicode-aware
> or not?
Not that I'm aware of. In most cases, once issues become known, the
problems will be corrected rather than documented.
> Oh, and how does one handle big endian / little endian Unicode when
> the UTF16 decoders look for BOMs at the start of each string, but I
> only have one at the start of the file? There seems to be no way for me
> to tell it which endianness I have, apart from circumventing the codec
> and calling the right version myself.
You cannot decode UTF-16 on a line-by-line basis. Instead, you need to
use a stream reader, which will remember the right encoding across
.read or .readline invocations (only since Python 2.2, AFAIR). The
most convenient way to open a Unicode stream is to use codecs.open,
passing the encoding. In case of UTF-16, the endianness will be
determined on first .read* invocation.
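Putting that together, a sketch using codecs.open (file name invented;
when written with the generic "utf-16" codec, the BOM is emitted at
the start and consumed again on reading):

```python
import codecs, os, tempfile

# Write a hypothetical UTF-16 file; the "utf-16" codec adds the BOM.
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with codecs.open(path, "w", encoding="utf-16") as f:
    f.write(u"first\nsecond\n")

# The stream reader consumes the BOM on the first read and remembers
# the endianness across .readline() calls, so line-by-line reading
# works without per-line BOMs.
with codecs.open(path, "r", encoding="utf-16") as f:
    lines = f.readlines()
for line in lines:
    print(line.rstrip(u"\n"))
```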