fileinput not Unicode compatible? / UTF16 codec problems

Fri Mar 1 12:15:36 EST 2002

Martin von Loewis <loewis at informatik.hu-berlin.de> wrote in message news:<j4n0xscs8b.fsf at informatik.hu-berlin.de>...

> I assume you first split the input into lines, then try the decoding?
> That does not work with UTF-16; you first need to decode, then split
> into lines.

Your assumption is correct :) And your explanation of what the UTF-16
decoder is doing makes a lot of sense.
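
To make the ordering problem concrete, here is a minimal sketch in present-day Python 3 syntax (not the Python 2.2 of this thread). The function names and the sample bytes are made up for illustration. In UTF-16 a newline is two bytes, so splitting the raw bytes on a single newline byte leaves a stray zero byte at the start of the next piece, and that piece no longer decodes.

```python
def decode_then_split(data, encoding="utf-16"):
    # Correct order: decode the whole byte string, then split the text.
    return data.decode(encoding).splitlines()


def split_then_decode(data, encoding="utf-16"):
    # Broken order: split the raw bytes on the newline byte, then decode
    # each piece separately.
    return [chunk.decode(encoding) for chunk in data.split(b"\n")]


# Illustrative input: a little-endian BOM followed by two UTF-16-LE lines.
sample = b"\xff\xfe" + "first line\nsecond line\n".encode("utf-16-le")

print(decode_then_split(sample))

try:
    split_then_decode(sample)
except UnicodeDecodeError as exc:
    print("split-then-decode failed:", exc)
```

The first piece still decodes because it carries the BOM and has an even length; the failure only shows up from the second piece onward.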

> As long as it is fileinput only, I recommend to rewrite your code to
> not use that module; this is probably simpler than rewriting the
> module to support encodings in full generality.

That's what I did. I basically wrote a function that takes a path and
returns a list of strings. It's not the cleanest design right now, but
it works.
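
A function of that shape might look roughly like the sketch below. This is not the code I actually wrote; `read_lines`, its `encoding` parameter and the temporary sample file are hypothetical, and the syntax is present-day Python 3. It reads the whole file as bytes, decodes once, and only then splits into lines.

```python
import os
import tempfile


def read_lines(path, encoding="utf-16"):
    # Hypothetical helper: read all bytes, decode in one go, then split.
    with open(path, "rb") as f:
        data = f.read()
    return data.decode(encoding).splitlines()


# Usage with a throwaway UTF-16 file (BOM + little-endian text).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as f:
        f.write(b"\xff\xfe" + "alpha\nbeta\n".encode("utf-16-le"))
    lines = read_lines(path)

print(lines)
```

Reading everything into memory is fine for small files; for large ones a stream reader is the better fit.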

> You cannot decode UTF-16 on a line-by-line basis. Instead, you need to
> use a stream reader, which will remember the right encoding across
> .read or .readline invocations (only since Python 2.2, AFAIR). The
> most convenient way to open a Unicode stream is to use codecs.open,
> passing the encoding. In case of UTF-16, the endianness will be
> determined on first .read* invocation.

OK, that makes sense. I'll keep that (and your other remarks) in mind
when I refactor what I have now.
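
For reference, a stream-reader version could be sketched as follows, again in present-day Python 3 syntax and with a hypothetical helper name and sample file. It uses `codecs.getreader` to wrap a binary file in a UTF-16 stream reader, which is the same kind of reader that `codecs.open` sets up; the reader picks the byte order from the BOM on the first read and keeps it for later lines.

```python
import codecs
import os
import tempfile


def read_utf16_lines(path):
    # Hypothetical helper: wrap the binary file in a UTF-16 stream reader
    # and let the reader do the line splitting on decoded text.
    with open(path, "rb") as raw:
        reader = codecs.getreader("utf-16")(raw)
        return [line.rstrip("\r\n") for line in reader]


# Usage with a throwaway UTF-16 file (BOM + little-endian text).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as f:
        f.write(b"\xff\xfe" + "alpha\nbeta\n".encode("utf-16-le"))
    stream_lines = read_utf16_lines(path)

print(stream_lines)
```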

Thanks a lot for your help!

Jurie Horneman