How do you read unicode files?
Martin v. Loewis
martin at v.loewis.de
Fri Jun 7 08:11:09 CEST 2002
"Matt Gerrans" <mgerrans at mindspring.com> writes:
> How do you read in a unicode file and convert it to a standard string?
Depends on what you mean by "unicode file". A file always uses some
byte-oriented encoding; the term "unicode file" does not, by itself,
say which encoding that is. People usually associate UTF-8 and UTF-16
(with and without a BOM) with that term.
> It seems that when you open a file and read it, what you get is a
> string of single-byte characters. I've tried all kinds of
> permutations of calls to unicode(), decode(), encode(), etc. with
> different flavors of encoding ('utf-8', 'utf-16' and so on).
The best way to read such a file is to use codecs.open. If you already
have the byte data, _decoding_ it to Unicode, with the unicode builtin,
is the right thing to do.
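Both paths can be sketched in today's Python, where bytes.decode has
taken over the role the unicode builtin played at the time; the sample
file and its contents are made up for the demonstration:

```python
import codecs

# Create a small UTF-16 file to work with (demonstration only).
with open('sample.txt', 'wb') as f:
    f.write(u'h\u00e9llo'.encode('utf-16'))  # writes a BOM first

# Path 1: let codecs.open decode while reading.
with codecs.open('sample.txt', 'r', encoding='utf-16') as f:
    text = f.read()

# Path 2: read raw bytes, then decode explicitly.
# (Python 2 spelling: unicode(raw, 'utf-16'))
with open('sample.txt', 'rb') as f:
    raw = f.read()
decoded = raw.decode('utf-16')

assert text == decoded == u'h\u00e9llo'
```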
> I could parse the data myself (skipping the initial two bytes and then every
> other one -- I'm only working with ASCII in double byte format, so the high
> order byte is always 0), but I imagine there must be a way to get the
> existing tools to work.
Ah, so there are two initial bytes. That most certainly means that you
have a file in UTF-16 with BOM; so you should use "utf-16" as the
encoding name (without BOM, you need to know the byte order, and
specify utf-16be or utf-16le respectively).
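The role of the BOM can be checked directly; this sketch uses
hand-written byte strings rather than the poster's actual file:

```python
# U+0041 'A' then U+0042 'B' in three UTF-16 spellings.
with_bom = b'\xff\xfeA\x00B\x00'   # little-endian BOM, then "AB"
le_only  = b'A\x00B\x00'           # no BOM: caller must know the order
be_only  = b'\x00A\x00B'           # big-endian, also without a BOM

assert with_bom.decode('utf-16') == u'AB'     # BOM selects the order
assert le_only.decode('utf-16-le') == u'AB'   # order stated explicitly
assert be_only.decode('utf-16-be') == u'AB'
```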
> What I want to be able to do is write a search and replace tool that
> will work equally well on ANSI and Unicode (or double-byte) text
> files (without changing the file type, of course)...
Then codecs.open is the right solution; for "ANSI" files, the encoding
is "mbcs" (neither "ANSI" nor "mbcs" is a particularly well-chosen
name for what it actually denotes).
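Such a tool might be sketched as follows: the encoding name is the
only knob that varies per file type ("utf-16" here; "mbcs" on Windows
for "ANSI" files). The helper name and file name are invented for the
sketch:

```python
import codecs

def replace_in_file(path, old, new, encoding):
    # Decode on the way in, encode on the way out: the file keeps its
    # original encoding (and, for utf-16, regains a BOM when written).
    with codecs.open(path, 'r', encoding=encoding) as f:
        text = f.read()
    with codecs.open(path, 'w', encoding=encoding) as f:
        f.write(text.replace(old, new))

# Demonstration with a throwaway UTF-16 file.
with open('demo.txt', 'wb') as f:
    f.write(u'spam and spam'.encode('utf-16'))
replace_in_file('demo.txt', u'spam', u'eggs', 'utf-16')

with open('demo.txt', 'rb') as f:
    result = f.read().decode('utf-16')
assert result == u'eggs and eggs'
```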