How do you read unicode files?
Mike C. Fletcher
mcfletch at rogers.com
Fri Jun 7 07:15:32 CEST 2002
Well, on my win2k machine, I can create a text file using Notepad,
specify that it's in "Unicode" format (UTF-16), and load it with:
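unicode( open( filename, 'rb').read(), 'utf16' )
That works even if there are ANSI characters above 127, and even if I specify
"big-endian unicode" in Notepad (the utf16 codec reads the byte-order mark at
the start of the file to decide the byte order). Opening in binary mode ('rb')
keeps Windows text-mode handling from touching the raw bytes.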
unicode( open( filename,'rb').read(), 'utf8' )
works if I specify "UTF-8" format in notepad.
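For anyone on a newer Python, here is a minimal sketch of the same read-then-decode step. It assumes Python 3, where unicode() no longer exists and bytes.decode() does the job; read_text and the sample file are hypothetical names made up for the illustration, not anything from the original session. The 'utf-16' codec uses the byte-order mark to pick the byte order, and 'utf-8-sig' drops the mark that Notepad puts at the front of its UTF-8 files.

```python
import codecs
import os
import tempfile

def read_text(path, encoding):
    # Read the raw bytes in binary mode, then decode them to a text string.
    with open(path, "rb") as f:
        return f.read().decode(encoding)

# Build a small UTF-16 file with a byte-order mark, similar to what
# Notepad's "Unicode" format produces (sample content is made up).
sample = "caf\u00e9 \u20ac"
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF16_LE + sample.encode("utf-16-le"))

text = read_text(path, "utf-16")   # BOM is consumed by the codec
os.remove(path)

# The same codec handles big-endian data, and 'utf-8-sig' strips a UTF-8 BOM.
big_endian = (codecs.BOM_UTF16_BE + sample.encode("utf-16-be")).decode("utf-16")
utf8_text = (codecs.BOM_UTF8 + sample.encode("utf-8")).decode("utf-8-sig")
```
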
Depending on what format you want for the "standard string", you'd then
just call, for instance, .encode( 'utf8' ) on the resulting unicode object.
Here's a sample session:
>>> data = open( filename,'rb').read()
>>> u = unicode( data, 'utf16' )
>>> u.encode( 'utf8')
>>> u.encode( 'iso8859-1' )
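That last one is the unicode re-encoded as a simple Python string in a plain,
Windows-native encoding (well, my Windows-native encoding ;) ). Note that
.encode() raises an error if the text contains a character the target
encoding can't represent.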
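For the search-and-replace tool mentioned in the question, one possible approach is to decode, replace on the decoded text, and encode back with whatever encoding the file started in. The following is only a rough sketch under several assumptions: Python 3, detection based purely on the byte-order mark, and cp1252 as a guessed fallback for "ANSI" files. detect_encoding and replace_in_bytes are hypothetical helper names, and UTF-32 or BOM-less UTF-16 files are not handled.

```python
import codecs

def detect_encoding(raw, default="cp1252"):
    # Choose a codec from the byte-order mark; with no mark, fall back to
    # an assumed ANSI code page (cp1252 here is a guess, not a certainty).
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    return default

def replace_in_bytes(raw, old, new, default="cp1252"):
    # Decode, replace on text, then re-encode in the original file type.
    encoding = detect_encoding(raw, default)
    text = raw.decode(encoding).replace(old, new)
    if encoding == "utf-16":
        # Reuse the original mark so the byte order does not change.
        bom = raw[:2]
        specific = "utf-16-le" if bom == codecs.BOM_UTF16_LE else "utf-16-be"
        return bom + text.encode(specific)
    # 'utf-8-sig' writes the UTF-8 mark back; cp1252 has no mark.
    return text.encode(encoding)

# Made-up sample data in three of the file types Notepad can write.
utf16_in = codecs.BOM_UTF16_BE + "hello world".encode("utf-16-be")
utf16_out = replace_in_bytes(utf16_in, "world", "there")
ansi_out = replace_in_bytes(b"hello world", "world", "there")
utf8_out = replace_in_bytes(codecs.BOM_UTF8 + b"hello world", "world", "there")
```

To use it on a real file you would read the bytes with open( filename, 'rb' ), pass them through replace_in_bytes, and write the result back with open( filename, 'wb' ).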
Matt Gerrans wrote:
> How do you read in a unicode file and convert it to a standard string?
> It seems that when you open a file and read it, what you get is a string of
> single-byte characters. I've tried all kinds of permutations of calls to
> unicode(), decode(), encode(), etc. with different flavors of encoding
> ('utf-8', 'utf-16' and so on).
> I could parse the data myself (skipping the initial two bytes and then every
> other one -- I'm only working with ASCII in double byte format, so the high
> order byte is always 0), but I imagine there must be a way to get the
> existing tools to work.
> What I want to be able to do is write a search and replace tool that will
> work equally well on ANSI and Unicode (or double-byte) text files (without
> changing the file type, of course)...
Mike C. Fletcher