How do you read unicode files?

Mike C. Fletcher mcfletch at
Fri Jun 7 07:15:32 CEST 2002

Well, on my win2k machine, I can create a text file using notepad and 
specify that it's in "Unicode" format (utf16) and load it with:

unicode( open( filename, 'rb').read(), 'utf16' )

That works even if there are ANSI characters > 127, and even if I specify 
"big-endian unicode" in notepad.

unicode( open( filename,'r').read(), 'utf8' )

works if I specify "UTF-8" format in notepad.
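
The big-endian case works because Notepad puts a byte-order mark at the
start of the file, and the utf16 codec uses it to pick the byte order.
A minimal sketch of that, assuming Python 3 (where bytes.decode() stands
in for unicode()) and a made-up sample string:

```python
import codecs

# Hypothetical sample text, not taken from a real file.
text = 'Testing unicode\r\n\xe1\xed'

# Build both byte layouts by hand: BOM first, then the encoded text.
little = codecs.BOM_UTF16_LE + text.encode('utf-16-le')
big = codecs.BOM_UTF16_BE + text.encode('utf-16-be')

# The plain 'utf-16' codec consumes the BOM and takes the byte order from it,
# so the same call decodes either layout.
decoded_little = little.decode('utf-16')
decoded_big = big.decode('utf-16')
```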

Depending on what format you want for the "standard string", you'd then 
just call, for instance, .encode( 'utf8' ) on the resulting unicode object.
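
For comparison, a rough sketch of the same read-then-encode steps,
assuming Python 3, where the decoded result is a str and the encoded
results are bytes. The read_text helper and the temporary file are only
illustrative stand-ins:

```python
import os
import tempfile

def read_text(path, encoding):
    """Read raw bytes from path and decode them to text."""
    # Binary mode hands back the bytes untouched (no newline translation).
    with open(path, 'rb') as f:
        return f.read().decode(encoding)

# Throwaway file standing in for one saved from Notepad as "Unicode".
sample = 'Testing unicode\r\n\xe1\xed'
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(sample.encode('utf-16'))  # the 'utf-16' codec writes a BOM first

u = read_text(path, 'utf-16')
as_utf8 = u.encode('utf-8')
as_latin1 = u.encode('iso8859-1')
os.remove(path)
```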

Here's a sample session:
 >>> data = open( filename,'rb').read()
 >>> data
 >>> u = unicode( data, 'utf16' )
 >>> u
u'Testing unicode\r\n\xe1\xed'
 >>> u.encode( 'utf8')
'Testing unicode\r\n\xc3\xa1\xc3\xad'
 >>> u.encode( 'iso8859-1' )
'Testing unicode\r\n\xe1\xed'

That last is a plain, windows-native-encoding (well, my windows-native 
encoding ;) ) of the unicode as a simple Python string.
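
As for the search and replace tool, one possible approach is to look at
the byte-order mark, decode, do the replacement on the unicode text, and
encode back the same way so the file type doesn't change. This is a
sketch only, assuming Python 3; replace_in_bytes and the iso8859-1
fallback for files without a byte-order mark are my own assumptions:

```python
import codecs

# (BOM, codec for the bytes that follow the BOM), checked in this order.
_BOMS = [
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]

def replace_in_bytes(data, old, new, fallback='iso8859-1'):
    """Replace old with new in encoded file contents, keeping the encoding."""
    for bom, codec in _BOMS:
        if data.startswith(bom):
            # Decode everything after the BOM, edit, then put the BOM back.
            text = data[len(bom):].decode(codec)
            return bom + text.replace(old, new).encode(codec)
    # No BOM found: assume single-byte "ANSI" text (the fallback is a guess).
    return data.decode(fallback).replace(old, new).encode(fallback)
```

Read the file in 'rb' mode, pass the bytes through replace_in_bytes, and
write the result back in 'wb' mode. UTF-32 files are not handled, and a
UTF-8 file without a BOM is treated as single-byte text.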


Matt Gerrans wrote:
> How do you read in a unicode file and convert it to a standard string?
> It seems that when you open a file and read it, what you get is a string of
> single-byte characters.   I've tried all kinds of permutations of calls to
> unicode(), decode(), encode(), etc. with different flavors of encoding
> ('utf-8',  'utf-16' and so on).
> I could parse the data myself (skipping the initial two bytes and then every
> other one -- I'm only working with ASCII in double byte format, so the high
> order byte is always 0), but I imagine there must be a way to get the
> existing tools to work.
> What I want to be able to do is write a search and replace tool that will
> work equally well on ANSI and Unicode (or double-byte) text files (without
> changing the file type, of course)...

   Mike C. Fletcher
