[Tutor] unicode utf-16 and readlines [using the 'codecs' unicode
file reading module]
Danny Yoo
dyoo@hkn.eecs.berkeley.edu
Sat Jan 4 22:23:01 2003
On Sat, 4 Jan 2003, Poor Yorick wrote:
> On Windows 2000, Python 2.2.1 open.readlines seems to read lines
> incorrectly when the file is encoded utf-16. For example:
>
> >>> fh = open('0022data2.txt')
> >>> a = fh.readlines()
> >>> print a
> ['\xff\xfe\xfaQ\r\x00\n', '\x00']
Hi Poor Yorick,
A plain open() gives you the raw bytes of the file, not decoded
characters. You may want to use a "codec" to decode Unicode from a file.
The 'codecs' module is specifically designed for this:
http://www.python.org/doc/lib/module-codecs.html
For example:
###
>>> import codecs
>>> f = codecs.open('foo.txt', 'w', 'utf-16')
>>> f.write("hello world")
>>> f.close()
>>> open('foo.txt').read()
'\xff\xfeh\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00'
>>> f2 = codecs.open('foo.txt', 'r', 'utf-16')
>>> f2.readlines()
[u'hello world']
###
> In this example, Python seems to have incorrectly parsed the \n\r
> characters at the end of the line.
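As far as I can tell, nothing was misparsed. In UTF-16 each of these
characters takes two bytes, so the newline is stored as '\n\x00', and a
byte-level readlines() cuts right after the '\n' byte and leaves the '\x00'
behind. Here is a minimal sketch of that. It assumes the file is
little-endian UTF-16 with a byte order mark, which is what the '\xff\xfe' at
the front suggests, and it uses io.BytesIO (not available in Python 2.2) as a
stand-in for the file. The names raw, byte_lines and text_lines are made up
for the illustration.

```python
import codecs
import io

# One character (U+51FA) plus a Windows line ending, encoded as a
# little-endian BOM followed by UTF-16-LE. This is an assumption about
# how the file in the question was written.
text = u"\u51fa\r\n"
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Reading lines from the undecoded bytes splits after the '\n' byte,
# which is only the first half of the two-byte UTF-16 newline.
byte_lines = io.BytesIO(raw).readlines()
print(byte_lines)

# Decoding first, then splitting, keeps the line in one piece.
text_lines = raw.decode("utf-16").splitlines(True)
print(text_lines)
```

The first list has the same two pieces as the output in the question; the
second has a single Unicode line.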
If you use 'codecs' and its open() function, you should be all set. I saw
a brief mention of it in a Unicode tutorial here:
http://www.reportlab.com/i18n/python_unicode_tutorial.html
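For anyone on a later Python than the 2.2.1 in the question (2.6 and up, or
any Python 3), io.open() takes an encoding argument and does the same kind of
decoding. The following is only a rough sketch under that assumption: the
temporary directory, the file name and the two sample lines are invented for
the example.

```python
import io
import os
import tempfile

# Hypothetical scratch file; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "foo.txt")

# Write two lines with Windows line endings as UTF-16.
# newline="" turns off newline translation, so '\r\n' is stored as given.
with io.open(path, "w", encoding="utf-16", newline="") as f:
    f.write(u"hello world\r\nsecond line\r\n")

# Read the file back as text. The decoder consumes the byte order mark,
# and the default newline handling turns '\r\n' into '\n'.
with io.open(path, "r", encoding="utf-16") as f:
    lines = f.readlines()

print(lines)
```

Each element of lines is a whole decoded line, with no stray '\x00' left
over at a line boundary.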
I always wanted to know what 'codecs' did. Now I know. Cool. *grin*
Thanks for the question!