[Tutor] unicode utf-16 and readlines [using the 'codecs' unicode file reading module]

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Sat Jan 4 22:23:01 2003


On Sat, 4 Jan 2003, Poor Yorick wrote:

> On Windows 2000, Python 2.2.1 open.readlines seems to read lines
> incorrectly when the file is encoded utf-16.  For example:
>
>  >>> fh = open('0022data2.txt')
>  >>> a = fh.readlines()
>  >>> print a
> ['\xff\xfe\xfaQ\r\x00\n', '\x00']

Hi Poor Yorick,


You may want to use a "codec" to decode Unicode from a file.  The 'codecs'
module is specifically designed for this:

    http://www.python.org/doc/lib/module-codecs.html


For example:

###
>>> import codecs
>>> f = codecs.open('foo.txt', 'w', 'utf-16')
>>> f.write(u"hello world")
>>> f.close()
>>> open('foo.txt', 'rb').read()    # raw bytes: BOM, then 2 bytes per char
'\xff\xfeh\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00'
>>> f2 = codecs.open('foo.txt', 'r', 'utf-16')
>>> f2.readlines()
[u'hello world']
###
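
Since your file has more than one line, here is a rough sketch of the same
round trip with line endings in it.  The scratch path and the sample text are
made up for illustration, and I've written it so that it runs on a current
Python 3 interpreter rather than 2.2, so treat it as a sketch and not as the
one true way to do it:

```python
import codecs
import os
import tempfile

# Hypothetical scratch file; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'lines.txt')

# Write two lines through the utf-16 codec.  The codec emits a BOM first.
f = codecs.open(path, 'w', 'utf-16')
f.write(u'first line\r\nsecond line\r\n')
f.close()

# Look at the raw bytes: they start with a two-byte BOM.
raw = open(path, 'rb').read()

# Read the file back through the same codec and split it into lines.
f2 = codecs.open(path, 'r', 'utf-16')
lines = f2.readlines()
f2.close()

print(repr(raw[:2]))
print(lines)
```

The lines come back as Unicode strings with their '\r\n' endings intact,
because codecs.open() does no newline translation of its own.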


> In this example, Python seems to have incorrectly parsed the \n\r
> characters at the end of the line.

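The plain open() knows nothing about utf-16: it splits on the single byte
'\n', but in your file that newline is stored as the two bytes '\n\x00', so
the '\x00' gets stranded at the start of the next line.  If you use 'codecs'
and its open() function, you should be all set.  I saw a brief mention of it
in a Unicode tutorial here: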

    http://www.reportlab.com/i18n/python_unicode_tutorial.html
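
To see where the stray '\x00' in your output comes from, here is a small
sketch that feeds the exact bytes from your report through both kinds of
reader.  It uses in-memory streams instead of a real file, and it's written
for a current Python 3 interpreter; both of those are my own choices for the
illustration:

```python
import codecs
import io

# The bytes from the original report, joined back together:
# BOM (ff fe), U+51FA (fa 51, printed as '\xfaQ'), '\r' (0d 00), '\n' (0a 00).
raw = b'\xff\xfe\xfaQ\r\x00\n\x00'

# A byte-oriented readlines() splits right after the 0x0a byte,
# which strands the second half of the two-byte newline.
byte_lines = io.BytesIO(raw).readlines()
print(byte_lines)

# A utf-16 stream reader decodes the bytes first, then splits on the
# decoded line ending, so the line stays in one piece.
reader = codecs.getreader('utf-16')(io.BytesIO(raw))
text_lines = reader.readlines()
print(text_lines)
```

The first list has the same two broken pieces you saw; the second has a
single line holding one character followed by '\r\n'.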

I always wanted to know what 'codecs' did.  Now I know.  Cool.  *grin*


Thanks for the question!