[Tutor] unicode utf-16 and readlines [using the 'codecs' unicode file reading module]
Poor Yorick
gp@pooryorick.com
Tue Jan 7 01:01:02 2003
Danny Yoo wrote:
>
>
>You may want to use a "codec" to decode Unicode from a file. The 'codecs'
>module is specifically designed for this:
>
> http://www.python.org/doc/lib/module-codecs.html
>
>
>For example:
>
>###
>
>>>> import codecs
>>>> f = codecs.open('foo.txt', 'w', 'utf-16')
>>>> f.write("hello world")
>>>> f.close()
>>>> open('foo.txt').read()
>'\xff\xfeh\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00'
>
>>>> f2 = codecs.open('foo.txt', 'r', 'utf-16')
>>>> f2.readlines()
>[u'hello world']
>###
>
Thanks for the info! Using the codecs module is much better. It's
interesting to note, though, that when using the codecs module on a real
UTF-16 text file, Python's automatic handling of newline characters
seems to break down: the '\r\n' line ending comes through untranslated.
For example:
>>> import codecs
>>> fh = codecs.open('0022data2.txt', 'r', 'utf-16')
>>> a = fh.read()
>>> a
u'\u51fa\r\n'
>>> print a
??
>>> a = a.strip()
>>> print a
?
>>>
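
A likely explanation, offered as a sketch rather than a diagnosis of this
particular file: the codecs documentation says codecs.open always opens the
underlying file in binary mode and does no automatic newline conversion, so a
'\r\n' written on Windows is decoded as-is. The example below assumes a newer
Python that provides io.open (not available when this thread was written) and
uses a hypothetical sample file created in a temporary directory. It compares
a binary read followed by a decode against a text-mode read that translates
newlines after decoding.

```python
import io
import os
import tempfile


def write_sample(path):
    # Store U+51FA followed by a Windows line ending, encoded as UTF-16 with a BOM.
    with io.open(path, 'wb') as fh:
        fh.write(u'\u51fa\r\n'.encode('utf-16'))


def read_raw(path):
    # Binary read, then decode: the '\r\n' pair survives untouched.
    with io.open(path, 'rb') as fh:
        return fh.read().decode('utf-16')


def read_translated(path):
    # Text-mode read: the bytes are decoded first, then '\r\n' becomes '\n'.
    with io.open(path, 'r', encoding='utf-16') as fh:
        return fh.read()


# Hypothetical file name, used only for this illustration.
sample_path = os.path.join(tempfile.mkdtemp(), 'utf16-sample.txt')
write_sample(sample_path)

raw = read_raw(sample_path)
translated = read_translated(sample_path)

print(raw.endswith(u'\r\n'))         # carriage return is still present
print(translated.endswith(u'\r\n'))  # carriage return has been translated away
print(raw.strip() == translated.strip())
```

Either way, strip() removes the trailing line ending, so the stripped value
is the single character U+51FA in both cases. Whether print can display it
depends on the console's encoding, which is a separate matter from newline
handling.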