[Tutor] unicode utf-16 and readlines [using the 'codecs' unicode file reading module]

Poor Yorick gp@pooryorick.com
Tue Jan 7 01:01:02 2003


Danny Yoo wrote:

 >You may want to use a "codec" to decode Unicode from a file.  The 'codecs'
 >module is specifically designed for this:
 >
 >    http://www.python.org/doc/lib/module-codecs.html
 >
 >
 >For example:
 >
 >###
 >
 >>>> import codecs
 >>>> f = codecs.open('foo.txt', 'w', 'utf-16')
 >>>> f.write("hello world")
 >>>> f.close()
 >>>> open('foo.txt').read()
 >'\xff\xfeh\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00'
 >
 >>>> f2 = codecs.open('foo.txt', 'r', 'utf-16')
 >>>> f2.readlines()
 >[u'hello world']
 >###
 >
Thanks for the info! Using the codecs module is much better. It's
interesting to note, though, that when I use the codecs module on a real
utf-16 text file, Python's automatic newline handling seems to break
down: the '\r\n' line ending comes back untranslated instead of being
converted to '\n'. For example:

  >>> import codecs
  >>> fh = codecs.open('0022data2.txt', 'r', 'utf-16')
  >>> a = fh.read()
  >>> a
u'\u51fa\r\n'
  >>> print a
??


  >>> a = a.strip()
  >>> print a
?
  >>>
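
A likely explanation, offered as a sketch rather than a definitive
diagnosis: codecs.open() opens the underlying file in binary mode, so
the platform's text-mode translation of '\r\n' to '\n' never happens and
the decoded string keeps the carriage return. The snippet below
reproduces the effect with a stand-in file (the path and contents are
made up for illustration, not the real '0022data2.txt') and compares it
with io.open(), which is an assumption here since it only exists in
Python versions later than the one used in this thread.

```python
import codecs
import io
import os
import tempfile

# Hypothetical stand-in for '0022data2.txt': one CJK character
# followed by a Windows-style line ending, stored as UTF-16 with a BOM.
path = os.path.join(tempfile.mkdtemp(), 'sample-utf16.txt')
with open(path, 'wb') as raw:
    raw.write(u'\u51fa\r\n'.encode('utf-16'))

# codecs.open() reads the file in binary mode and only decodes it,
# so the '\r\n' pair is returned unchanged.
with codecs.open(path, 'r', 'utf-16') as fh:
    via_codecs = fh.read()

# io.open() in text mode decodes and also applies universal-newline
# translation, turning '\r\n' into '\n'.
with io.open(path, 'r', encoding='utf-16') as fh:
    via_io = fh.read()

print(repr(via_codecs))
print(repr(via_io))
```

If io.open() is not available, calling a.replace(u'\r\n', u'\n') on the
decoded string (or strip(), as above) is a simple workaround.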