split() can help to read UTF-16 encoded file without codecs support, why?
Zhongjian Lu
zhongjian.lu at gmail.com
Fri Mar 17 03:29:35 EST 2006
Hi Guys,
I was processing a UTF-16 coded file with BOM and was not aware of the
codecs package at first. I wrote the following code:
===== Code 1============================
for i in open("d:\python24\lzjtest.xml", 'r').readlines():
i = i.decode("utf-16")
print i
=======================================
Output was:
Traceback (most recent call last):
File "D:\Python24\testutf-16.py", line 4, in -toplevel-
i = i.decode("utf-16")
File "D:\Python24\lib\encodings\utf_16.py", line 16, in decode
return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position
84: truncated data
I searched google and found an article on the similar problem saying to use
split(). I had not quite caught the meaning of the article and recode as:
==== Code 2==============================
for i in open("d:\python24\lzjtest.xml", 'r').read().split('\r\n'):
i = i.decode("utf-16")
print i
=======================================
Then it worked (echo the file).
Later I get to know codecs and write the following code:
==== Code 3 =============================
import codecs
for i in codecs.open("d:\python24\lzjtesttvs2.xml", 'r', 'utf-16').readlines():
print i
=======================================
It worked and echo the file.
I am wondering what is the problem with the first code and why the bug
is fixed in
the second.
Thanks in advance.
-Zhongjian
More information about the Python-list
mailing list