Python 3.0 automatic decoding of UTF16

John Machin sjmachin at lexicon.net
Fri Dec 5 17:32:23 EST 2008


On Dec 6, 5:36 am, Johannes Bauer <dfnsonfsdu... at gmx.de> wrote:
> So UTF-16 has an explicit EOF marker within the text? I cannot find one
> in original file, only some kind of starting sequence I suppose
> (0xfeff). The last characters of the file are 0x00 0x0d 0x00 0x0a,
> simple \r\n line ending.

Sorry, *WRONG*. It ends in 00 0d 00 0a 00. The file is 1559 bytes
long, an ODD number, which can't happen in valid UTF-16: every code
unit is two bytes, so the length must be even. The file is stuffed.
Python 3.0 has a bug; it should give a meaningful error message.
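
A quick way to spot this kind of damage before decoding anything is to
check the parity of the byte count. This is only a sketch written for a
current Python 3; the function name is made up for illustration, and the
sample bytes are a stand-in for the real x.txt, built with a BOM and one
stray trailing byte.

```python
def odd_length_report(data):
    """Return a short complaint if data cannot be whole UTF-16 code units, else None."""
    if len(data) % 2:
        # In Python 3, indexing bytes yields an int, hence the %02x format.
        return "odd length %d, dangling last byte 0x%02x" % (len(data), data[-1])
    return None

# Stand-in data, not the real x.txt: one line encoded as UTF-16 with a BOM.
good = "[PhonePBK004]\r\n".encode("utf-16")
bad = good + b"\x00"   # simulate the stray trailing byte

print(odd_length_report(good))
print(odd_length_report(bad))
```

An even length does not prove the file is valid UTF-16, of course; an
odd one proves it is not.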

Python 2.6.0 silently ignores the problem [that's a BUG] when the file
is read by a similar method:

| >>> import codecs
| >>> lines = codecs.open('x.txt', 'r', 'utf16').readlines()
| >>> lines[-1]
| u'[PhonePBK004]\r\n'
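
One plausible way a dangling byte can go unnoticed in a streaming read
is that an incremental decoder holds incomplete input back until it is
told the input has ended. I'm not claiming this is exactly what 2.6's
reader does internally; the sketch below only shows the general
behaviour on a current Python 3, with stand-in data rather than the
real file.

```python
import codecs

# Stand-in data: one UTF-16 line with BOM, plus a stray trailing byte.
data = "[PhonePBK004]\r\n".encode("utf-16") + b"\x00"

decoder = codecs.getincrementaldecoder("utf-16")()

# final defaults to False: the odd byte is buffered and no error is raised.
text = decoder.decode(data)
print(repr(text))

try:
    # Declaring end of input forces the leftover byte to be dealt with.
    decoder.decode(b"", final=True)
    flushed_ok = True
except UnicodeDecodeError as exc:
    flushed_ok = False
    print("at end of input:", exc.reason)
```

If nobody ever makes that final call, the leftover byte is simply
never complained about.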

Python 2.x does, however, give a meaningful, precise error message if
you try a decode on the file contents:

| >>> s = open('x.txt', 'rb').read()
| >>> len(s)
| 1559
| >>> s[-35:]
| '\x00\r\x00\n\x00[\x00P\x00h\x00o\x00n\x00e\x00P\x00B\x00K\x000\x000\x004\x00]\x00\r\x00\n\x00'
| >>> u = s.decode('utf16')
| Traceback (most recent call last):
|   File "<stdin>", line 1, in <module>
|   File "C:\python26\lib\encodings\utf_16.py", line 16, in decode
|     return codecs.utf_16_decode(input, errors, True)
| UnicodeDecodeError: 'utf16' codec can't decode byte 0x00 in position 1558: truncated data
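
For anyone wanting the same check under a current Python 3, a strict
decode of the whole byte string can be wrapped so the failure is
reported instead of raised. This is a sketch under assumptions: the
helper name is invented, and the sample bytes stand in for the real
file contents.

```python
def try_utf16(data):
    """Return (text, None) on success, or (None, error) if strict decoding fails."""
    try:
        return data.decode("utf-16"), None
    except UnicodeDecodeError as exc:
        return None, exc

# Stand-in data, not the real x.txt.
good = "[PhonePBK004]\r\n".encode("utf-16")
bad = good + b"\x00"   # simulate the stray trailing byte

for label, data in (("good", good), ("bad", bad)):
    text, err = try_utf16(data)
    if err is None:
        print(label, "decoded", len(text), "characters")
    else:
        # start and reason are standard attributes of UnicodeDecodeError.
        print(label, "failed at byte", err.start, "-", err.reason)
```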

HTH,
John


