Python 3.0 automatic decoding of UTF16

Sat Dec 6 14:20:26 EST 2008

"Johannes Bauer" <dfnsonfsduifb at gmx.de> wrote in message 
news:1mmq06x4g6.ln2 at joeserver.homelan.net...
>John Machin schrieb:
>> On Dec 6, 5:36 am, Johannes Bauer <dfnsonfsdu... at gmx.de> wrote:
>>> So UTF-16 has an explicit EOF marker within the text? I cannot find one
>>> in original file, only some kind of starting sequence I suppose
>>> (0xfeff). The last characters of the file are 0x00 0x0d 0x00 0x0a,
>>> simple \r\n line ending.
>>
>> Sorry, *WRONG*. It ends in 00 0d 00 0a 00. The file is 1559 bytes
>> long, an ODD number, which shouldn't happen with utf16.  The file is
>> stuffed. Python 3.0 has a bug; it should give a meaningful error
>> message.
>
>Yes, you are right. I fixed the file, yet another error pops up
>(http://www.file-upload.net/download-1299688/2008_12_05_Handy_Backup.txt.html):
>
>Traceback (most recent call last):
>  File "./modify.py", line 12, in <module>
>    a = AddressBook("2008_12_05_Handy_Backup.txt")
>  File "./modify.py", line 7, in __init__
>    line = f.readline()
>  File "/usr/local/lib/python3.0/io.py", line 1807, in readline
>    while self._read_chunk():
>  File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
>    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
>  File "/usr/local/lib/python3.0/io.py", line 1293, in decode
>    output = self.decoder.decode(input, final=final)
>  File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
>    (result, consumed) = self._buffer_decode(data, self.errors, final)
>  File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in _buffer_decode
>    return self.decoder(input, self.errors, final)
>UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 0: truncated data
>
>File size is 1630 bytes - so this clearly cannot be.

How about posting your code?  The first file is incorrect.  It contains an 
extra 0x00 byte at the end of the file, but is otherwise correctly encoded 
with a big-endian UTF16 BOM and data.  The second file is a correct UTF16-BE 
file as well.
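
A quick way to check both claims (which BOM is present, and whether the byte count is even) is to look at the raw bytes directly. The following is only a minimal sketch; the helper name describe_utf16 is made up for illustration, and an even length is necessary but not sufficient for valid UTF-16:

```python
import codecs

def describe_utf16(raw):
    """Return (bom, even_length) for a bytes object.

    bom is "big-endian", "little-endian", or "none".
    even_length is True when len(raw) is a multiple of 2.
    """
    if raw.startswith(codecs.BOM_UTF16_BE):      # b'\xfe\xff'
        bom = "big-endian"
    elif raw.startswith(codecs.BOM_UTF16_LE):    # b'\xff\xfe'
        bom = "little-endian"
    else:
        bom = "none"
    return bom, len(raw) % 2 == 0

# Hypothetical usage on a file read in binary mode:
#     raw = open('2008_12_05_Handy_Backup.txt', 'rb').read()
#     print(describe_utf16(raw))
```

A file like the first one (BE BOM, odd byte count) would be reported as big-endian with even_length False.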

This code (Python 2.6) decodes the first file, removing the trailing extra 
byte:

    raw = open('2008_11_05_Handy_Backup.txt', 'rb').read()
    data = raw[:-1].decode('utf16')

and this code (Python 2.6) decodes the second:

    raw = open('2008_12_05_Handy_Backup.txt', 'rb').read()
    data = raw.decode('utf16')

Python 3.0 also decodes the second file without problems, and it gives an 
accurate error message for the first:

>>> data = open('2008_12_05_Handy_Backup.txt',encoding='utf16').read()
>>> data = open('2008_11_05_Handy_Backup.txt',encoding='utf16').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\dev\python30\lib\io.py", line 1724, in read
    decoder.decode(self.buffer.read(), final=True))
  File "C:\dev\python30\lib\io.py", line 1295, in decode
    output = self.decoder.decode(input, final=final)
  File "C:\dev\python30\lib\codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "c:\dev\python30\lib\encodings\utf_16.py", line 61, in _buffer_decode
    codecs.utf_16_ex_decode(input, errors, 0, final)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x00 in position 1558: truncated data
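
For Python 3, something along these lines should mirror the Python 2.6 workaround of dropping the stray trailing byte before decoding. It is a sketch, not a general repair tool: the function name is invented here, and it assumes the only defect is a single extra byte at the end of the data:

```python
def decode_utf16_dropping_odd_byte(raw):
    """Decode UTF-16 bytes, discarding one trailing byte if the length is odd.

    Assumes the data is otherwise valid UTF-16 (with a BOM, as in the
    files discussed here); any other encoding error still raises
    UnicodeDecodeError.
    """
    if len(raw) % 2:
        raw = raw[:-1]          # drop the unpaired final byte
    return raw.decode('utf-16')

# Hypothetical usage; the file must be opened in binary mode:
#     raw = open('2008_11_05_Handy_Backup.txt', 'rb').read()
#     data = decode_utf16_dropping_odd_byte(raw)
```

Silently discarding a byte hides the fact that the file is damaged, so it may be better to log or report when the length is odd.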

-Mark