Python 3.0 automatic decoding of UTF16

John Machin sjmachin at lexicon.net
Sat Dec 6 22:40:47 CET 2008


On Dec 7, 6:20 am, "Mark Tolonen" <metolone+gm... at gmail.com> wrote:
> "Johannes Bauer" <dfnsonfsdu... at gmx.de> wrote in message
>
> news:1mmq06x4g6.ln2 at joeserver.homelan.net...
>
> >John Machin schrieb:
> >> On Dec 6, 5:36 am, Johannes Bauer <dfnsonfsdu... at gmx.de> wrote:
> >>> So UTF-16 has an explicit EOF marker within the text? I cannot find one
> >>> in original file, only some kind of starting sequence I suppose
> >>> (0xfeff). The last characters of the file are 0x00 0x0d 0x00 0x0a,
> >>> simple \r\n line ending.
>
> >> Sorry, *WRONG*. It ends in 00 0d 00 0a 00. The file is 1559 bytes
> >> long, an ODD number, which shouldn't happen with utf16.  The file is
> >> stuffed. Python 3.0 has a bug; it should give a meaningful error
> >> message.
>
> >Yes, you are right. I fixed the file, yet another error pops up
> >(http://www.file-upload.net/download-1299688/2008_12_05_Handy_Backup.t...
>
> >Traceback (most recent call last):
> >  File "./modify.py", line 12, in <module>
> >    a = AddressBook("2008_12_05_Handy_Backup.txt")
> >  File "./modify.py", line 7, in __init__
> >    line = f.readline()
> >  File "/usr/local/lib/python3.0/io.py", line 1807, in readline
> >    while self._read_chunk():
> >  File "/usr/local/lib/python3.0/io.py", line 1556, in _read_chunk
> >    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
> >  File "/usr/local/lib/python3.0/io.py", line 1293, in decode
> >    output = self.decoder.decode(input, final=final)
> >  File "/usr/local/lib/python3.0/codecs.py", line 300, in decode
> >    (result, consumed) = self._buffer_decode(data, self.errors, final)
> >  File "/usr/local/lib/python3.0/encodings/utf_16.py", line 69, in
> >_buffer_decode
> >    return self.decoder(input, self.errors, final)
> >UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 0:
> >truncated data
>
> >File size is 1630 bytes - so this clearly cannot be.
>
> How about posting your code?

He did. Ugly stuff using readline() :-) Should still work, though.
There are definite problems with readline() and readlines(),
including:

First file: the error is silently ignored *and* the last line returned
is garbage [it consists of multiple actual lines, and the trailing
codepoints have been byte-swapped]

Second file: as he has just reported. I've reproduced it with
f = open('second_file.txt', encoding='utf16')
followed by each of:
(1) f.readlines()
(2) list(f)
(3) for line in f:
        print(repr(line))
With the last one, the error happens after printing the last actual
line in his file.
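
For anyone who wants to repeat the check, here is a minimal sketch of the
three read styles listed above, preceded by the odd-byte-count test that
caught the first file. It does not use the poster's actual file: the
payload, the temporary file, and the helper name read_three_ways are all
stand-ins made up for illustration.

```python
import os
import tempfile


def read_three_ways(path, encoding="utf16"):
    """Read `path` via readlines(), list(f), and a for loop (hypothetical helper)."""
    with open(path, encoding=encoding) as f:
        via_readlines = f.readlines()
    with open(path, encoding=encoding) as f:
        via_list = list(f)
    via_loop = []
    with open(path, encoding=encoding) as f:
        for line in f:
            via_loop.append(line)
    return via_readlines, via_list, via_loop


# Stand-in data (an assumption, not the real backup file):
# a BOM followed by two CRLF-terminated lines.
payload = "first line\r\nsecond line\r\n".encode("utf-16")

fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(payload)

try:
    # UTF-16 code units are two bytes each, so an odd size means a damaged file.
    size = os.path.getsize(path)
    print(size, "bytes;", "even" if size % 2 == 0 else "ODD, not valid UTF-16")

    results = read_three_ways(path)
    for lines in results:
        print([repr(line) for line in lines])
finally:
    os.remove(path)
```

On a well-formed file the three result lists should agree with each
other. If they differ, or if only some of the calls raise
UnicodeDecodeError, that points at the reading code rather than at the
data.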



