Python 3.0 automatic decoding of UTF16
tjreedy at udel.edu
Sun Dec 7 10:15:29 CET 2008
John Machin wrote:
> Here's the scoop: It's a bug in the newline handling (in io.py, class
> IncrementalNewlineDecoder, method decode). It reads text files in 128-
> byte chunks. Converting CR LF to \n requires special case handling
> when '\r' is detected at the end of the decoded chunk n in case
> there's an LF at the start of chunk n+1. Buggy solution: prepend b'\r'
> to the chunk n+1 bytes and decode that -- suddenly with a 2-bytes-per-
> char encoding like UTF-16 we are 1 byte out of whack. Better (IMVHO)
> solution: prepend '\r' to the result of decoding the chunk n+1
> bytes. Each of the OP's files has \r on a 64-character boundary.
> Note: They would exhibit the same symptoms if encoded in utf-16LE
> instead of utf-16BE. With the better solution applied, the first file
> [the truncated one] gave the expected error, and the second file [the
> apparently OK one] gave sensible looking output.
>  I thought it best to be Very Humble given what you see when you
> import io
> Hope my surge protector can cope with this :-)
> NO CARRIER
Please post this on the tracker so it can get included with other io
work for 3.0.1.
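
For the tracker report, the misalignment John describes can be reproduced without io.py at all. The following is only a minimal sketch, not the actual IncrementalNewlineDecoder code: the sample string, the 6-byte split point, and the names chunk_n, chunk_n1, buggy, and better are assumptions made for illustration. It places '\r' at the end of one chunk and '\n' at the start of the next, then compares prepending b'\r' to the undecoded bytes with prepending '\r' to the decoded text.

```python
# Illustration only: a hand-made two-chunk split, not the real io.py logic.
ENCODING = "utf-16-be"  # two bytes per character, no BOM

data = "ab\r\ncd".encode(ENCODING)

# Split so that the '\r' is the last character of chunk n
# and the '\n' is the first character of chunk n+1.
chunk_n, chunk_n1 = data[:6], data[6:]
assert chunk_n.decode(ENCODING) == "ab\r"

# Buggy approach: prepend a single byte b'\r' to the raw bytes of chunk n+1.
# Every following 2-byte unit is now shifted by one byte, so the decoder
# pairs up the wrong bytes. errors="replace" is used here only so the
# garbage can be inspected instead of raising on the odd trailing byte.
buggy = (b"\r" + chunk_n1).decode(ENCODING, errors="replace")

# Better approach: decode chunk n+1 first, then prepend the character '\r'.
better = "\r" + chunk_n1.decode(ENCODING)

print(repr(buggy))
print(repr(better))
print(repr(better.replace("\r\n", "\n")))
```

With a 128-byte read size and a two-byte encoding, a chunk boundary falls every 64 characters, which matches the observation that the OP's files have '\r' on a 64-character boundary. The same shift would occur with utf-16-le, as noted in the quoted message.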