[Python-bugs-list] [ python-Bugs-222395 ] readline() of codecs.StreamReader doesn't work for "utf-16le"

Fri, 05 Apr 2002 04:15:49 -0800

Bugs item #222395, was opened at 2000-11-14 13:37
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=222395&group_id=5470

Category: Unicode
Group: None
>Status: Closed
>Resolution: Fixed
Priority: 6
Submitted By: Nobody/Anonymous (nobody)
Assigned to: M.-A. Lemburg (lemburg)
>Summary: readline() of codecs.StreamReader doesn't work for "utf-16le"

Initial Comment:
I tried that in
BOTH Python 1.6 and Python 2.0
(operating system: Windows NT)

I wrote :

import codecs

fileName1 = r"d:\sveta\unicode\try.txt"  # raw string, so "\t" is not read as a tab
(UTF16LE_encode, UTF16LE_decode,
 UTF16LE_streamreader, UTF16LE_streamwriter) = codecs.lookup('UTF-16LE')

output = UTF16LE_streamwriter( open(fileName1, 'wb') )
output.write(unicode('abc\n'))
output.write(unicode('def\n'))
output.close()

input = UTF16LE_streamreader( open(fileName1, 'rb') )
rl = input.readline()
print rl
input.close()

After I run it I got:

Traceback (most recent call last):
  File "d:\sveta\unicode\unicodecheck.py", line 13, in ?
    rl = input.readline()
  File "D:\Program Files\Python16\lib\codecs.py", line 250, in readline
    return self.decode(line)[0]
UnicodeError: UTF-16 decoding error: truncated data

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2002-04-05 12:15

Message:

I've checked in a patch which raises a NotImplementedError for 
.readline() on UTF-16, -LE, -BE.

This is not ideal, but more accurate than what was in place
before.
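
For illustration only, here is a rough sketch of what such a guard could
look like. It is not the checked-in patch: the class name GuardedReader is
made up, and the code targets a current Python 3 rather than the 2.x
codecs module discussed in this report.

```python
import codecs
import io

# Hypothetical subclass of the stock UTF-16LE stream reader (illustration only).
BaseReader = codecs.getreader("utf-16-le")


class GuardedReader(BaseReader):
    def readline(self, size=None, keepends=True):
        # Refuse line-wise reading instead of returning corrupted data.
        raise NotImplementedError(".readline() is not implemented for UTF-16")


data = "abc\ndef\n".encode("utf-16-le")

reader = GuardedReader(io.BytesIO(data))
try:
    reader.readline()
    raised = False
except NotImplementedError:
    raised = True

# Reading the whole stream is still possible.
whole = GuardedReader(io.BytesIO(data)).read()
print(raised, repr(whole))
```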

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2002-03-05 16:45

Message:

Uhm... it does raise an exception ;-)

It is hard to fix this bug, since Unicode line breaking
is much more elaborate than standard C lib type
line breaking. The only way I see to handle this
properly is by introducing line buffering. However,
this can slow down the codec considerably.

Perhaps we should simply have the .readline()
method raise a NotImplementedError ?!

----------------------------------------------------------------------

Comment By: Jeremy Hylton (jhylton)
Date: 2002-03-01 22:37

Message:

What should be done to fix this?  It sounds like things are 
plain broken.  If readline() doesn't work, it should raise 
an exception at the very least.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2000-11-14 14:43

Message:
Some background:

.readline() is implemented in the way it is because all other
techniques would require adding real buffering to the codec (AFAIK,
at least), and this is currently out of scope.

Besides, there is another problem:  line breaking in Unicode is much
more difficult to get right than for plain ASCII, since there are a lot
more line break characters to watch out for.
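
A small illustration of the point, assuming a current Python 3 (where str
is Unicode and bytes is the 8-bit type): splitting decoded text recognises
several Unicode line separators, while splitting raw bytes only sees the
ASCII ones. The sample string is invented for the demonstration.

```python
# Sample text using LF, CRLF, LINE SEPARATOR (U+2028),
# PARAGRAPH SEPARATOR (U+2029) and NEXT LINE (U+0085).
text = "one\ntwo\r\nthree\u2028four\u2029five\x85six"

# Unicode-aware splitting treats all of them as line boundaries.
lines = text.splitlines()

# Byte-level splitting only knows about \n, \r and \r\n.
byte_lines = text.encode("utf-8").splitlines()

print(len(lines), lines)
print(len(byte_lines))
```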

.readline() is currently relying on the underlying stream to do the
line breaking. Since the stream doesn't know anything about encodings,
it will break lines at single bytes. As a result, the input data for the
codec is broken.

To correct the problem, one would have to write a true UTF-16 codec
which implements buffering. This should be doable in Python, e.g. see
how StringIO does it. The codec would then have to read the
input data in chunks of say 1024 bytes (must be even), then
pass the data through the codec and use the .splitlines() method on
the Unicode output. Data which is not on the current line would
have to be buffered until the next call to .read() or .readline().
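
A minimal sketch of the buffering idea described above, written for a
current Python 3. The class BufferedLineReader and its attributes are
hypothetical names, not part of the codecs module. It uses
codecs.getincrementaldecoder(), which did not exist when this report was
filed; the incremental decoder holds back a trailing half code unit, so the
chunk size does not have to be even here.

```python
import codecs
import io


class BufferedLineReader:
    """Hypothetical helper: decode in chunks, hand out buffered lines."""

    def __init__(self, stream, encoding="utf-16-le", chunk_size=1024):
        self.stream = stream
        self.chunk_size = chunk_size
        # Keeps incomplete byte sequences between calls to decode().
        self.decoder = codecs.getincrementaldecoder(encoding)()
        self.pending = ""  # decoded text that has not been returned yet
        self.eof = False

    def readline(self):
        while True:
            lines = self.pending.splitlines(True)
            # The first line is known to be complete once something follows
            # it (this also settles "\r\n"), or once the input is exhausted.
            if len(lines) > 1 or (self.eof and lines):
                self.pending = self.pending[len(lines[0]):]
                return lines[0]
            if self.eof:
                return ""
            chunk = self.stream.read(self.chunk_size)
            if not chunk:
                self.eof = True
            self.pending += self.decoder.decode(chunk, final=self.eof)


data = "abc\ndef\n".encode("utf-16-le")
reader = BufferedLineReader(io.BytesIO(data), chunk_size=3)
result = [reader.readline(), reader.readline(), reader.readline()]
print(result)
```

Note that the underlying stream has been read ahead of the lines handed
out, so its .tell() no longer matches the reader's logical position. The
sketch also re-splits the whole buffer on every call, which is simple but
not efficient.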

Unfortunately, this technique will also break .tell(), .truncate() and friends...
it's a mess.

An easy work-around is reading in the file as a whole and then
using .splitlines() to get at the lines.
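
The work-around might look like this in a current Python 3. The
io.BytesIO object is only a stand-in for a file opened in binary mode,
and the data matches the two lines written in the original report.

```python
import io

# Stand-in for open(fileName1, "rb") from the report.
stream = io.BytesIO("abc\ndef\n".encode("utf-16-le"))

# Decode the complete contents first, then split on line boundaries.
text = stream.read().decode("utf-16-le")
lines = text.splitlines()

print(lines)
```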

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2000-11-14 14:09

Message:
A little bit of debugging suggests that the StreamReader.readline() method is naive: it calls the underlying stream's readline() method. Since in the example code the underlying stream is a regular 8-bit file, this will return an odd number of bytes in the example. Because of the little-endian encoding, the file contains these hex bytes: 61 00 62 00 63 00 0a 00 ... (0a being '\n').
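
The effect can be reproduced without a file. The following sketch assumes
a current Python 3 and uses io.BytesIO as a stand-in for the 8-bit file;
it does not call StreamReader at all, it only shows what a byte-oriented
readline() hands to the decoder.

```python
import io

# The same two lines as in the report, encoded as UTF-16LE.
data = "abc\ndef\n".encode("utf-16-le")

# A byte stream stops right after the first 0x0a byte, which is only the
# low half of the two-byte '\n' code unit.
raw = io.BytesIO(data)
first = raw.readline()

print(" ".join("%02x" % b for b in first))
print(len(first))

# Decoding the odd-length piece fails because the last code unit is cut.
try:
    first.decode("utf-16-le")
    decode_ok = True
except UnicodeDecodeError:
    decode_ok = False
print(decode_ok)
```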

I'm not familiar enough with this class to tell whether this is simply inappropriate use of StreamReader, or whether this should be fixed. Maybe Marc-Andre can answer at least that question?

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2000-11-14 14:02

Message:
One for Marc-Andre. (Unfortunately he's announced he'll be too busy to look at bugs this year, so if someone else has a smart idea, feel free to butt in!)

This was originally classified as a Windows bug, but it's platform independent (I can reproduce it on Linux as well).

----------------------------------------------------------------------
