[ python-Bugs-1076985 ] Incorrect behaviour of StreamReader.readline leads to crash

SourceForge.net noreply at sourceforge.net
Fri Dec 17 22:19:02 CET 2004


Bugs item #1076985, was opened at 2004-12-01 19:51
Message generated for change (Comment added) made by lemburg
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1076985&group_id=5470

Category: Python Library
Group: Python 2.4
Status: Open
Resolution: Accepted
Priority: 5
Submitted By: Sebastian Hartte (dark-storm)
>Assigned to: Walter Dörwald (doerwalter)
Summary: Incorrect behaviour of StreamReader.readline leads to crash

Initial Comment:
StreamReader.readline in codecs.py misinterprets the
size parameter (File: codecs.py, Line: 296).
The reported crash happens when a file processed by the
Python tokenizer uses a non-built-in codec for the
source file encoding (e.g. iso-8859-15) and has lines
that are longer than the BUFSIZ define from stdio.h on
the platform Python is compiled on. (On Windows with
MSVC++ this define is 512, so a line longer than 511
characters should suffice to crash Python with the
right encoding.)
The crash happens because the Python core assumes that
the StreamReader.readline method returns a string
shorter than the platform's BUFSIZ macro (512 for MSVC).

The current StreamReader.readline() looks like this:
---------------------------------------
    def readline(self, size=None, keepends=True):

        """ Read one line from the input stream and return the
            decoded data.

            size, if given, is passed as size argument to the
            read() method.

        """
        if size is None:
            size = 10
        line = u""
        while True:
            data = self.read(size)
            line += data
            pos = line.find("\n")
            if pos >= 0:
                self.charbuffer = line[pos+1:] + self.charbuffer
                if keepends:
                    line = line[:pos+1]
                else:
                    line = line[:pos]
                return line
            elif not data:
                return line
            if size < 8000:
                size *= 2
---------------------------------------

However, the core expects readline() to return a string
of at most size characters. readline() instead merely
passes size through to read() as a starting chunk size
and keeps reading until it finds a newline.
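
To make the mismatch concrete, here is a quick demonstration
(hypothetical file name; a sketch against the Python 2.4
implementation, not part of the report) that readline() can
return far more than size characters:

---------------------------------------
import codecs

# Build a file containing one very long line.
f = open("long.txt", "w")
f.write("x" * 1000 + "\n")
f.close()

reader = codecs.getreader("iso-8859-15")(open("long.txt"))
line = reader.readline(100)
print len(line)   # prints 1001, not something <= 100: size is
                  # only the starting chunk size for read(),
                  # not a cap on the result
---------------------------------------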

There are multiple ways of solving this issue. 

a) Make the core reallocate the token memory if the
string returned by the readline function exceeds the
expected size (risky if the line is very long). 

b) Fix, rename, remodel, or otherwise change
StreamReader.readline. If no other part of the core or
library relies on the old behaviour of size, the
following readline() behaves correctly with arbitrarily
long lines:

---------------------------------------
    def readline(self, size=None, keepends=True):

        """ Read one line from the input stream and return the
            decoded data.

            size, if given, is passed as size argument to the
            read() method.

        """
        if size is None:
            size = 10
        data = self.read(size)
        pos = data.find("\n")
        if pos >= 0:
            self.charbuffer = data[pos+1:] + self.charbuffer
            if keepends:
                data = data[:pos+1]
            else:
                data = data[:pos]
            return data
        else:
            # Return the data directly, since otherwise we
            # would exceed the given size.
            return data
---------------------------------------

Reproducing this bug:
This bug is easiest to reproduce on platforms that use
a small BUFSIZ for stdio operations (e.g. MSVC). I
didn't check, but Linux might use more than 8000 bytes
for the default buffer size, in which case you would
need a line of more than 8000 characters to reproduce
it there.

In addition, this bug only shows up if the Python
library's StreamReader.readline() method is used; with
internal codecs like Latin-1 there is no crash, since
that method isn't involved.

I've attached a file that crashes my Python 2.4 on
Windows using the official binary released on
python.org today.
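
A generator for such a crasher, in the spirit of the attached
file (file name hypothetical), might look like this:

---------------------------------------
# The coding declaration must name a codec that the tokenizer
# reads via StreamReader.readline(), i.e. not an internal
# encoding like Latin-1 that bypasses that method.
f = open("crash.py", "w")
f.write("# -*- coding: iso-8859-15 -*-\n")
f.write("x = '" + "a" * 600 + "'\n")  # longer than MSVC's BUFSIZ (512)
f.close()

# Running or importing crash.py on an affected Python 2.4
# build should then overrun the tokenizer's token buffer.
---------------------------------------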

Last but not least, here is the assertion that fails
if Python is compiled in debug mode with MSVC++ 7.1:

Assertion failed: strlen(str) < (size_t)size, file
\Python-2.4\Parser\tokenizer.c, line 367

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2004-12-17 22:19

Message:
Logged In: YES 
user_id=38388

Looks good, please check it in with one correction:
"".join() should be u"".join().

Thanks, Walter !

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2004-12-17 19:12

Message:
Logged In: YES 
user_id=89016

Here is a second version of the patch 
(fix_readline_and_read2.diff) that implements all those 
changes. With this we get "universal newline" support for 
free. Note that there is still one corner case: if the byte
stream is temporarily exhausted and the last character read is
a \r, and the next character read (once the stream contains
new data) is a \n, this will result in two lines instead of one.
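
The corner case can be pictured with a toy stream (hypothetical
class, for illustration only, not part of the patch):

---------------------------------------
class ChunkedStream:
    """Returns one predefined chunk per read() call, simulating
    a byte stream that is temporarily exhausted between reads."""
    def __init__(self, chunks):
        self.chunks = list(chunks)
    def read(self, size=-1):
        if self.chunks:
            return self.chunks.pop(0)
        return ""

# The first read() ends in "\r"; the matching "\n" only arrives
# with the next chunk. A reader that treats the lone "\r" as a
# finished line break reports two lines ("abc\r" and "\ndef\n")
# where there is really only one.
stream = ChunkedStream(["abc\r", "\ndef\n"])
---------------------------------------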

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-12-17 13:59

Message:
Logged In: YES 
user_id=38388

Walter, the fix to .read() looks OK, but the implementation
of .readline() is not correct: there are far more line break
characters in Unicode than just \n. The correct way to check
would be by using .splitlines() which does know about all
the possible line break characters in Unicode, plus it also
handles line ends such as \r\n and just \r correctly. If
.splitlines() returns a list with more than one entry you
know that you've hit a line end and you can push the
remaining entries back onto the .charbuffer.

Another nit: you should bump the default for readsize to 72
- the average line size used in text files.
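
A minimal sketch of such a .splitlines()-based readline()
(assuming the read()/charbuffer machinery of codecs.py; this is
not the patch that was eventually checked in) might look like
this:

---------------------------------------
    def readline(self, size=None, keepends=True):
        if size is None:
            size = 72   # suggested default: average line size
        line = u""
        while True:
            data = self.read(size)
            line += data
            # The probe character distinguishes u"abc\n" (a
            # complete line) from u"abc" (no line break yet),
            # which plain splitlines() would conflate.
            parts = (line + u"x").splitlines(True)
            if len(parts) > 1:
                # Hit a line break: push everything after the
                # first line (minus the probe) back onto the
                # character buffer.
                self.charbuffer = u"".join(parts[1:])[:-1] \
                                  + self.charbuffer
                line = parts[0]
                if not keepends:
                    line = line.splitlines(False)[0]
                return line
            if not data:
                return line
---------------------------------------

Note that this sketch shares the corner case mentioned in the
comment above: a buffer ending in a lone \r is treated as a
complete line break even if a \n follows in the next chunk.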

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2004-12-16 00:55

Message:
Logged In: YES 
user_id=89016

OK, here is a patch (fix_readline_and_read.diff) that keeps 
the new readline behaviour when size is None and calls read 
only once (i.e. the 2.3 behaviour) otherwise. The "read() 
cleanup patch" is also included. Running dark-storm's test 
script on Windows with the patch applied confirms that it 
fixed this particular problem (of course the tokenizer should 
still be fixed). Marc-André, if there are no objections, can I 
check this in?

----------------------------------------------------------------------

Comment By: Sebastian Hartte (dark-storm)
Date: 2004-12-03 22:45

Message:
Logged In: YES 
user_id=377356

<quote>
and puts the resulting byte string into the char * passed
in, so even if we fix the readline call the UTF-8 encoded
string might still overflow the available space. How can
tokenizer.c be sure how much the foo->utf8 transcoding
shrinks or expands the string?
</quote>

Whoops. That's correct, and I missed that. I thought we
were talking about the UTF-8 codec here. Then it's correct
that fixing readline() alone probably won't fix the issue.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2004-12-03 21:01

Message:
Logged In: YES 
user_id=89016

> I checked the decoding_fgets function (and the enclosed
> call to fp_readl). The patch is more problematic than I
> thought, since decoding_fgets not only takes a pointer to
> the token state but also a pointer to a destination string
> buffer. Reallocating the buffer within fp_readl would mean
> a very very bad hack since you'd have to reallocate
> "foreign" memory based on a pointer comparison (comparing
> the passed string buffer's pointer against tok->inp ||
> tok->buf...).

Maybe all pointers pointing into the buffer should be moved
into a struct?

> As it stands now, patching the tokenizer would mean
> changing the function signatures or otherwise changing
> the structure (more error-prone).

All the affected functions seem to be static, so at least in
this regard there shouldn't be any problem.

> Another possible solution would be to
> provide a specialized readline() function for which the
> assumption that at most size bytes are returned is correct.

All the codecs would have to provide such a readline().

BTW, the more I look at your patch the more I think
that it gets us as close to the old behaviour as we
can get.

> Oh and about that UTF-8 decoding. readline()'s size
> restriction works on the already decoded string (at least it
> should), so that shouldn't be an issue.

I don't understand that. fp_readl() does the following
two calls:

buf = PyObject_Call(tok->decoding_readline, args, NULL);
utf8 = PyUnicode_AsUTF8String(buf);

and puts the resulting byte string into the char * passed
in, so even if we fix the readline call the UTF-8 encoded
string might still overflow the available space. How can
tokenizer.c be sure how much the foo->utf8 transcoding
shrinks or expands the string?
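
The expansion is easy to demonstrate at the Python level (a
sketch of what the PyUnicode_AsUTF8String step does to the
length):

---------------------------------------
# One decoded character can become several UTF-8 bytes, so a
# buffer sized from the decoded string's length can overflow
# after the transcoding step.
line = u"\u20ac" * 200           # 200 characters (EURO SIGN)
utf8 = line.encode("utf-8")      # 600 bytes: 3 per character
assert len(utf8) == 3 * len(line)
---------------------------------------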

> Maybe another
> optional parameter should be added to readline() called
> limit=None which doesn't limit the returned string by
> default, but does so if the parameter is a positive number.

But limit to what?


----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2004-12-03 20:43

Message:
Logged In: YES 
user_id=89016

> The problem with the change is that it applies to *all*
> codecs. If only the UTF-16 codec has a problem with the
> standard logic, it should override the .readline()
> method as necessary, but this should not affect all
> the other codecs.

readline() had to be rewritten anyway to take the
bytebuffer into account. True, the bytebuffer is
only needed for UTF-8, UTF-16, UTF-16-LE, UTF-16-BE
and maybe a few others, but unless we'd introduced a
ByteBufferStreamReader that those codecs can subclass,
this would have meant code duplication.

I'll try to come up with a readline() patch (for the base
class readline()) next week.
>
>> BTW, the logic in read() looks rather convoluted to me
>> now that I look at it a second time. Should we clean
>> this up a bit?
>
> If that's possible, yes :-)

Attached is a patch (fixread.diff) that makes read()
a *little* simpler. Making it much simpler breaks
several tests.


----------------------------------------------------------------------

Comment By: Sebastian Hartte (dark-storm)
Date: 2004-12-03 16:42

Message:
Logged In: YES 
user_id=377356

I checked the decoding_fgets function (and the enclosed
call to fp_readl). The patch is more problematic than I
thought, since decoding_fgets not only takes a pointer to
the token state but also a pointer to a destination string
buffer. Reallocating the buffer within fp_readl would mean
a very very bad hack since you'd have to reallocate
"foreign" memory based on a pointer comparison (comparing
the passed string buffer's pointer against tok->inp ||
tok->buf...).

As it stands now, patching the tokenizer would mean
changing the function signatures or otherwise changing the
structure (more error-prone). Another possible solution
would be to provide a specialized readline() function for
which the assumption that at most size bytes are returned
is correct.

Oh and about that UTF-8 decoding. readline()'s size
restriction works on the already decoded string (at least it
should), so that shouldn't be an issue. Maybe another
optional parameter should be added to readline() called
limit=None which doesn't limit the returned string by
default, but does so if the parameter is a positive number.

Just my 0.2$


----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-12-03 15:46

Message:
Logged In: YES 
user_id=38388

>> In Python 2.3, the size parameter was simply passed down
>> to the stream's .readline() method, so semantics were
>> defined by the stream rather than the codec.
>
> In most cases this meant that there were at most size
> bytes read from the byte stream. It seems that
> tokenizer.c:fp_readl() assumes that, so it broke with the
> change.
>
>> I think that we should restore this kind of behaviour
>> for Python 2.4.1.
>
> That's not possible (switching back to calling readline()
> on the byte stream), because it breaks the UTF-16 decoder,
> but we can get something that is close.
The problem with the change is that it applies to *all*
codecs. If only the UTF-16 codec has a problem with the
standard logic, it should override the .readline()
method as necessary, but this should not affect all
the other codecs.

Unless I'm missing something, the other codecs
work just fine with the old implementation of the
method.

>> What was the reason why you introduced the change
>> in semantics ?
>
> 1) To get readline() to work with UTF-16 it's no longer
> possible to call readline() for the byte stream. This has
> to be replaced by one or more calls to read().
>
> 2) As you say, size was always just a hint. With line
> buffering (which is required for UTF-16 readline) this
> hint becomes even more meaningless.

That's OK for the UTF-16 codec, but why the generic change
in the base class ?

> So I'd say:
>
> 1) Fix tokenizer.c:fp_readl(). It looks to me like the
> code had a problem in this spot anyway: There's an
>
> assert(strlen(str) < (size_t)size); /* XXX */
>
> in the code, and a string *can* get longer when it's
> encoded to UTF-8, which fp_readl() does.
>
> dark-storm, if you can provide a patch for this problem,
> go ahead.

+1
 
> 2) Change readline(), so that calling it with a size
> parameter results in only one call to read(). If read()
> is called with chars==-1 (which it will be in this case),
> this will in turn only call read() for the byte stream
> once (at most). If size isn't specified, the caller
> should be able to cope with any returned string length,
> so I think the current behaviour (calling read() multiple
> times until a "\n" shows up) can be kept.

+1 for UTF-16

I'm not sure whether the current implementation is needed
for all other codecs.
 
> BTW, the logic in read() looks rather convoluted to me
> now that I look at it a second time. Should we clean this
> up a bit?

If that's possible, yes :-)
 

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2004-12-03 12:58

Message:
Logged In: YES 
user_id=89016

> In Python 2.3, the size parameter was simply passed down
> to the stream's .readline() method, so semantics were
> defined by the stream rather than the codec.

In most cases this meant that there were at most size
bytes read from the byte stream. It seems that
tokenizer.c:fp_readl() assumes that, so it broke with the
change.

> I think that we should restore this kind of behaviour
> for Python 2.4.1.

That's not possible (switching back to calling readline() on
the byte stream), because it breaks the UTF-16 decoder, but
we can get something that is close.

> What was the reason why you introduced the change
> in semantics ?

1) To get readline() to work with UTF-16 it's no longer 
possible to call readline() for the byte stream. This has to be 
replaced by one or more calls to read().

2) As you say, size was always just a hint. With line
buffering (which is required for UTF-16 readline) this hint
becomes even more meaningless.

So I'd say:

1) Fix tokenizer.c:fp_readl(). It looks to me like the code had 
a problem in this spot anyway: There's an

assert(strlen(str) < (size_t)size); /* XXX */

in the code, and a string *can* get longer when it's encoded 
to UTF-8, which fp_readl() does.

dark-storm, if you can provide a patch for this problem, go 
ahead.

2) Change readline(), so that calling it with a size
parameter results in only one call to read(). If read() is
called with chars==-1 (which it will be in this case), this
will in turn only call read() for the byte stream once (at
most). If size isn't specified, the caller should be able
to cope with any returned
string length, so I think the current behaviour (calling read() 
multiple times until a "\n" shows up) can be kept.
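
A sketch of that proposed behaviour (assuming the base class's
read() and charbuffer; hypothetical code, not an actual patch):

---------------------------------------
    def readline(self, size=None, keepends=True):
        if size is not None:
            # size given: exactly one call to read(); the
            # caller copes with a partial line if no "\n"
            # shows up within the returned data
            data = self.read(size)
        else:
            # size not given: keep reading until a "\n"
            # shows up or the stream is exhausted
            chunks = []
            while True:
                chunk = self.read(72)
                chunks.append(chunk)
                if u"\n" in chunk or not chunk:
                    break
            data = u"".join(chunks)
        pos = data.find(u"\n")
        if pos >= 0:
            self.charbuffer = data[pos+1:] + self.charbuffer
            if keepends:
                return data[:pos+1]
            return data[:pos]
        return data
---------------------------------------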

BTW, the logic in read() looks rather convoluted to me now
that I look at it a second time. Should we clean this up a bit?


----------------------------------------------------------------------

Comment By: Sebastian Hartte (dark-storm)
Date: 2004-12-02 22:38

Message:
Logged In: YES 
user_id=377356

I forgot to reply to doerwalter's comment.

*snip* I couldn't get Linux Python to crash or assert with the 
attached gurk.py, Windows Python crashed at i=70. *snip*

Linux is using a larger BUFSIZ for me. MUCH larger, I have
to add. MSVC used 512 bytes for me, while the C standard
library on my Linux box defines BUFSIZ to be 8192 bytes.
That should suffice for looooong lines.

----------------------------------------------------------------------

Comment By: Sebastian Hartte (dark-storm)
Date: 2004-12-02 22:35

Message:
Logged In: YES 
user_id=377356

The problem here is that the tokenizer tries to read at most
512 bytes into its token buffer and then reads another 512
bytes if the line isn't complete yet. At least that is what
a quick look at the tokenizer's freadln function (or
whatever the function containing the assert is called) made
me think.
The problem here is that the token buffer either has to be
reallocated (easy) or another solution has to be found.

If the reallocation of the token buffer is sufficient, I
can easily provide a patch for that.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-12-02 22:21

Message:
Logged In: YES 
user_id=38388

Walter, your analysis is not quite right: the size parameter
for .readline() was always just a hint and never intended to
limit the number of bytes returned by the method (like the
size parameter in .read()).

In Python 2.3, the size parameter was simply passed down to
the stream's .readline() method, so semantics were defined
by the stream rather than the codec.

I think that we should restore this kind of behaviour for
Python 2.4.1.

Any code which makes assumptions on the length of the
returned data is simply broken. If the Python parser makes
such an assumption it should be fixed.

Again, the size parameter is just a hint for the
implementation. It might as well ignore it completely. The
reason for having the parameter is to limit the number of
bytes read in case the stream does not include line breaks -
otherwise, .readline() could end up reading the whole file
into memory.

What was the reason why you introduced the change in semantics ?

I would have complained about it, had I known about this
change (I only saw your change to .read(), which was
required for the stateful codecs).

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2004-12-02 21:58

Message:
Logged In: YES 
user_id=89016

I couldn't get Linux Python to crash or assert with the
attached gurk.py; Windows Python crashed at i=70.

The problem we have is that Python 2.4 changed the meaning 
of the size parameter in StreamReader.readline(). There are 
at least four possible interpretations:

1) Never read more than size bytes from the underlying byte 
stream in one call to readline()

2) Never read more than size characters from the underlying 
byte stream in one call to readline()

3) If calling readline() requires reading from the underlying 
byte stream, do the reading in chunks of size bytes.

4) Never return more than size characters from readline(), if 
there's no linefeed within size characters the result is a partial 
line.

In Python 2.3, 1) was used with the implicit assumption that
this guarantees that the result will never be longer than size. 

Your patch looks like it could restore the old behaviour,
but we'd lose the ability to reliably read a line until a
"\n" is available without passing size=-1 into read(), which
would read the whole stream.

----------------------------------------------------------------------

Comment By: Sebastian Hartte (dark-storm)
Date: 2004-12-01 21:36

Message:
Logged In: YES 
user_id=377356

I attached the .diff file for my patch. I hope I got this right.

----------------------------------------------------------------------

Comment By: Sebastian Hartte (dark-storm)
Date: 2004-12-01 21:32

Message:
Logged In: YES 
user_id=377356

"Can you attach a proper diff for your changes to 
StreamReader.readline()?"

I would like to, but I don't have access to a proper diff
utility on Windows. I will try to get one and then attach
the diff file.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2004-12-01 20:13

Message:
Logged In: YES 
user_id=89016

What I get when I execute your test.py on Windows is:
---
Fatal Python error: GC object already tracked

This application has requested the Runtime to terminate it in 
an unusual way.
Please contact the application's support team for more 
information.
---

Is this the expected failure?

Can you attach a proper diff for your changes to 
StreamReader.readline()?


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1076985&group_id=5470


More information about the Python-bugs-list mailing list