[ python-Bugs-1076985 ] Incorrect behaviour of StreamReader.readline leads to crash

Thu Dec 2 22:21:40 CET 2004

Bugs item #1076985, was opened at 2004-12-01 19:51
Message generated for change (Comment added) made by lemburg
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1076985&group_id=5470

Category: Python Library
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Submitted By: Sebastian Hartte (dark-storm)
Assigned to: Nobody/Anonymous (nobody)
Summary: Incorrect behaviour of StreamReader.readline leads to crash

Initial Comment:
StreamReader.readline in codecs.py misinterprets the
size parameter (file: codecs.py, line: 296).
The reported crash happens if a file processed by the
Python tokenizer uses a codec that is not built in for
its source file encoding (e.g. iso-8859-15) and has lines
that are longer than the BUFSIZ define in stdio.h on
the platform Python was compiled on. (On Windows with
MSVC++ this define is 512, so a line longer than 511
characters in a file with such an encoding should
suffice to crash Python.)
The crash happens because the Python core assumes that
the StreamReader.readline method returns a string
shorter than the platform's BUFSIZ macro (512 for MSVC).

The current StreamReader.readline() looks like this:
---------------------------------------
    def readline(self, size=None, keepends=True):

        """ Read one line from the input stream and return the
            decoded data.

            size, if given, is passed as size argument to the
            read() method.

        """
        if size is None:
            size = 10
        line = u""
        while True:
            data = self.read(size)
            line += data
            pos = line.find("\n")
            if pos>=0:
                self.charbuffer = line[pos+1:] + self.charbuffer
                if keepends:
                    line = line[:pos+1]
                else:
                    line = line[:pos]
                return line
            elif not data:
                return line
            if size<8000:
                size *= 2
---------------------------------------

However, the core expects readline() to return a string
of at most length size. readline() instead passes
size on to the read() function and keeps reading.
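
A minimal sketch of that mismatch, written in present-day Python
rather than taken from the 2.4 source. ToyReader is a hypothetical
stand-in for StreamReader (it does no decoding and just serves
characters from a string); its readline() copies the loop quoted
above. The returned line is longer than the size that was passed in.

```python
class ToyReader:
    """Hypothetical stand-in for codecs.StreamReader; no real decoding."""

    def __init__(self, text):
        self.pending = text      # data not yet handed out by read()
        self.charbuffer = ""     # characters pushed back by readline()

    def read(self, size):
        # Serve pushed-back characters first, then fresh data.
        data = self.charbuffer + self.pending
        self.charbuffer = ""
        self.pending = data[size:]
        return data[:size]

    def readline(self, size=None, keepends=True):
        # Same loop as the quoted method: size only sets the chunk size.
        if size is None:
            size = 10
        line = ""
        while True:
            data = self.read(size)
            line += data
            pos = line.find("\n")
            if pos >= 0:
                self.charbuffer = line[pos + 1:] + self.charbuffer
                if keepends:
                    line = line[:pos + 1]
                else:
                    line = line[:pos]
                return line
            elif not data:
                return line
            if size < 8000:
                size *= 2


reader = ToyReader("x" * 600 + "\n" + "next line\n")
line = reader.readline(512)
print(len(line))   # 601, although the caller asked for size=512
```

A caller that copies the result into a 512-byte buffer, as the
assertion quoted at the end of this report suggests the tokenizer
does, has no room for this line.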

There are multiple ways of solving this issue. 

a) Make the core reallocate the token memory if the
string returned by the readline function exceeds the
expected size (risky if the line is very long). 

b) Fix, rename, remodel or otherwise change
StreamReader.readline. If no other part of the core or
library relies on size not limiting the result, the
following readline() function behaves correctly with
arbitrarily long lines:

---------------------------------------
    def readline(self, size=None, keepends=True):

        """ Read one line from the input stream and return the
            decoded data.

            size, if given, is passed as size argument to the
            read() method.

        """
        if size is None:
            size = 10
        data = self.read(size)
        pos = data.find("\n")
        if pos>=0:
            self.charbuffer = data[pos+1:] + self.charbuffer
            if keepends:
                data = data[:pos+1]
            else:
                data = data[:pos]
            return data
        else:
            # Return the data directly since otherwise
            # we would exceed the given size.
            return data
---------------------------------------
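
To make the consequence of (b) concrete, here is a small sketch in
present-day Python. BoundedReader is hypothetical and mirrors the
proposed method on a plain string instead of a decoded stream, and
read_full_line is an assumed helper, not part of any patch. It shows
that a long line now comes back in pieces, so a caller that wants
whole lines has to loop.

```python
class BoundedReader:
    """Hypothetical reader whose readline() returns at most size chars."""

    def __init__(self, text):
        self.pending = text
        self.charbuffer = ""

    def read(self, size):
        data = self.charbuffer + self.pending
        self.charbuffer = ""
        self.pending = data[size:]
        return data[:size]

    def readline(self, size=None, keepends=True):
        if size is None:
            size = 10
        data = self.read(size)
        pos = data.find("\n")
        if pos >= 0:
            # Push everything after the line break back for the next call.
            self.charbuffer = data[pos + 1:] + self.charbuffer
            return data[:pos + 1] if keepends else data[:pos]
        return data  # partial line, or "" at end of input


def read_full_line(reader, size):
    """Assumed helper: join partial results until a line break or EOF."""
    parts = []
    while True:
        part = reader.readline(size)
        parts.append(part)
        if not part or part.endswith("\n"):
            return "".join(parts)


reader = BoundedReader("y" * 25 + "\n" + "tail\n")
chunks = [reader.readline(10) for _ in range(3)]
print([len(c) for c in chunks])   # [10, 10, 6]
```

With keepends=False a partial line and a complete one cannot be told
apart, so the helper relies on the default keepends=True.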

Reproducing this bug:
This bug can only be reproduced if your platform uses
a small BUFSIZ for stdio operations (e.g. MSVC). I
didn't check, but Linux might use more than 8000 bytes
for the default buffer size. That means you would have
to use a line with more than 8000 characters to
reproduce this.

In addition, this bug only shows up if the Python
library's StreamReader.readline() method is used. If
internal codecs like Latin-1 are used, there is no
crash, since the method isn't used.

I've attached a file that crashes my Python 2.4 on
Windows using the official binary released on
python.org today.

Last but not least, here is the assertion that fails
if Python is compiled in debug mode with MSVC++ 7.1:

Assertion failed: strlen(str) < (size_t)size, file
\Python-2.4\Parser\tokenizer.c, line 367

----------------------------------------------------------------------

>Comment By: M.-A. Lemburg (lemburg)
Date: 2004-12-02 22:21

Message:
Logged In: YES 
user_id=38388

Walter, your analysis is not quite right: the size parameter
for .readline() was always just a hint and never intended to
limit the number of bytes returned by the method (like the
size parameter in .read()).

In Python 2.3, the size parameter was simply passed down to
the stream's .readline() method, so semantics were defined by
the stream rather than the codec.

I think that we should restore this kind of behaviour for
Python 2.4.1.

Any code which makes assumptions on the length of the
returned data is simply broken. If the Python parser makes
such an assumption it should be fixed.

Again, the size parameter is just a hint for the
implementation. It might as well ignore it completely. The
reason for having the parameter is to limit the number of
bytes read in case the stream does not include line breaks -
otherwise, .readline() could end up reading the whole file
into memory.
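
A rough sketch of that pass-through arrangement in present-day
Python, not the 2.3 code itself. PassThroughReader is a hypothetical
wrapper; it hands size to the byte stream's own .readline() and
decodes the result. With io.BytesIO as the stream, the stream happens
to treat size as an upper bound, so a file without line breaks is
not read in one go.

```python
import io


class PassThroughReader:
    """Hypothetical wrapper: readline() delegates size to the byte stream."""

    def __init__(self, stream, encoding):
        self.stream = stream
        self.encoding = encoding

    def readline(self, size=-1):
        # The stream decides what size means. Decoding each piece on its
        # own is only safe for stateless single-byte encodings.
        return self.stream.readline(size).decode(self.encoding)


raw = io.BytesIO(b"z" * 100000)          # no line break anywhere
reader = PassThroughReader(raw, "iso-8859-15")
chunk = reader.readline(64)
print(len(chunk), raw.tell())            # only 64 bytes were consumed
```

This scheme breaks down for stateful or multi-byte codecs, where a
byte-level cut can fall inside a character.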

What was the reason why you introduced the change in semantics?

I would have complained about it, had I known about this change
(I only saw your change to .read(), which was required for the
stateful codecs).

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2004-12-02 21:58

Message:
Logged In: YES 
user_id=89016

I couldn't get Linux Python to crash or assert with the 
attached gurk.py; Windows Python crashed at i=70.

The problem we have is that Python 2.4 changed the meaning 
of the size parameter in StreamReader.readline(). There are 
at least four possible interpretations:

1) Never read more than size bytes from the underlying byte 
stream in one call to readline()

2) Never read more than size characters from the underlying 
byte stream in one call to readline()

3) If calling readline() requires reading from the underlying 
byte stream, do the reading in chunks of size bytes.

4) Never return more than size characters from readline(); if 
there's no linefeed within size characters the result is a partial 
line.
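
Interpretations 1) and 2) only coincide for single-byte encodings.
A small illustration in present-day Python, using UTF-8 as an
assumed example of a multi-byte codec: five bytes from the stream
yield only two characters, and the incremental decoder holds back
the incomplete third one.

```python
import codecs

text = "\u00e4\u00f6\u00fc\u20ac"  # four characters: a-umlaut, o-umlaut, u-umlaut, euro sign
raw = text.encode("utf-8")         # 2 + 2 + 2 + 3 = 9 bytes

decoder = codecs.getincrementaldecoder("utf-8")()
first = decoder.decode(raw[:5])    # two full characters plus half of the third
rest = decoder.decode(raw[5:], final=True)
print(len(raw), len(first), len(rest))   # 9 2 2
```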

In Python 2.3, 1) was used with the implicit assumption that 
this guarantees that the result will never be longer than size. 

Your patch looks like it could restore the old behaviour, but 
we'd lose the ability to reliably read a line until a "\n" is 
available without passing size=-1 into read(), which would read 
the whole stream.

----------------------------------------------------------------------

Comment By: Sebastian Hartte (dark-storm)
Date: 2004-12-01 21:36

Message:
Logged In: YES 
user_id=377356

I attached the .diff file for my patch. I hope I got this right.

----------------------------------------------------------------------

Comment By: Sebastian Hartte (dark-storm)
Date: 2004-12-01 21:32

Message:
Logged In: YES 
user_id=377356

"Can you attach a proper diff for your changes to 
StreamReader.readline()?"

I would like to, but I don't have access to a proper diff
utility on Windows. I will try to get one and then attach
the diff file.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2004-12-01 20:13

Message:
Logged In: YES 
user_id=89016

What I get when I execute your test.py on Windows is:
---
Fatal Python error: GC object already tracked

This application has requested the Runtime to terminate it in 
an unusual way.
Please contact the application's support team for more 
information.
---

Is this the expected failure?

Can you attach a proper diff for your changes to 
StreamReader.readline()?

----------------------------------------------------------------------
