[ python-Bugs-1076985 ] Incorrect behaviour of StreamReader.readline leads to crash

Wed Dec 1 21:36:20 CET 2004

Bugs item #1076985, was opened at 2004-12-01 19:51
Message generated for change (Comment added) made by dark-storm
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1076985&group_id=5470

Category: Python Library
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Submitted By: Sebastian Hartte (dark-storm)
Assigned to: Nobody/Anonymous (nobody)
Summary: Incorrect behaviour of StreamReader.readline leads to crash

Initial Comment:
StreamReader.readline in codecs.py misinterprets the
size parameter (file: codecs.py, line: 296).
The reported crash happens when a file processed by the
Python tokenizer declares a source encoding whose codec is
not built in (e.g. iso-8859-15) and contains lines
longer than the BUFSIZ macro defined in stdio.h on
the platform Python was compiled on. (With MSVC++ on
Windows this macro is 512, so a line longer
than 511 characters should suffice to crash Python when
such an encoding is declared.)
The crash happens because the Python core assumes that
the StreamReader.readline method returns a string
shorter than the platform's BUFSIZ macro (512 for MSVC).

The current StreamReader.readline() looks like this:
---------------------------------------
    def readline(self, size=None, keepends=True):

        """ Read one line from the input stream and return the
            decoded data.

            size, if given, is passed as size argument to the
            read() method.

        """
        if size is None:
            size = 10
        line = u""
        while True:
            data = self.read(size)
            line += data
            pos = line.find("\n")
            if pos>=0:
                self.charbuffer = line[pos+1:] + self.charbuffer
                if keepends:
                    line = line[:pos+1]
                else:
                    line = line[:pos]
                return line
            elif not data:
                return line
            if size<8000:
                size *= 2
---------------------------------------

However, the core expects readline() to return a string of
at most size characters. readline() instead only passes
size on to read() as a chunk size and keeps reading until
it finds a newline.
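
To illustrate the mismatch, here is a minimal sketch with a toy
reader. ToyReader is hypothetical: it only imitates the interplay
of read() and charbuffer, it is not the real codecs.StreamReader,
and it is written for current Python rather than 2.4. The
readline() loop is the one quoted above.

```python
import io


class ToyReader:
    """Hypothetical stand-in for codecs.StreamReader (not the real class)."""

    def __init__(self, text):
        self.stream = io.StringIO(text)
        self.charbuffer = ""

    def read(self, size):
        # Serve pushed-back characters first, then the underlying stream.
        if self.charbuffer:
            data = self.charbuffer[:size]
            self.charbuffer = self.charbuffer[size:]
            return data
        return self.stream.read(size)

    def readline(self, size=None, keepends=True):
        # Same loop as the quoted method: size only sets the chunk size.
        if size is None:
            size = 10
        line = ""
        while True:
            data = self.read(size)
            line += data
            pos = line.find("\n")
            if pos >= 0:
                self.charbuffer = line[pos+1:] + self.charbuffer
                if keepends:
                    line = line[:pos+1]
                else:
                    line = line[:pos]
                return line
            elif not data:
                return line
            if size < 8000:
                size *= 2


# A 600-character line followed by a short one.
reader = ToyReader("x" * 600 + "\n" + "short\n")
line = reader.readline(512)
print(len(line) > 512)  # the result is longer than the requested size
```

A caller that copies the result into a buffer of size characters
would overrun it here, which is the situation the report describes.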

There are multiple ways of solving this issue. 

a) Make the core reallocate the token memory if the
string returned by the readline function exceeds the
expected size (risky if the line is very long). 

b) Fix, rename, or remodel StreamReader.readline.
Provided no other part of the core or library relies on
the current behaviour, where size does not limit the
result, the following readline() function behaves
correctly with arbitrarily long lines:

---------------------------------------
    def readline(self, size=None, keepends=True):

        """ Read one line from the input stream and return the
            decoded data.

            size, if given, is passed as size argument to the
            read() method.

        """
        if size is None:
            size = 10
        data = self.read(size)
        pos = data.find("\n")
        if pos>=0:
            self.charbuffer = data[pos+1:] + self.charbuffer
            if keepends:
                data = data[:pos+1]
            else:
                data = data[:pos]
            return data
        else:
            # Return the data directly since otherwise
            # we would exceed the given size.
            return data
---------------------------------------
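
With a size-bounded readline(), a long line comes back in several
pieces, so a caller that wants the whole line has to join them. The
following is a rough sketch of that idea, not the attached patch:
BoundedReader and read_full_line are hypothetical names, the read()
method is a simplified assumption, and the code targets current
Python rather than 2.4.

```python
import io


class BoundedReader:
    """Hypothetical reader whose readline() returns at most size characters."""

    def __init__(self, text):
        self.stream = io.StringIO(text)
        self.charbuffer = ""

    def read(self, size):
        # Top up the pushback buffer, then hand out at most size characters.
        if len(self.charbuffer) < size:
            self.charbuffer += self.stream.read(size - len(self.charbuffer))
        data = self.charbuffer[:size]
        self.charbuffer = self.charbuffer[size:]
        return data

    def readline(self, size=None, keepends=True):
        # One read() only, as in the proposed method above.
        if size is None:
            size = 10
        data = self.read(size)
        pos = data.find("\n")
        if pos >= 0:
            self.charbuffer = data[pos+1:] + self.charbuffer
            data = data[:pos+1] if keepends else data[:pos]
        return data


def read_full_line(reader, size):
    """Caller-side loop: join fragments until a newline or end of input."""
    parts = []
    while True:
        chunk = reader.readline(size)
        parts.append(chunk)
        if not chunk or chunk.endswith("\n"):
            return "".join(parts)


reader = BoundedReader("x" * 600 + "\n" + "short\n")
first = reader.readline(512)        # first fragment, no newline yet
rest = read_full_line(reader, 512)  # remainder of the same line
print(len(first), len(rest))
```

Note that the caller-side loop relies on keepends=True to detect
the end of a line; with keepends=False a fragment and a complete
line cannot be told apart.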

Reproducing this bug:
This bug can only be reproduced if your platform uses
a small BUFSIZ for stdio operations (e.g. MSVC). I
didn't check, but Linux might use more than 8000 bytes
for the default buffer size. In that case you would have
to use a line with more than 8000 characters to
reproduce this.

In addition, this bug only shows up if the Python
library's StreamReader.readline() method is used. If
built-in codecs like Latin-1 are used, there is no
crash since the method isn't used.
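
A test file of the kind described could be generated roughly like
this. It is only a sketch under assumptions: the file name, the
600-character length, and the use of a long comment line are my
choices for illustration, and the script uses current Python to
write the file. Whether this exact file triggers the crash on a
given build is untested.

```python
import os
import tempfile

LINE_LENGTH = 600  # assumed; anything above the 511 characters named above


def write_long_line_source(path, length=LINE_LENGTH):
    """Write a source file with a coding declaration and one long line."""
    with open(path, "w", encoding="iso-8859-15", newline="\n") as handle:
        handle.write("# -*- coding: iso-8859-15 -*-\n")
        handle.write("# " + "x" * length + "\n")
    return path


# Hypothetical file name inside a fresh temporary directory.
target = os.path.join(tempfile.mkdtemp(), "longline_test.py")
write_long_line_source(target)

with open(target, encoding="iso-8859-15") as handle:
    lines = handle.read().splitlines()
print(len(lines[1]) > 511)
```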

I've attached a file that crashes my Python 2.4 on
Windows using the official binary released on
python.org today.

Last but not least, here is the assertion that fails
if Python is compiled in debug mode with MSVC++ 7.1:

Assertion failed: strlen(str) < (size_t)size, file
\Python-2.4\Parser\tokenizer.c, line 367

----------------------------------------------------------------------

>Comment By: Sebastian Hartte (dark-storm)
Date: 2004-12-01 21:36

Message:
Logged In: YES 
user_id=377356

I attached the .diff file for my patch. I hope I got this right.

----------------------------------------------------------------------

Comment By: Sebastian Hartte (dark-storm)
Date: 2004-12-01 21:32

Message:
Logged In: YES 
user_id=377356

"Can you attach a proper diff for your changes to 
StreamReader.readline()?"

I would like to, but I don't have access to a proper diff
utility on Windows. I will try to get one and then attach
the diff file.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2004-12-01 20:13

Message:
Logged In: YES 
user_id=89016

What I get when I execute your test.py on Windows is:
---
Fatal Python error: GC object already tracked

This application has requested the Runtime to terminate it in 
an unusual way.
Please contact the application's support team for more 
information.
---

Is this the expected failure?

Can you attach a proper diff for your changes to 
StreamReader.readline()?

----------------------------------------------------------------------
