[ python-Bugs-1076985 ] Incorrect behaviour of StreamReader.readline leads to crash
SourceForge.net
noreply at sourceforge.net
Fri Dec 3 21:01:00 CET 2004
Bugs item #1076985, was opened at 2004-12-01 19:51
Message generated for change (Comment added) made by doerwalter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1076985&group_id=5470
Category: Python Library
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Submitted By: Sebastian Hartte (dark-storm)
Assigned to: Nobody/Anonymous (nobody)
Summary: Incorrect behaviour of StreamReader.readline leads to crash
Initial Comment:
StreamReader.readline in codecs.py misinterprets the size
parameter (file: codecs.py, line: 296).
The reported crash happens if a file processed by the Python
tokenizer uses a non-built-in codec for the source file
encoding (e.g. iso-8859-15) and has lines that are longer than
the BUFSIZ define in stdio.h on the platform Python is
compiled on. (On Windows with MSVC++ this define is 512, so a
line longer than 511 characters should suffice to crash Python
with the right encoding.)
The crash happens because the Python core assumes that the
StreamReader.readline method returns a string shorter than the
platform's BUFSIZ macro (512 for MSVC).
The current StreamReader.readline() looks like this:
---------------------------------------
def readline(self, size=None, keepends=True):
    """ Read one line from the input stream and return the
        decoded data.

        size, if given, is passed as size argument
        to the read() method.
    """
    if size is None:
        size = 10
    line = u""
    while True:
        data = self.read(size)
        line += data
        pos = line.find("\n")
        if pos >= 0:
            self.charbuffer = line[pos+1:] + self.charbuffer
            if keepends:
                line = line[:pos+1]
            else:
                line = line[:pos]
            return line
        elif not data:
            return line
        if size < 8000:
            size *= 2
---------------------------------------
However, the core expects readline() to return a string of at
most length size, whereas readline() simply passes size on to
the read() function and keeps accumulating data.
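A quick way to see the mismatch (a sketch using StringIO as a
stand-in byte stream; the exact length printed may vary):
---------------------------------------
import codecs
from StringIO import StringIO

# 9000 characters and no newline until the end: readline(512)
# keeps doubling its internal read size and returns the whole
# 9001-character line, far more than the requested 512.
raw = "x" * 9000 + "\n"
reader = codecs.getreader("iso-8859-15")(StringIO(raw))
line = reader.readline(512)
print len(line)   # 9001 on Python 2.4, not <= 512
---------------------------------------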
There are multiple ways of solving this issue.
a) Make the core reallocate the token memory if the string
returned by the readline function exceeds the expected size
(risky if the line is very long).
b) Fix, rename, remodel or otherwise change
StreamReader.readline. If no other part of the core or library
code depends on the current behaviour of size, the following
readline() behaves correctly with arbitrarily long lines:
---------------------------------------
def readline(self, size=None, keepends=True):
    """ Read one line from the input stream and return the
        decoded data.

        size, if given, is passed as size argument
        to the read() method.
    """
    if size is None:
        size = 10
    data = self.read(size)
    pos = data.find("\n")
    if pos >= 0:
        self.charbuffer = data[pos+1:] + self.charbuffer
        if keepends:
            data = data[:pos+1]
        else:
            data = data[:pos]
        return data
    else:
        # Return the data directly, since otherwise
        # we would exceed the given size.
        return data
---------------------------------------
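For completeness, a small usage sketch of the patched method
(the file name and encoding are illustrative only):
---------------------------------------
import codecs

# Wrap a byte stream in a StreamReader for a non-built-in
# codec; "source.py" is a hypothetical file.
stream = open("source.py", "rb")
reader = codecs.getreader("iso-8859-15")(stream)

# With the patched readline(), each call performs a single
# read(512), so the result stays within what the core expects.
line = reader.readline(512)
---------------------------------------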
Reproducing this bug:
This bug can only be reproduced if your platform uses a small
BUFSIZ for stdio operations (e.g. MSVC). I didn't check, but
Linux might use more than 8000 bytes for the default buffer
size, which means you would have to use a line with more than
8000 characters to reproduce it there.
In addition, the bug only shows if the Python library's
StreamReader.readline() method is used; with built-in codecs
like Latin-1 there is no crash, since that method isn't used.
I've attached a file that crashes my Python 2.4 on
Windows using the official binary released on
python.org today.
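For reference, a generator sketch in the spirit of the
attached file (names and sizes are illustrative only):
---------------------------------------
# Write a source file that declares a non-built-in codec and
# contains one line well past MSVC's BUFSIZ of 512 bytes, then
# compile it; on an affected build this should trip the crash.
f = open("longline.py", "w")
f.write("# -*- coding: iso-8859-15 -*-\n")
f.write("x = '" + "a" * 1000 + "'\n")
f.close()

import py_compile
py_compile.compile("longline.py")
---------------------------------------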
Last but not least, here is the assertion that fails if Python
is compiled in debug mode with MSVC++ 7.1:
Assertion failed: strlen(str) < (size_t)size, file
\Python-2.4\Parser\tokenizer.c, line 367
----------------------------------------------------------------------
>Comment By: Walter Dörwald (doerwalter)
Date: 2004-12-03 21:01
Message:
Logged In: YES
user_id=89016
> I checked the decoding_fgets function (and the enclosed call
> to fp_readl). The patch is more problematic than I thought,
> since decoding_fgets not only takes a pointer to the token
> state but also a pointer to a destination string buffer.
> Reallocating the buffer within fp_readl would mean a very
> very bad hack, since you'd have to reallocate "foreign"
> memory based on a pointer comparison (comparing the passed
> string buffer's pointer against tok->inp || tok->buf...).
Maybe all pointers pointing into the buffer should be moved
into a struct?
> As it stands now, patching the tokenizer would mean changing
> the function signatures or otherwise changing the structure
> (more error prone).
All the affected functions seem to be static, so at least in
this regard there shouldn't be any problem.
> Another possible solution would be to
> provide a specialized readline() function for which the
> assumption that at most size bytes are returned is correct.
All the codecs would have to provide such a readline().
BTW, the more I look at your patch the more I think
that it gets us as close to the old behaviour as we
can get.
> Oh and about that UTF-8 decoding. readline()'s size
> restriction works on the already decoded string (at least it
> should), so that shouldn't be an issue.
I don't understand that. fp_readl() does the following
two calls:
buf = PyObject_Call(tok->decoding_readline, args, NULL);
utf8 = PyUnicode_AsUTF8String(buf);
and puts the resulting byte string into the char * passed in,
so even if we fix the readline call, the UTF-8 encoded string
might still overflow the available space. How can tokenizer.c
be sure how much the foo->utf8 transcoding shrinks or expands
the string?
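For example, a quick demonstration of the expansion:
---------------------------------------
# One decoded character may become several bytes when encoded
# to UTF-8, so the re-encoded line can outgrow the decoded one.
line = u"\u20ac" * 200             # 200 characters (EURO SIGN)
utf8 = line.encode("utf-8")        # 600 bytes, 3 per character
assert len(utf8) == 3 * len(line)
---------------------------------------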
> Maybe another
> optional parameter should be added to readline() called
> limit=None which doesn't limit the returned string by
> default, but does so if the parameter is a positive number.
But limit to what?
----------------------------------------------------------------------
Comment By: Walter Dörwald (doerwalter)
Date: 2004-12-03 20:43
Message:
Logged In: YES
user_id=89016
> The problem with the change is that it applies to *all*
> codecs. If only the UTF-16 codec has a problem with the
> standard logic, it should override the .readline()
> method as necessary, but this should not affect all
> the other codecs.
readline() had to be rewritten anyway to take the
bytebuffer into account. True, the bytebuffer is
only needed for UTF-8, UTF-16, UTF-16-LE, UTF-16-BE
and maybe a few others, but unless we'd introduced a
ByteBufferStreamReader that those codecs can subclass
this would have meant code duplication.
I'll try to come up with a readline() patch (for the base
class readline()) next week.
>
>> BTW, the logic in read() looks rather convoluted to me
>> now that I look at it a second time. Should we clean
>> this up a bit?
>
> If that's possible, yes :-)
Attached is a patch (fixread.diff) that makes read()
a *little* simpler. Making it much simpler breaks several
tests.
----------------------------------------------------------------------
Comment By: Sebastian Hartte (dark-storm)
Date: 2004-12-03 16:42
Message:
Logged In: YES
user_id=377356
I checked the decoding_fgets function (and the enclosed call
to fp_readl). The patch is more problematic than I thought,
since decoding_fgets not only takes a pointer to the token
state but also a pointer to a destination string buffer.
Reallocating the buffer within fp_readl would mean a very
very bad hack, since you'd have to reallocate "foreign"
memory based on a pointer comparison (comparing the passed
string buffer's pointer against tok->inp || tok->buf...).
As it stands now, patching the tokenizer would mean changing
the function signatures or otherwise changing the structure
(more error prone). Another possible solution would be to
provide a specialized readline() function for which the
assumption that at most size bytes are returned is correct.
Oh and about that UTF-8 decoding. readline()'s size
restriction works on the already decoded string (at least it
should), so that shouldn't be an issue. Maybe another
optional parameter should be added to readline() called
limit=None which doesn't limit the returned string by
default, but does so if the parameter is a positive number.
Just my $0.02.
----------------------------------------------------------------------
Comment By: M.-A. Lemburg (lemburg)
Date: 2004-12-03 15:46
Message:
Logged In: YES
user_id=38388
>>In Python 2.3, the size parameter was simply passed down
>>to the stream's .readline() method, so semantics were
>>defined by the stream rather than the codec.
>
>
> In most cases this meant that there were at most size bytes
> read from the byte stream. It seems that tokenizer.c:fp_readl
> () assumes that, so it broke with the change.
>
>
>>I think that we should restore this kind of behaviour
>>for Python 2.4.1.
>
>
> That's not possible (switching back to calling readline() on
> the bytestream), because it breaks the UTF-16 decoder, but we
> can get something that is close.
The problem with the change is that it applies to *all*
codecs. If only the UTF-16 codec has a problem with the
standard logic, it should override the .readline()
method as necessary, but this should not affect all
the other codecs.
Unless I'm missing something, the other codecs
work just fine with the old implementation of the
method.
>>What was the reason why you introduced the change
>>in semantics ?
>
>
> 1) To get readline() to work with UTF-16 it's no longer
> possible to call readline() for the byte stream. This has to
> be replaced by one or more calls to read().
>
> 2) As you say size was always just a hint. With line
> buffering (which is required for UTF-16 readline) this hint
> becomes even more meaningless.
That's OK for the UTF-16 codec, but why the generic change
in the base class ?
> So I'd say:
>
> 1) Fix tokenizer.c:fp_readl(). It looks to me like the code had
> a problem in this spot anyway: There's an
>
> assert(strlen(str) < (size_t)size); /* XXX */
>
> in the code, and a string *can* get longer when it's encoded
> to UTF-8 which fp_readl() does.
>
> dark-storm, if you can provide a patch for this problem, go
> ahead.
+1
> 2) change readline(), so that calling it with a size parameter
> results in only one call to read(). If read() is called with
> chars==-1 (which it will in this case), this will in turn
> only call read() for the byte stream once (at most). If size
> isn't specified the caller should be able to cope with any
> returned string length, so I think the current behaviour
> (calling read() multiple times until a "\n" shows up) can be
> kept.
+1 for UTF-16
I'm not sure whether the current implementation is needed
for all other codecs.
> BTW, the logic in read() looks rather convoluted to me now
> that I look at it a second time. Should we clean this up a
> bit?
If that's possible, yes :-)
----------------------------------------------------------------------
Comment By: Walter Dörwald (doerwalter)
Date: 2004-12-03 12:58
Message:
Logged In: YES
user_id=89016
> In Python 2.3, the size parameter was simply passed down
> to the stream's .readline() method, so semantics were
> defined by the stream rather than the codec.
In most cases this meant that there were at most size bytes
read from the byte stream. It seems that
tokenizer.c:fp_readl() assumes that, so it broke with the
change.
> I think that we should restore this kind of behaviour
> for Python 2.4.1.
That's not possible (switching back to calling readline() on the
bytestream), because it breaks the UTF-16 decoder, but we
can get something that is close.
> What was the reason why you introduced the change
> in semantics ?
1) To get readline() to work with UTF-16 it's no longer
possible to call readline() for the byte stream. This has to be
replaced by one or more calls to read().
2) As you say size was always just a hint. With line buffering
(which is required for UTF-16 readline) this hint becomes even
more meaningless.
So I'd say:
1) Fix tokenizer.c:fp_readl(). It looks to me like the code had
a problem in this spot anyway: There's an
assert(strlen(str) < (size_t)size); /* XXX */
in the code, and a string *can* get longer when it's encoded
to UTF-8 which fp_readl() does.
dark-storm, if you can provide a patch for this problem, go
ahead.
2) change readline(), so that calling it with a size parameter
results in only one call to read(). If read() is called with
chars==-1 (which it will in this case), this will in turn only call
read() for the byte stream once (at most). If size isn't
specified the caller should be able to cope with any returned
string length, so I think the current behaviour (calling read()
multiple times until a "\n" shows up) can be kept.
BTW, the logic in read() looks rather convoluted to me now
that I look at it a second time. Should we clean this up a bit?
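For illustration, a rough sketch of what 2) might look like
(reusing the existing self.read()/self.charbuffer machinery;
this is a sketch, not the actual patch):
---------------------------------------
def readline(self, size=None, keepends=True):
    if size is not None:
        # Sized call: exactly one read(); may return a
        # partial line if no "\n" arrived yet.
        data = self.read(size)
        pos = data.find("\n")
        if pos >= 0:
            self.charbuffer = data[pos+1:] + self.charbuffer
            if keepends:
                return data[:pos+1]
            return data[:pos]
        return data
    # Unsized call: keep reading until a "\n" or EOF shows up.
    line = u""
    chunk = 10
    while True:
        data = self.read(chunk)
        line += data
        pos = line.find("\n")
        if pos >= 0:
            self.charbuffer = line[pos+1:] + self.charbuffer
            if keepends:
                return line[:pos+1]
            return line[:pos]
        if not data:
            return line
        if chunk < 8000:
            chunk *= 2
---------------------------------------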
----------------------------------------------------------------------
Comment By: Sebastian Hartte (dark-storm)
Date: 2004-12-02 22:38
Message:
Logged In: YES
user_id=377356
I forgot to reply to doerwalter's comment.
*snip* I couldn't get Linux Python to crash or assert with the
attached gurk.py, Windows Python crashed at i=70. *snip*
Linux is using a larger BUFSIZ for me. MUCH larger, I have to
add. MSVC used 512 bytes for me, while the C standard library
on my Linux box defines BUFSIZ to be 8192 bytes. That should
suffice for looooong lines.
----------------------------------------------------------------------
Comment By: Sebastian Hartte (dark-storm)
Date: 2004-12-02 22:35
Message:
Logged In: YES
user_id=377356
The problem here is that the tokenizer tries to read at most
512 bytes into its token buffer and then reads another 512
bytes if the line isn't complete yet. At least that is what a
quick look at the tokenizer's freadln (or whatever the
function the assert is in is called) made me think.
So the token buffer either has to be reallocated (easy) or
another solution has to be found. If reallocating the token
buffer is sufficient, I can easily provide a patch for that.
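In Python terms, the pattern looks roughly like this (an
illustrative model only; the real code is C in
Parser/tokenizer.c, and readline_func stands in for the
codec's readline method):
---------------------------------------
# The tokenizer fills a fixed-size buffer in BUFSIZ-sized
# steps and relies on every readline call returning at most
# BUFSIZ - 1 bytes; a longer return overruns the buffer
# instead of growing it.
BUFSIZ = 512

def read_logical_line(readline_func):
    line = ""
    while True:
        chunk = readline_func(BUFSIZ)  # assumed to honour the limit
        line += chunk
        if chunk == "" or line.endswith("\n"):
            return line
---------------------------------------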
----------------------------------------------------------------------
Comment By: M.-A. Lemburg (lemburg)
Date: 2004-12-02 22:21
Message:
Logged In: YES
user_id=38388
Walter, your analysis is not quite right: the size parameter
for .readline() was always just a hint and never intended to
limit the number of bytes returned by the method (like the
size parameter in .read()).
In Python 2.3, the size parameter was simply passed down to
the stream's .readline() method, so semantics were defined by
the stream rather than the codec.
I think that we should restore this kind of behaviour for
Python 2.4.1.
Any code which makes assumptions on the length of the returned
data is simply broken. If the Python parser makes such an
assumption, it should be fixed.
Again, the size parameter is just a hint for the
implementation. It might as well ignore it completely. The
reason for having the parameter is to limit the number of
bytes read in case the stream does not include line breaks -
otherwise, .readline() could end up reading the whole file
into memory.
What was the reason why you introduced the change in semantics ?
I would have complained about it, had I known about this change
(I only saw your change to .read(), which was required for the
stateful codecs).
----------------------------------------------------------------------
Comment By: Walter Dörwald (doerwalter)
Date: 2004-12-02 21:58
Message:
Logged In: YES
user_id=89016
I couldn't get Linux Python to crash or assert with the
attached gurk.py, Windows Python crashed at i=70.
The problem we have is that Python 2.4 changed the meaning
of the size parameter in StreamReader.readline(). There are
at least four possible interpretations:
1) Never read more than size bytes from the underlying byte
stream in one call to readline()
2) Never read more than size characters from the underlying
byte stream in one call to readline()
3) If calling readline() requires reading from the underlying
byte stream, do the reading in chunks of size bytes.
4) Never return more than size characters from readline(), if
there's no linefeed within size characters the result is a partial
line.
In Python 2.3, 1) was used, with the implicit assumption that
this guarantees that the result will never be longer than size.
Your patch looks like it could restore the old behaviour, but
we'd lose the ability to reliably read a line until a "\n" is
available without passing size=-1 into read(), which would
read the whole stream.
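For concreteness, 4) could be emulated on top of any
readline() roughly as follows (readline_capped is a
hypothetical helper, not part of any patch):
---------------------------------------
def readline_capped(reader, size, keepends=True):
    # Cap the *returned* string at size characters; push any
    # excess back into the reader's character buffer so the
    # next call continues where this one stopped.
    line = reader.readline()
    if len(line) > size:
        reader.charbuffer = line[size:] + reader.charbuffer
        line = line[:size]
    if not keepends and line.endswith("\n"):
        line = line[:-1]
    return line
---------------------------------------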
----------------------------------------------------------------------
Comment By: Sebastian Hartte (dark-storm)
Date: 2004-12-01 21:36
Message:
Logged In: YES
user_id=377356
I attached the .diff file for my patch. I hope I got this right.
----------------------------------------------------------------------
Comment By: Sebastian Hartte (dark-storm)
Date: 2004-12-01 21:32
Message:
Logged In: YES
user_id=377356
"Can you attach a proper diff for your changes to
StreamReader.readline()?"
I would like to, but I don't have access to a proper diff
utility on Windows. I will try to get one and then attach
the diff file.
----------------------------------------------------------------------
Comment By: Walter Dörwald (doerwalter)
Date: 2004-12-01 20:13
Message:
Logged In: YES
user_id=89016
What I get when I execute your test.py on Windows is:
---
Fatal Python error: GC object already tracked
This application has requested the Runtime to terminate it in
an unusual way.
Please contact the application's support team for more
information.
---
Is this the expected failure?
Can you attach a proper diff for your changes to
StreamReader.readline()?
----------------------------------------------------------------------