[Patches] [ python-Patches-1101726 ] Patch for potential buffer overrun in tokenizer.c

SourceForge.net noreply at sourceforge.net
Tue May 17 15:47:10 CEST 2005


Patches item #1101726, was opened at 2005-01-13 06:45
Message generated for change (Comment added) made by glchapman
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1101726&group_id=5470

Category: Core (C code)
Group: Python 2.4
Status: Open
Resolution: None
Priority: 7
Submitted By: Greg Chapman (glchapman)
Assigned to: Martin v. Löwis (loewis)
Summary: Patch for potential buffer overrun in tokenizer.c

Initial Comment:
The fp_readl function in tokenizer.c has a potential
buffer overrun; see:

www.python.org/sf/1089395

It is also triggered by trying to generate the pycom
file for the Excel 11.0 typelib, which is where I ran
into it.

The attached patch allows successful generation of the
Excel file; it also runs the fail.py script from the
above report without an access violation.  It also
doesn't break any of the tests in the regression suite
(probably something like fail.py should be added as a
test).

It is not as efficient as it might be; with a function
for determining the number of unicode characters in a
utf8 string, you could avoid some memory allocations. 
Perhaps such a function should be added to unicodeobject.c?

And, of course, the patch definitely needs review.  I'm
especially concerned that my use of
tok->decoding_buffer might be violating some kind of
assumption that I missed.

----------------------------------------------------------------------

>Comment By: Greg Chapman (glchapman)
Date: 2005-05-17 05:47

Message:
Logged In: YES 
user_id=86307

Looks good to me.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-05-16 06:11

Message:
Logged In: YES 
user_id=89016

Here is another version of the patch, this version doesn't
pass any arguments to readline(), because it's no longer
neccessary to limit the line length. What do you think?

----------------------------------------------------------------------

Comment By: Greg Chapman (glchapman)
Date: 2005-03-21 04:47

Message:
Logged In: YES 
user_id=86307

You're quite right, 2.4 AV's reliably using the new
test_pep263.  The copy of python24.dll I was testing against
already had (the first version of) the patch applied. 
Anyway, I'm attaching a diff for test_pep263 (against the
original 2.4 test_pep263) which incorporates your
suggestions about not assuming the current directory.


----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-03-14 12:43

Message:
Logged In: YES 
user_id=89016

Maybe I've put the test files in the wrong spot, but what
I'm getting is:
> ./python Lib/test/test_pep263.py
test_pep263 (__main__.PEP263Test) ... ok
test_evilencoding (__main__.EvilEncodingTest) ... ok
Segmentation fault

You should probably put pep263_evilencoding.py and
pep263_longline.py alongside test_pep263.py and then find
then via os.path.join(__file__.rsplit(os.sep, 1)[0],
"pep263_longline.py")

----------------------------------------------------------------------

Comment By: Greg Chapman (glchapman)
Date: 2005-02-27 06:22

Message:
Logged In: YES 
user_id=86307

After thinking about it some more, I realized that fp_readl
doesn’t have to do any utf8 manipulation.  If the string it
copies into the buffer does not include a \n, the buffer is
expanded and fp_readl is called again until a \n is returned
(or the end of the file is reached).  It doesn’t matter if,
during this loop, the buffer contains temporarily broken
utf8 characters.  So I’m attaching another new patch.  This
copies as many characters as fit into the buffer and saves
any overflow into a new string object (stored in tok
>decoding_buffer).  When it is called again by the loop, it
uses this overflow string to copy characters into the buffer.

I’m also attaching a zip file with a modified test_pep263.py
and some ancillary files.  To test fp_readl, python needs to
read from a file.  Perhaps it would be better to generate
temporary files rather than include these extra files. 
Also, although the tests will trigger the assert using a
debug build of the 2.4 source, I’m not getting an access
violation running regrtest against a release build (of 2.4)
– perhaps different tests are in order?



----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-02-23 10:48

Message:
Logged In: YES 
user_id=89016

I think the patch looks good. Staring at it for a while I
believe the reference counts are OK. Can you incorporate
fail.py into the test suite?

----------------------------------------------------------------------

Comment By: Greg Chapman (glchapman)
Date: 2005-01-23 05:26

Message:
Logged In: YES 
user_id=86307

I'm attaching a new patch (tokenizer.c.2.diff), which should
be better since it avoids some unnecessary allocations and
decoding/encoding.  Hopefully, I haven't made any
unwarranted assumptions about UTF8.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1101726&group_id=5470


More information about the Patches mailing list