[ python-Bugs-1178484 ] Erroneous line number error in Py2.4.1

Thu Aug 11 17:22:27 CEST 2005

Bugs item #1178484, was opened at 2005-04-07 14:33
Message generated for change (Comment added) made by birkenfeld
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1178484&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Parser/Compiler
Group: Python 2.4
Status: Open
Resolution: Accepted
Priority: 5
Submitted By: Timo Linna (tilinna)
Assigned to: Martin v. Löwis (loewis)
Summary: Erroneous line number error in Py2.4.1

Initial Comment:
For some reason Python 2.3.5 reports the error in the 
following program correctly: 

  File "C:\Temp\problem.py", line 7 
SyntaxError: unknown decode error 

..whereas Python 2.4.1 reports an invalid line number: 

  File "C:\Temp\problem.py", line 2 
SyntaxError: unknown decode error 

----- problem.py starts ----- 
# -*- coding: ascii -*- 

""" 
Foo bar 
""" 

# Ä is not allowed in ascii coding 
----- problem.py ends -----

Without the encoding declaration both Python versions 
report the usual deprecation warning (just like they 
should be doing). 

My environment: Windows 2000 + SP3. 

----------------------------------------------------------------------

>Comment By: Reinhold Birkenfeld (birkenfeld)
Date: 2005-08-11 17:22

Message:
Logged In: YES 
user_id=1188172

Would it be appropriate to raise priority, then, to ensure
this doesn't get overlooked before 2.4.2?

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-08-11 16:04

Message:
Logged In: YES 
user_id=89016

Somehow I forgot to upload the patch. Here it is
(diff2.txt). I'd like this patch to go into 2.4.2.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-05-19 21:06

Message:
Logged In: YES 
user_id=89016

> Walter, as I've said before: I know that you need buffering
> for the UTF-x readline support, but I don't see a
> requirement for it in most of the other codecs

The *charbuffer* is required for readline support, but the
*bytebuffer* is required for any non-charmap codec.

To have different buffering modes we'd either need a flag in
the StreamReader or use different classes, i.e. a class
hierarchy like the following:

StreamReader
   UnbufferedStreamReader
      CharmapStreamReader
         ascii.StreamReader
         iso_8859_1.StreamReader
   BufferedStreamReader
         utf_8.StreamReader

I don't think that we should introduce such a big change in
2.4.x. Furthermore there is another problem: The 2.4
buffering code automatically gives us universal newline
support. If you have a file foo.txt containing "a\rb", with
Python 2.4 you get:

>>> list(codecs.open("foo.txt", "rb", "latin-1"))
[u'a\r', u'b']

But with Python 2.3 you get:

>>> list(codecs.open("foo.txt", "rb", "latin-1"))
[u'a\rb']

If we would switch to the old StreamReader for the charmap
codecs, suddenly the stream reader for e.g. latin-1 and
UTF-8 would behave differently. Of course we could change
the buffering stream reader to only split lines on "\n", but
this would change functionality again.

> Your argument about applications making implications on the
> file position after having used .readline() is true, but
> still many applications rely on this behavior which is not
> as far fetched as it may seem given that they normally only
> expect 8-bit data.

If an application doesn't mix calls to read() with calls to
readline() (or different size values in these calls), the
change in behaviour from 2.3 to 2.4 shouldn't be this big.

No matter what we decide for the codecs, the tokenizer is
broken and should be fixed.

> Wouldn't it make things a lot safer if we only use buffering
> per default in the UTF-x codecs and revert back to the old
> non-buffered behavior for the other codecs which has worked
> well in the past ?!

Only if we'd drop the additional functionality added in 2.4.
(universal
newline support, the chars argument for read() and the
keepends argument for readline().), which I think could only
be done for 2.5.

> About your patch:
>
> * Please explain what firstline is supposed to do
> (preferably in the doc-string).

OK, I've added an explanation in the docstring.

> * Why is firstline always set in .readline() ?

firstline is only supposed to be used by readline(). We could
rename the argument to _firstline to make it clear that this is
a private parameter, or introduce a new method _read() that
has a firstline parameter. Then read() calls _read() with
firstline==False and readline() calls _read() with
firstline==True.

The purpose of firstline is to make sure that if an input
stream has
its first decoding error in line n, that the
UnicodeDecodeError will only be raised by the n'th call to
readline().

> * Please remove the print repr()

OK, done.

> * You cannot always be sure that exc has a .start attribute,
> so you need to accomocate for this situation as well

I don't understand that. A UnicodeDecodeError is created by
PyUnicodeDecodeError_Create() in exceptions.c, so any
UnicodeDecodeError instance without a start attribute would
be severely broken.

Thanks for reviewing the patch.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2005-05-18 11:31

Message:
Logged In: YES 
user_id=38388

Walter, as I've said before: I know that you need buffering
for the UTF-x readline support, but I don't see a
requirement for it in most of the other codecs. E.g. an
ascii codec or latin-1 codec will only ever see standard
line ends (not Unicode ones), so the streams .readline()
method can be used directly - just like we did before the
buffering code was added.

Your argument about applications making implications on the
file position after having used .readline() is true, but
still many applications rely on this behavior which is not
as far fetched as it may seem given that they normally only
expect 8-bit data.

Wouldn't it make things a lot safer if we only use buffering
per default in the UTF-x codecs and revert back to the old
non-buffered behavior for the other codecs which has worked
well in the past ?!

About your patch:

* Please explain what firstline is supposed to do
(preferably in the doc-string).
* Why is firstline always set in .readline() ?
* Please remove the print repr()
* You cannot always be sure that exc has a .start attribute,
so you need to accomocate for this situation as well

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-05-17 18:50

Message:
Logged In: YES 
user_id=89016

It isn't the buffering support per se that breaks the
tokenizer. This problem exists even in Python 2.3.x (Simply
try the test scripts from http://www.python.org/sf/1089395
with Python 2.3.5 and you'll get a segfault). Applications
that rely on len(readline(x)) == x or anything similar are
broken anyway. Supporting buffered and unbuffered reading
would mean keeping the 2.3 mode of doing things around
indefinitely, and we'd loose readline() support for UTF-16
again.

BTW, applying Greg Chapman's patch
(http://www.python.org/sf/1101726, which fixes the
tokenizer) together with this one seems to fix the problem
from my previous post. So if you could give
http://www.python.org/sf/1101726 a third look, so we can get
it into 2.4.2, this would be great.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2005-05-17 11:13

Message:
Logged In: YES 
user_id=38388

Walter, I think that instead of trying to get the tokenizer
to work with the buffer support in the codecs, you should
add a flag that allows to switch off the buffer support in
the codecs altogether and then use the unbuffered mode
codecs in the tokenizer.

I expect that other applications will run into the same kind
of problem, so it should be possible to switch off buffering
if needed (maybe we should make this the default ?!).

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-05-16 10:35

Message:
Logged In: YES 
user_id=89016

OK, here is a patch. It adds an additional argument
firstline to read(). If this argument is true (i.e. if
called from readline()) and a decoding error happens, this
error will only be reported if it is in the first line.
Otherwise read() will decode up to the error position and
put the rest in the bytebuffer.

Unfortunately with this patch, I get a segfault with the
following stacktrace if I run the test. I don't know if this
is related to bug #1089395/patch #1101726. Martin, can you
take a look?

#0  0x08057ad1 in tok_nextc (tok=0x81ca7b0) at tokenizer.c:719
#1  0x08058558 in tok_get (tok=0x81ca7b0,
p_start=0xbffff3d4, p_end=0xbffff3d0) at tokenizer.c:1075
#2  0x08059331 in PyTokenizer_Get (tok=0x81ca7b0,
p_start=0xbffff3d4, p_end=0xbffff3d0) at tokenizer.c:1466
#3  0x080561b1 in parsetok (tok=0x81ca7b0, g=0x8167980,
start=257, err_ret=0xbffff440, flags=0) at parsetok.c:125
#4  0x0805613c in PyParser_ParseFileFlags (fp=0x816bdb8,
filename=0xbffff7b7 "./bug.py", g=0x8167980, start=257,
ps1=0x0, ps2=0x0, 
    err_ret=0xbffff440, flags=0) at parsetok.c:90
#5  0x080f3926 in PyParser_SimpleParseFileFlags
(fp=0x816bdb8, filename=0xbffff7b7 "./bug.py", start=257,
flags=0)
    at pythonrun.c:1345
#6  0x080f352b in PyRun_FileExFlags (fp=0x816bdb8,
filename=0xbffff7b7 "./bug.py", start=257, globals=0xb7d62e94, 
    locals=0xb7d62e94, closeit=1, flags=0xbffff544) at
pythonrun.c:1239
#7  0x080f22f2 in PyRun_SimpleFileExFlags (fp=0x816bdb8,
filename=0xbffff7b7 "./bug.py", closeit=1, flags=0xbffff544)
    at pythonrun.c:860
#8  0x080f1b16 in PyRun_AnyFileExFlags (fp=0x816bdb8,
filename=0xbffff7b7 "./bug.py", closeit=1, flags=0xbffff544)
    at pythonrun.c:664
#9  0x08055e45 in Py_Main (argc=2, argv=0xbffff5f4) at
main.c:484
#10 0x08055366 in main (argc=2, argv=0xbffff5f4) at python.c:23

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-04-07 16:28

Message:
Logged In: YES 
user_id=89016

The reason for this is the new codec buffering code in 2.4:
The codec might read and decode more data from the byte
stream than is neccessary for decoding one line. I.e. when
reading line n, the codec might decode bytes that belong to
line n+1, n+2 etc. too. If there's a decoding error in this
data, line n gets reported. I don't think there's a simple
fix for this.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1178484&group_id=5470