[Python-bugs-list] [ python-Bugs-453059 ] Nasty bug in HTMLParser.py

noreply@sourceforge.net
Mon, 20 Aug 2001 02:26:26 -0700


Bugs item #453059, was opened at 2001-08-19 13:41
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=453059&group_id=5470

Category: Python Library
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Chris Withers (fresh)
Assigned to: Fred L. Drake, Jr. (fdrake)
Summary: Nasty bug in HTMLParser.py

Initial Comment:
If you feed the following string to an HTMLParser
instance, you get _very_ weird results:

'one & two & three &three; &blagh ;'

What I would expect would be:

 - call to handle_data(data='one & two & three ')

 - call to handle_entityref(name='three')

 - call to handle_data(data=' &blagh ;')

What you actually get is:

 - call to handle_data(data='one ')

 - call to handle_data(data='one ')

...which is very wrong :-S
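
A minimal sketch for reproducing the report by recording handler calls in order. It assumes Python 3's html.parser module rather than the 2001 HTMLParser.py under discussion, so the events it prints may differ from both lists above. RecordingParser and record_events are hypothetical names introduced only for this illustration.

```python
from html.parser import HTMLParser


class RecordingParser(HTMLParser):
    """Hypothetical helper: records handler calls in the order they occur."""

    def __init__(self):
        # Assumption: Python 3's html.parser. convert_charrefs=False keeps
        # entity references flowing through handle_entityref() instead of
        # being folded into handle_data().
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))


def record_events(text):
    """Feed text to a fresh parser and return the recorded events."""
    parser = RecordingParser()
    parser.feed(text)
    parser.close()
    return parser.events


if __name__ == "__main__":
    # The string from the report above.
    for event in record_events('one & two & three &three; &blagh ;'):
        print(event)
```

Comparing the printed events against the expected list makes it easy to see where a given parser version diverges.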

Now, I'm not sure of the validity of the associated
HTML*, but if it's invalid, I would have thought
exceptions would be thrown rather than the above result.

In any case, I have a module that demonstrates this
problem which is available from:

http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/squishdot/stripogram/

It has a testsuite that runs with Zope's testrunner.py
and I just added a test to demonstrate this problem.

Any help would be very much appreciated...

Chris

* The string 'one & two & three &three; &blagh ;'
displays exactly as is in Mozilla, IE and Netscape. Of
course, that doesn't mean the W3C will like it ;-) I'd
prefer to go with the majority rather than being
'right' on this one.




----------------------------------------------------------------------

>Comment By: Chris Withers (fresh)
Date: 2001-08-20 02:26

Message:

Here's the patch to fix it:

28,29c28,29
< entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')
< charref = re.compile('&#(?:[0-9]+|[xX][0-9a-fA-F]+)[^0-9a-fA-F]')
---
> entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*);')
> charref = re.compile('&#(?:[0-9]+|[xX][0-9a-fA-F]+);')
213,214d212
<                     if rawdata[k-1] != ';':
<                         k = k-1
222,223d219
<                     if rawdata[k-1] != ';':
<                         k = k-1
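
As a rough illustration of what the regex change in the patch does, the sketch below runs both entityref patterns from the diff against the string from the original report. The pattern text is copied from the patch. The names ENTITYREF_LOOSE, ENTITYREF_STRICT, SAMPLE and entity_names are made up for this example and do not appear in HTMLParser.py.

```python
import re

# Pattern from the '<' side of the diff: a reference ends at any
# non-alphanumeric character.
ENTITYREF_LOOSE = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')

# Pattern from the '>' side of the diff: a reference must end with ';'.
ENTITYREF_STRICT = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*);')

# The string from the original report.
SAMPLE = 'one & two & three &three; &blagh ;'


def entity_names(pattern, text):
    """Return the entity names that pattern finds anywhere in text."""
    return pattern.findall(text)


if __name__ == "__main__":
    print(entity_names(ENTITYREF_LOOSE, SAMPLE))   # ['three', 'blagh']
    print(entity_names(ENTITYREF_STRICT, SAMPLE))  # ['three']
```

With the ';'-terminated pattern, '&blagh ;' is no longer treated as an entity reference, which is consistent with the expected handle_data(data=' &blagh ;') call in the original report. Neither pattern matches a bare '& '.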


----------------------------------------------------------------------
