[Python-bugs-list] [ python-Bugs-500073 ] HTMLParser fail to handle '&foobar'
noreply@sourceforge.net
Tue, 08 Jan 2002 20:42:39 -0800
Bugs item #500073, was opened at 2002-01-06 00:06
You can respond by visiting:
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=500073&group_id=5470
Category: Extension Modules
Group: Python 2.1.1
Status: Open
Resolution: None
Priority: 5
Submitted By: Bernard YUE (berniey)
>Assigned to: Fred L. Drake, Jr. (fdrake)
Summary: HTMLParser fail to handle '&foobar'
Initial Comment:
HTMLParser does not distinguish between &foobar; and
&foobar. The latter is still treated as a
charref/entityref. Below is my proposed fix:
File: sgmllib.py
# SGMLParser.goahead()
# line 162-176
# from
elif rawdata[i] == '&':
    match = charref.match(rawdata, i)
    if match:
        name = match.group(1)
        self.handle_charref(name)
        i = match.end(0)
        if rawdata[i-1] != ';': i = i-1
        continue
    match = entityref.match(rawdata, i)
    if match:
        name = match.group(1)
        self.handle_entityref(name)
        i = match.end(0)
        if rawdata[i-1] != ';': i = i-1
        continue
# to
elif rawdata[i] == '&':
    match = charref.match(rawdata, i)
    if match:
        if rawdata[match.end(0)-1] != ';':
            # not really a charref: pass the '&' through as data
            self.handle_data(rawdata[i])
            i = i+1
        else:
            name = match.group(1)
            self.handle_charref(name)
            i = match.end(0)
        continue
    match = entityref.match(rawdata, i)
    if match:
        if rawdata[match.end(0)-1] != ';':
            # not really an entityref: pass the '&' through as data
            self.handle_data(rawdata[i])
            i = i+1
        else:
            name = match.group(1)
            self.handle_entityref(name)
            i = match.end(0)
        continue
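
For readers without sgmllib at hand, the intent of the change can be sketched as a small standalone tokenizer. This is only an illustration under stated assumptions, not the sgmllib code: the names CHARREF, ENTITYREF and tokenize are hypothetical, the regular expressions are simplified stand-ins for the module's own patterns, and events are returned as tuples instead of being dispatched to handler methods.

```python
import re

# Hypothetical, simplified patterns (assumptions, not sgmllib's own regexes).
# The trailing ';' is captured separately so the caller can insist on it.
CHARREF = re.compile(r'&#([0-9]+)(;?)')
ENTITYREF = re.compile(r'&([a-zA-Z][-.a-zA-Z0-9]*)(;?)')


def tokenize(rawdata):
    """Split rawdata into (kind, value) events.

    kind is 'data', 'charref' or 'entityref'. A reference is only
    recognised when it is terminated by ';'; otherwise the '&' is
    emitted as plain data, which is the behaviour the patch proposes.
    """
    events = []
    i = 0
    n = len(rawdata)
    while i < n:
        if rawdata[i] == '&':
            for kind, pattern in (('charref', CHARREF),
                                  ('entityref', ENTITYREF)):
                match = pattern.match(rawdata, i)
                if match and match.group(2) == ';':
                    events.append((kind, match.group(1)))
                    i = match.end(0)
                    break
            else:
                # No ';'-terminated reference starts here: keep '&' as text.
                events.append(('data', '&'))
                i += 1
            continue
        # Plain text up to the next '&' (or the end of the input).
        j = rawdata.find('&', i)
        if j < 0:
            j = n
        events.append(('data', rawdata[i:j]))
        i = j
    return events


if __name__ == '__main__':
    for event in tokenize('Me&you! Me &amp; You! &#169;'):
        print(event)
```

With this rule "&you" stays in the text stream as three data events ('Me', '&', 'you! Me '), while "&amp;" and "&#169;" are reported as an entityref and a charref respectively.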
----------------------------------------------------------------------
>Comment By: Guido van Rossum (gvanrossum)
Date: 2002-01-08 20:42
Message:
Logged In: YES
user_id=6380
I'm reassigning this to Fred.
In 2.2, the new HTMLParser may or may not still have this
problem.
In 2.1.2, I think that "fixing" it would be too big a risk
of breaking existing code, so I think it should not be
fixed.
----------------------------------------------------------------------
Comment By: Skip Montanaro (montanaro)
Date: 2002-01-08 20:33
Message:
Logged In: YES
user_id=44345
Bernie,
I tried your patch. It looks good to me. I was a tad confused
when I first read your bug report. I thought you were suggesting
that "&foo" be interpreted as a charref/entityref. Instead you
are tightening up the parser.
That seems reasonable to me. Martin, what do you think?
Skip
----------------------------------------------------------------------
Comment By: Bernard YUE (berniey)
Date: 2002-01-08 17:04
Message:
Logged In: YES
user_id=419276
Hi again,
I just ran test.html through W3C's HTML validator. &you is
indeed treated as an invalid entityref in HTML 4.01. I've displayed
test.html under IE, Netscape and Konqueror and they all gave the
result I expected. I am not sure whether sgmllib.py should stick with
the standard or go with the general de facto interpretation.
But I think it is more sensible to treat &you as text.
Bernie
----------------------------------------------------------------------
Comment By: Bernard YUE (berniey)
Date: 2002-01-08 16:43
Message:
Logged In: YES
user_id=419276
Hi Martin and Skip,
Sorry for not explaining myself clearly. What I mean is that &foobar
should be treated as '&foobar' literally (i.e. text), while
&foobar; should be an entityref and &#foobar; a charref.
Currently, sgmllib treats &foobar as an entityref and &#foobar as a
charref and matches them against the entityref and charref tables.
It ignores the entity when no match is found.
My suggested change should fix this problem. Run test.py
(test.py and test.html attached)
>./test.py
Me! Me & You! Copyright@copy;abc Copyright©abc © ©
But we are expecting:
Me&you! Me & You! Copyright@copy;abc Copyright©abc © ©
My suggested change will print the expected output.
# test.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3c.org/TR/html4/strict.dtd">
<html>
<head dir="ltr" lang="en">
<TITLE>Testing Page</TITLE>
<META name="AUTHOR" content="Bernard Yue">
<META name="DESCRIPTION" content="Testing Page">
</head>
<body>
<p>Me&you! Me & You! Copyright@copy;abc
Copyright©abc © ©
</p>
</body>
</html>
# test.py
#!/usr/bin/env python
from htmllib import HTMLParser
from formatter import AbstractFormatter, DumbWriter

def test():
    _formatter = AbstractFormatter(DumbWriter())
    _parser = HTMLParser(_formatter)
    _f = open('./test.html')
    _parser.feed(_f.read())
    _f.close()
    _parser.close()
    print ''

if __name__ == '__main__':
    test()
----------------------------------------------------------------------
Comment By: Martin v. Löwis (loewis)
Date: 2002-01-08 14:02
Message:
Logged In: YES
user_id=21627
I fail to see the problem as well. Please attach an example
document to this report. Without a detailed analysis of the
problem in question, there is zero chance that any change
like this is accepted.
Here is my analysis from your report: It seems that you
complain that sgmllib, when it sees an ill-formed document,
behaves in a particular way, whereas you expect it to behave
in a different way. Since the document is ill-formed
anyway, any behaviour is as good as any other.
----------------------------------------------------------------------
Comment By: Skip Montanaro (montanaro)
Date: 2002-01-08 13:03
Message:
Logged In: YES
user_id=44345
Bernie,
I see nothing wrong in principle with recognizing " "
when the user should have typed " ", but I wonder about
the validity of " ". You mentioned it's still a charref or
entityref. Is that documented somewhere or is it simply a
practical approach to a common problem?
Thanks,
Skip
----------------------------------------------------------------------