[Python-bugs-list] [ python-Bugs-500073 ] HTMLParser fail to handle '&foobar'

Tue, 08 Jan 2002 16:43:34 -0800

Bugs item #500073, was opened at 2002-01-06 00:06
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=500073&group_id=5470

Category: Extension Modules
Group: Python 2.1.1
Status: Open
Resolution: None
Priority: 5
Submitted By: Bernard YUE (berniey)
Assigned to: Skip Montanaro (montanaro)
Summary: HTMLParser fail to handle '&foobar'

Initial Comment:
HTMLParser does not distinguish between &foobar; and 
&foobar.  The latter is still treated as a 
charref/entityref.  Below is my proposed fix:

File:  sgmllib.py

# SGMLParser.goahead()
# line 162-176
# from
            elif rawdata[i] == '&':
                match = charref.match(rawdata, i)
                if match:
                    name = match.group(1)
                    self.handle_charref(name)
                    i = match.end(0)
                    if rawdata[i-1] != ';': i = i-1
                    continue
                match = entityref.match(rawdata, i)
                if match:
                    name = match.group(1)
                    self.handle_entityref(name)
                    i = match.end(0)
                    if rawdata[i-1] != ';': i = i-1
                    continue

# to
            elif rawdata[i] == '&':
                match = charref.match(rawdata, i)
                if match:
                    if rawdata[match.end(0)-1] != ';':
                        # not really a charref
                        self.handle_data(rawdata[i])
                        i = i+1
                    else:
                        name = match.group(1)
                        self.handle_charref(name)
                        i = match.end(0)
                    continue
                match = entityref.match(rawdata, i)
                if match:
                    if rawdata[match.end(0)-1] != ';':
                        # not really an entityref
                        self.handle_data(rawdata[i])
                        i = i+1
                    else: 
                        name = match.group(1)
                        self.handle_entityref(name)
                        i = match.end(0)
                    continue
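
For illustration only, here is a self-contained sketch of the rule proposed
above: a reference counts as a charref or entityref only when it ends in ';',
and otherwise the '&' is passed through as plain data. The tokenize() function
and the two patterns are hypothetical stand-ins written for current Python;
they are assumptions, not code taken from sgmllib.py.

```python
import re

# Simplified stand-in patterns (assumptions, not copied from sgmllib.py):
# a reference name followed by exactly one terminating character.
CHARREF = re.compile(r'&#([0-9]+)[^0-9]')
ENTITYREF = re.compile(r'&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')


def tokenize(rawdata):
    """Split rawdata into (kind, value) tuples, where kind is 'data',
    'charref' or 'entityref'. A reference is accepted only if it ends in ';'."""
    tokens = []
    i, n = 0, len(rawdata)
    while i < n:
        if rawdata[i] == '&':
            for kind, pattern in (('charref', CHARREF), ('entityref', ENTITYREF)):
                match = pattern.match(rawdata, i)
                if match and rawdata[match.end(0) - 1] == ';':
                    tokens.append((kind, match.group(1)))
                    i = match.end(0)
                    break
            else:
                # No match, or no terminating ';': emit '&' as plain text.
                tokens.append(('data', '&'))
                i += 1
            continue
        # Plain text runs up to the next '&' (or the end of the input).
        j = rawdata.find('&', i)
        if j < 0:
            j = n
        tokens.append(('data', rawdata[i:j]))
        i = j
    return tokens


print(tokenize('Me&you! &copy; &#169;'))
```

With this rule, '&you!' stays in the data stream, while '&copy;' and '&#169;'
are still reported as an entityref and a charref respectively.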

----------------------------------------------------------------------

>Comment By: Bernard YUE (berniey)
Date: 2002-01-08 16:43

Message:
Logged In: YES 
user_id=419276

Hi Martin and Skip,

Sorry for not explaining myself clearly.  What I mean is that &foobar 
should be treated literally as the text '&foobar', while 
&foobar; should be an entityref and &#foobar; a charref.

Currently, sgmllib treats &foobar as an entityref and &#foobar as a 
charref, and looks them up in the entityref and charref tables.  
It ignores the entity when no match is found.
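
The behaviour described above can be reproduced in a small standalone sketch.
The pattern, the ENTITYDEFS table and the lenient_replace() function are
assumptions made for illustration, not sgmllib's actual code: the pattern
accepts any non-alphanumeric character as a terminator, and a name that is
not in the table is silently dropped.

```python
import re

# Stand-in pattern (an assumption): the name may be terminated by ANY
# non-alphanumeric character, not only by ';'.
ENTITYREF = re.compile(r'&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')

# Hypothetical entity table, deliberately tiny.
ENTITYDEFS = {'amp': '&', 'lt': '<', 'gt': '>', 'copy': '\xa9'}


def lenient_replace(text):
    """Replace entity references without requiring a trailing ';'.
    Unknown names are dropped, mirroring the behaviour being reported."""
    out = []
    i = 0
    while i < len(text):
        match = ENTITYREF.match(text, i) if text[i] == '&' else None
        if match is None:
            out.append(text[i])
            i += 1
            continue
        out.append(ENTITYDEFS.get(match.group(1), ''))
        i = match.end(0)
        if text[i - 1] != ';':
            # The terminator was not ';', so give it back to the data stream.
            i -= 1
    return ''.join(out)


print(lenient_replace('Me&you! Me & You!'))  # -> Me! Me & You!
```

Here '&you' is taken as a reference to an unknown entity and disappears,
which is how 'Me&you!' ends up as 'Me!'.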

My suggested change should fix this problem.  Run test.py 
(test.py and test.html attached)

>./test.py

Me! Me & You! Copyright@copy;abc Copyright©abc © ©

But we are expecting:
Me&you! Me & You! Copyright@copy;abc Copyright©abc © ©

My suggested change will print the expected output.

# test.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
        "http://www.w3c.org/TR/html4/strict.dtd">

<html>
<head dir="ltr" lang="en">
  <TITLE>Testing Page</TITLE>
  <META name="AUTHOR" content="Bernard Yue">
  <META name="DESCRIPTION" content="Testing Page">
</head>
<body>
  <p>Me&you!  Me & You! Copyright@copy;abc 
Copyright&#169;abc &copy; &#169;
  </p>
</body>
</html>

# test.py
#!/usr/bin/env python

from htmllib import HTMLParser
from formatter import AbstractFormatter, DumbWriter

def test():
    _formatter = AbstractFormatter( DumbWriter())
    _parser = HTMLParser( _formatter)
    _f = open( './test.html')

    _parser.feed( _f.read())
    _f.close()
    _parser.close()
    print ''

if __name__ == '__main__':
    test()

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2002-01-08 14:02

Message:
Logged In: YES 
user_id=21627

I fail to see the problem as well. Please attach an example
document to this report. Without a detailed analysis of the
problem in question, there is zero chance that any change
like this is accepted.

Here is my analysis from your report: It seems that you
complain that sgmllib, when it sees an ill-formed document,
behaves in a particular way, whereas you expect it to behave
in a different way. Since the document is ill-formed
anyways, any behaviour is as good as any other.

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2002-01-08 13:03

Message:
Logged In: YES 
user_id=44345

Bernie,

I see nothing wrong in principle with recognizing "&nbsp"
when the user should have typed "&nbsp;", but I wonder about
the validity of "&nbsp".  You mentioned it's still a charref or
entityref.  Is that documented somewhere or is it simply a
practical approach to a common problem?

Thanks,

Skip
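
As a point of comparison for the question above, and not something available
to the code under discussion: the standard library's html.unescape() (added
in Python 3.4) is documented as following the HTML 5 rules for both valid and
invalid character references. Under those rules a fixed set of legacy names
such as "nbsp" is accepted without the trailing ';', while an unknown name is
left as literal text. The sample strings below are chosen for illustration.

```python
import html

# '&nbsp' is one of the legacy names accepted without ';'.
# '&you' is not a known name, so it is left untouched.
samples = ['&nbsp', '&nbsp;', 'Me&you!', 'Me & You!']
decoded = [html.unescape(s) for s in samples]

for before, after in zip(samples, decoded):
    print(repr(before), '->', repr(after))
```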

----------------------------------------------------------------------
