[Python-bugs-list] [ python-Bugs-500073 ] HTMLParser fail to handle '&foobar'

noreply@sourceforge.net
Tue, 08 Jan 2002 13:03:12 -0800


Bugs item #500073, was opened at 2002-01-06 00:06
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=500073&group_id=5470

Category: Extension Modules
Group: Python 2.1.1
Status: Open
Resolution: None
Priority: 5
Submitted By: Bernard YUE (berniey)
Assigned to: Skip Montanaro (montanaro)
Summary: HTMLParser fail to handle '&foobar'

Initial Comment:
HTMLParser does not distinguish between &foobar; and 
&foobar.  The latter is still treated as a 
charref/entityref.  Below is my proposed fix:

File:  sgmllib.py

# SGMLParser.goahead()
# line 162-176
# from
            elif rawdata[i] == '&':
                match = charref.match(rawdata, i)
                if match:
                    name = match.group(1)
                    self.handle_charref(name)
                    i = match.end(0)
                    if rawdata[i-1] != ';': i = i-1
                    continue
                match = entityref.match(rawdata, i)
                if match:
                    name = match.group(1)
                    self.handle_entityref(name)
                    i = match.end(0)
                    if rawdata[i-1] != ';': i = i-1
                    continue

# to
            elif rawdata[i] == '&':
                match = charref.match(rawdata, i)
                if match:
                    if rawdata[match.end(0)-1] != ';':
                        # not really a charref
                        self.handle_data(rawdata[i])
                        i = i+1
                    else:
                        name = match.group(1)
                        self.handle_charref(name)
                        i = match.end(0)
                    continue
                match = entityref.match(rawdata, i)
                if match:
                    if rawdata[match.end(0)-1] != ';':
                        # not really an entityref
                        self.handle_data(rawdata[i])
                        i = i+1
                    else: 
                        name = match.group(1)
                        self.handle_entityref(name)
                        i = match.end(0)
                    continue
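
For illustration, the behavior the proposed change aims for can be sketched
as a small standalone function.  This is only a sketch, not sgmllib code:
the two patterns are assumed approximations of sgmllib's charref and
entityref expressions, split_refs is a hypothetical name, and a '&' that
matches neither pattern is simply treated as data.

```python
import re

# Assumed approximations of sgmllib's patterns; the real expressions in
# sgmllib.py may differ.  Each one consumes a single terminating character.
charref = re.compile(r'&#([0-9]+)[^0-9]')
entityref = re.compile(r'&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')


def split_refs(rawdata):
    """Return (kind, value) events; hypothetical helper for illustration.

    A reference is reported only when it is terminated by ';'.
    Otherwise the '&' is emitted as ordinary data, as in the proposed fix.
    """
    events = []
    i = 0
    n = len(rawdata)
    while i < n:
        if rawdata[i] == '&':
            for kind, pattern in (('charref', charref),
                                  ('entityref', entityref)):
                match = pattern.match(rawdata, i)
                if match and rawdata[match.end(0) - 1] == ';':
                    events.append((kind, match.group(1)))
                    i = match.end(0)
                    break
            else:
                # No properly terminated reference: pass '&' through as data.
                events.append(('data', '&'))
                i = i + 1
            continue
        # Plain text up to the next '&' (or the end of the input).
        j = rawdata.find('&', i)
        if j < 0:
            j = n
        events.append(('data', rawdata[i:j]))
        i = j
    return events
```

With this sketch, split_refs('&foobar ') yields only data events, while
split_refs('&foobar;') yields a single entityref event.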



----------------------------------------------------------------------

>Comment By: Skip Montanaro (montanaro)
Date: 2002-01-08 13:03

Message:
Logged In: YES 
user_id=44345

Bernie,

I see nothing wrong in principle with recognizing "&nbsp"
when the user should have typed "&nbsp;", but I wonder about
the validity of "&nbsp".  You mentioned it's still a charref or
entityref.  Is that documented somewhere, or is it simply a
practical approach to a common problem?

Thanks,

Skip


----------------------------------------------------------------------
