[Python-Dev] HTMLParser patches

john paulson munch@acm.org
Mon, 27 Jan 2003 14:19:24 -0800


I've submitted two patches, for HTMLParser.py and
test_htmlparser.py.  They fix two problems I hit
lexing some HTML pages I found in the wild.

1. Allow "," in unquoted attribute values
    A page had the attribute "color=rgb(1,2,3)",
    and the parser choked on the ",".  I added
    "," to the list of allowed characters.

2. More robust <SCRIPT> processing.
    The eBay homepage has unprotected JavaScript
    including the line 'vb += "</SCR"+"IPT>"'.  The
    parser choked on that line.  I modified the
    source to use a more robust regex for script
    and style end tags.  A side effect of this is that
    any "<!--" .. "-->" within a script/style will
    be parsed as a comment.  If that behavior is
    incorrect, the regex can be modified.
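
As a rough illustration of both changes, here is a minimal
Python sketch.  The patterns and the helper names
(attr_pattern, parse_attr, split_script_body) are made up
for this example; they are not the ones in HTMLParser.py or
in the patches, and the sketch does not model the comment
handling described in item 2.

```python
import re

# Hypothetical character classes for unquoted attribute values.
# They differ only in whether "," is allowed.
UNQUOTED_STRICT = r"[-a-zA-Z0-9./:;+*%?!&$()_#=~]+"
UNQUOTED_WITH_COMMA = r"[-a-zA-Z0-9./,:;+*%?!&$()_#=~]+"


def attr_pattern(value_chars):
    """Build a name=value matcher for an unquoted attribute value."""
    return re.compile(
        r"\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)\s*=\s*(" + value_chars + r")"
    )


def parse_attr(pattern, text):
    """Return (name, value) for the attribute at the start of text, or None."""
    m = pattern.match(text)
    return (m.group(1), m.group(2)) if m else None


# Assumed end-tag pattern: only a complete </script> or </style> tag
# (any case, optional whitespace) ends the element, so a fragment such
# as "</SCR" inside a string literal is left alone.
SCRIPT_END = re.compile(r"</\s*(script|style)\s*>", re.IGNORECASE)


def split_script_body(text):
    """Split text following an opening <script>/<style> tag into
    (element body, remainder after the end tag)."""
    m = SCRIPT_END.search(text)
    if m is None:
        return text, ""
    return text[:m.start()], text[m.end():]


strict = attr_pattern(UNQUOTED_STRICT)
lenient = attr_pattern(UNQUOTED_WITH_COMMA)

# Without "," in the class the value is cut short at the first comma.
truncated = parse_attr(strict, "color=rgb(1,2,3)")
complete = parse_attr(lenient, "color=rgb(1,2,3)")

body, rest = split_script_body('vb += "</SCR"+"IPT>";</SCRIPT><p>after')
```

With the strict class the value stops at the first ",";
with the comma added the whole "rgb(1,2,3)" is captured.
In the script case the split happens at the real
"</SCRIPT>" rather than at the "</SCR" fragment.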