HTMLParser bug ?

Anand Pillai pythonguy at Hotpop.com
Fri May 9 02:19:34 EDT 2003


I do agree with you. But my requirement is a really
robust parser which does not fail even if the html code
contains some invalid HTML. I have seen many pages with
this kind of code (my own homepage for example ;-)). My
program should not fail if it encounters such a page.

Thanks for the suggestion, but I think I will modify the
HTMLParser code for my purpose rather than clean the HTML with
another module. A separate cleanup step would take time and
slow down the downloading.

Anand Pillai
http://members.fortunecity.com/anandpillai


Grzegorz Adam Hankiewicz <gradha at titanium.sabren.com> wrote in message news:<mailman.1052422394.29532.python-list at python.org>...
> On 2003-05-08, Anand B Pillai <abpillai at lycos.com> wrote:
> > I am developing a web spider program in pure python.  I am using
> > the HTMLParser module in the python standard distribution. (The
> > stand-alone HTMLParser, not the htmllib.HTMLParser)
> > 
> > I have found some bugs with this module.  Here is a very simple
> > one.  For the following html data, [...]
> 
> Using w3's validator:
> """
> This page is not Valid HTML 4.01 Transitional!
> 
>    Below are the results of attempting to parse this document with
>    an SGML parser.
> 
>     Line 8, column 18: character "," not allowed in attribute
>     specification list (explain...).
> 
>    <font face="Arial", size=5>Paragraph 1</font>
> """
> > HTMLParser gives the following error.  "malformed start tag, at
> > line 8, column 19" [...]  This rendered many webpages faulty for
> > my spider program.  So I have made the following modification in
> > HTMLParser and it works.
> 
> Of course, but since it seems it is malformed HTML you might as
> well correct the HTML. If you can't, and still must process that
> HTML, please google for mxTidy. It's an html cleanup module you
> can use on the input data before doing your processing. So far,
> HTMLParser has not given me any problems with `tidied' data.



