HTMLParser handler_starttag misses lots of tags!

Robert Brewer fumanchu at amor.org
Fri Nov 21 22:53:44 EST 2003


When I try it like this:

import HTMLParser

class HP(HTMLParser.HTMLParser):

    def handle_starttag(self, tag, data):
        print "tag is %s." % (tag)

    def handle_comment(self, data):
        print "caught a comment: %s." % (data)

    def handle_data(self, data):
        if "IP" in data:
            print "Caught %s." % data

toParse = """
<html>

<head>
	<meta http-equiv="content-type"
content="text/html;charset=ISO-8859-1">
	<meta name="generator" content="Adobe GoLive 5">
------- 8< Much html snipped here for this email ---------
</TABLE>
</form>
</body>

</html>"""

for line in toParse.split(u'\n'):
    HP().feed(line)



I get:

tag is html.
tag is head.
tag is meta.
tag is meta.
tag is meta.
tag is meta.
tag is meta.
tag is title.
tag is link.
tag is script.
tag is body.
tag is form.
tag is table.
tag is tr.
tag is td.
tag is h1.
caught a comment:  RULE //.
tag is tr.
tag is td.
tag is img.
caught a comment:  END RULE //.
tag is tr.
tag is td.
tag is b.
tag is td.
tag is tr.
tag is td.
tag is b.
tag is td.
caught a comment:  RULE //.
tag is tr.
tag is td.
tag is img.
caught a comment:  END RULE //.
tag is tr.
tag is td.
tag is span.
tag is tr.
tag is td.
tag is b.
tag is td.
tag is tr.
tag is td.
tag is b.
Caught IP Address .
tag is td.
tag is tr.
tag is td.
tag is b.
tag is td.
tag is tr.
tag is td.
tag is b.
Caught IP Subnet Mask .
tag is td.
tag is tr.
tag is td.
tag is b.
tag is td.
tag is tr.
tag is td.
tag is b.
tag is td.
caught a comment:  RULE //.
tag is tr.
tag is td.
tag is img.
caught a comment:  END RULE //.
tag is tr.
tag is td.
tag is span.
tag is tr.
tag is td.
tag is b.
tag is td.
tag is tr.
tag is td.
tag is b.
Caught IP Address .
tag is td.
tag is tr.
tag is td.
tag is b.
tag is td.
tag is tr.
tag is td.
tag is b.
Caught IP Subnet Mask .
tag is td.
tag is table.
tag is tr.
tag is td.
tag is img.
tag is tr.
tag is td.
tag is span.
tag is table.
tag is tr.
tag is td.
tag is b.
tag is td.
tag is table.
tag is td.
tag is b.
tag is td.
tag is td.
tag is b.
tag is td.
tag is td.
tag is b.
tag is td.
tag is table.
tag is tr.
tag is td.
tag is img.
tag is tr.
tag is td.
tag is input.
tag is input.

My guess is that the problem lies in your line-separation logic, not in
HTMLParser. IIRC, open() doesn't split by line automatically. Note that
this doesn't answer the entire question. My other guess is that once
HTMLParser encounters the form tag, it treats everything inside the form
(even other tags) as data to be consumed by handle_data(), and that it
might stop once it reaches the closing form tag. Either re-feed() the
remaining text or get the line-splitting right.
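
One way to take the line-splitting out of the picture is to hand the whole
document to a single parser instance and call close() at the end, since
feed() buffers incomplete input until more text arrives or close() is
called. The following is only a rough sketch, not a diagnosis: TagCollector,
collect_tags and sample are made-up names for illustration, the sample
markup is a cut-down stand-in for the real router page, and the import
fallback is there only so the same code also runs on Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3 location (assumption: newer interpreter)
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 location, as in the code above


class TagCollector(HTMLParser):
    """Hypothetical helper: records every start tag instead of printing it."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def collect_tags(markup):
    """Parse the whole string with one parser and return the start tags seen."""
    parser = TagCollector()  # one instance for the entire document
    parser.feed(markup)      # no manual line splitting
    parser.close()           # process anything still sitting in the buffer
    return parser.tags


# Cut-down stand-in for the router page; note the meta tag spans two lines.
sample = (
    '<html><head><meta http-equiv="content-type"\n'
    'content="text/html;charset=ISO-8859-1"></head>'
    '<body><form><table><tr><td><b>IP Address</b></td></tr></table>'
    '</form></body></html>'
)
```

If collect_tags(open('routerstatus.html').read()) reports the tags that were
missing before, the way the text was being fed is the likely culprit; if not,
the markup itself deserves a closer look.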

Just some out-loud thoughts.


Robert Brewer
MIS
Amor Ministries
fumanchu at amor.org

> -----Original Message-----
> From: Matthew Wilson [mailto:mwilson at sarcastic-horse.com] 
> Sent: Friday, November 21, 2003 6:44 PM
> To: python-list at python.org
> Subject: HTMLParser handler_starttag misses lots of tags!
> 
> 
> I want to parse an html file and extract my router's IP address.  I
> wrote this code and I have python 2.3 installed:
> 
> #! /usr/bin/env python
> 
> import HTMLParser
> 
> class HP(HTMLParser.HTMLParser):
> 
>     def handle_starttag(self, tag, data):
>         print "tag is %s." % (tag)
> 
>     def handle_comment(self, data):
>         print "caught a comment: %s." % (data)
> 
>     def handle_data(self, data):
>         if "IP" in data:
>             print "Caught %s." % data
> 
> hp = HP()
> out = open('routerstatus.html')
> for line in out:
>     hp.feed(line)
> 
> 




