HTMLParser not parsing whole html file

josh logan dear.jay.logan at gmail.com
Sun Oct 24 22:14:01 EDT 2010


On Oct 24, 4:38 pm, josh logan <dear.jay.lo... at gmail.com> wrote:
> On Oct 24, 4:36 pm, josh logan <dear.jay.lo... at gmail.com> wrote:
>
> > Hello,
>
> > I wanted to use python to scrub an html file for score data, but I'm
> > having trouble.
> > I'm using HTMLParser, and the parsing seems to fizzle out around line
> > 192 or so. None of the event functions are being called anymore
> > (handle_starttag, handle_endtag, etc.) and I don't understand why,
> > because it is an HTML page of over 1000 lines.
>
> > Could someone tell me if this is a bug or simply a misunderstanding on
> > how HTMLParser works? I'd really appreciate some help in
> > understanding.
>
> > I am using Python 3.1.2 on Windows 7 (hopefully shouldn't matter).
>
> > I put the HTML file on pastebin, because I couldn't think of anywhere
> > better to put it: http://pastebin.com/wu6Pky2W
>
> > The source code has been pared down to the simplest form to exhibit
> > the problem. It is displayed below, and is also on pastebin for
> > download (http://pastebin.com/HxwRTqrr):
>
> > import sys
> > import re
> > import os.path
> > import itertools as it
> > import urllib.request
> > from html.parser import HTMLParser
> > import operator as op
>
> > base_url = 'http://www.dci.org'
>
> > class TestParser(HTMLParser):
>
> >     def handle_starttag(self, tag, attrs):
> >         print('position {}, starting tag {} with attrs {}'.format(
> >             self.getpos(), tag, attrs))
>
> >     def handle_endtag(self, tag):
> >         print('ending tag {}'.format(tag))
>
> > def do_parsing_from_file_stream(fname):
> >     parser = TestParser()
>
> >     with open(fname) as f:
> >         for num, line in enumerate(f, start=1):
> >             # print('Sending line {} through parser'.format(num))
> >             parser.feed(line)
>
> > if __name__ == '__main__':
> >     do_parsing_from_file_stream(sys.argv[1])
>
> Sorry, the group doesn't like how I surrounded the Python code's
> pastebin URL with parentheses:
>
> http://pastebin.com/HxwRTqrr

I found the error. The HTML file I'm parsing has invalid HTML at line
193.
It has something like:

<a href="mystuff "class = "stuff">

Note there is no space between the closing quote of the "href"
attribute's value and the "class" attribute that follows it. I guess
I'll go through each file and correct these issues as I parse them.
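
A minimal sketch of that clean-up step, assuming the only defect is a
missing space right after a double-quoted attribute value. The names
fix_missing_attr_space, StartTagCollector and parse_lines are made up
for illustration, and the regex is a heuristic that can also change
matching text outside of tags:

```python
import re
from html.parser import HTMLParser

# Heuristic (an assumption, not part of HTMLParser): a double-quoted
# attribute value followed directly by a letter, e.g. href="mystuff "class,
# is treated as a missing space between two attributes.
_MISSING_SPACE = re.compile(r'(=\s*"[^"<>]*")(?=[A-Za-z])')


def fix_missing_attr_space(line):
    """Insert a space after a quoted value that runs into the next attribute."""
    return _MISSING_SPACE.sub(r'\1 ', line)


class StartTagCollector(HTMLParser):
    """Hypothetical helper: records (tag, attrs) for every start tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


def parse_lines(lines):
    """Feed repaired lines to the parser and return the start tags it saw."""
    parser = StartTagCollector()
    for line in lines:
        parser.feed(fix_missing_attr_space(line))
    parser.close()  # process anything the parser is still buffering
    return parser.seen
```

In the script above, the feed loop would then call
parser.feed(fix_missing_attr_space(line)) instead of parser.feed(line).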

Thanks for reading, anyways.


