[Tutor] Exception repeated in a loop
Kent Johnson
kent37 at tds.net
Tue Dec 6 18:00:10 CET 2005
Jan Eden wrote:
>Hi,
>
>I use the following loop to parse some HTML code:
>
>for record in data:
>    try:
>        parser.feed(record['content'])
>    except HTMLParseError, (msg):
>        print "!!!Parsing error in", record['page_id'], ": ", msg
>
>Now after HTMLParser encounters a parse error in one record, it keeps executing the except clause for all following records - why is that?
>
>!!!Parsing error in 8832 : bad end tag: '</em b>', at line 56568, column 1647999
>!!!Parsing error in 8833 : bad end tag: '</em b>', at line 56568, column 1651394
>!!!Parsing error in 8834 : bad end tag: '</em b>', at line 56568, column 1654789
>!!!Parsing error in 8835 : bad end tag: '</em b>', at line 56568, column 1658184
>
The parser processes the input up to the error and never recovers from it.
HTMLParser keeps an internal buffer and a buffer pointer, and the pointer
is not advanced when an error is detected. Each call to feed() appends the
new data to that buffer, tries to parse what is left, and hits the same
error again. Take a look at HTMLParser.goahead() in Lib/HTMLParser.py if
you are interested in the details.
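
To make the mechanism concrete, here is a minimal sketch of a parser with the same kind of sticky buffer. ToyParser, ToyParseError and failing_ids are hypothetical names invented for this illustration, not the real HTMLParser code, and the "parse error" is simply any "!" in the input.

```python
class ToyParseError(Exception):
    """Hypothetical error type for this sketch."""


class ToyParser:
    """Hypothetical stand-in for a parser with a sticky input buffer."""

    def __init__(self):
        self.rawdata = ""

    def reset(self):
        # Throw away any unprocessed input.
        self.rawdata = ""

    def feed(self, data):
        # New data is appended to whatever is still in the buffer.
        self.rawdata += data
        bad = self.rawdata.find("!")
        if bad != -1:
            # Raise without consuming anything: the bad input stays buffered.
            raise ToyParseError("bad character at offset %d" % bad)
        # Only a clean parse empties the buffer.
        self.rawdata = ""


def failing_ids(records, reset_on_error):
    """Return the ids of records whose feed() call raised."""
    parser = ToyParser()
    failed = []
    for page_id, content in records:
        try:
            parser.feed(content)
        except ToyParseError:
            failed.append(page_id)
            if reset_on_error:
                parser.reset()
    return failed


records = [(1, "ok"), (2, "bad!"), (3, "fine"), (4, "good")]
print(failing_ids(records, reset_on_error=False))  # [2, 3, 4]
print(failing_ids(records, reset_on_error=True))   # [2]
```

Without the reset, every record after the bad one fails, which is the pattern in your output. The real HTMLParser also has a reset() method that discards unprocessed data, so calling it in the except clause, or creating a new parser for each record, should have the same effect. I have not tried that against your data.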
IIRC HTMLParser is not noted for handling badly formed HTML. Beautiful
Soup, ElementTidy, or HTML Scraper might be a better choice depending on
what you are trying to do.
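
If you would rather stay with the standard library, one option is to give each record its own parser so that nothing left over from one record can affect the next. The sketch below uses the current html.parser module rather than the older HTMLParser module from your code. TagCollector, tags_per_record and the sample data are hypothetical and only there to make the example runnable.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical subclass that records the start tags it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tags_per_record(data):
    """Parse each record with a fresh parser and return its start tags."""
    result = {}
    for record in data:
        parser = TagCollector()  # new instance, so the buffer starts empty
        parser.feed(record['content'])
        parser.close()           # process anything still buffered
        result[record['page_id']] = parser.tags
    return result


# Made-up sample records; the middle one contains a malformed end tag.
data = [
    {'page_id': 8831, 'content': "<p>hi <em>there</em></p>"},
    {'page_id': 8832, 'content': "<p>bad </em b> tag</p>"},
    {'page_id': 8833, 'content': "<b>ok</b>"},
]
result = tags_per_record(data)
```

Creating a parser per record costs a little more than reusing one, but it keeps the records isolated from each other.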
Kent