[Tutor] Exception repeated in a loop
kent37 at tds.net
Tue Dec 6 18:00:10 CET 2005
Jan Eden wrote:
>I use the following loop to parse some HTML code:
>for record in data:
> except HTMLParseError, (msg):
> print "!!!Parsing error in", record['page_id'], ": ", msg
>Now, after HTMLParser encounters a parse error in one record, it executes the except clause for every following record as well. Why is that?
>!!!Parsing error in 8832 : bad end tag: '</em b>', at line 56568, column 1647999
>!!!Parsing error in 8833 : bad end tag: '</em b>', at line 56568, column 1651394
>!!!Parsing error in 8834 : bad end tag: '</em b>', at line 56568, column 1654789
>!!!Parsing error in 8835 : bad end tag: '</em b>', at line 56568, column 1658184
The parser processes up to the error. It never recovers from the error.
HTMLParser has an internal buffer and buffer pointer that is never
advanced when an error is detected; each time you call feed() it tries
to parse the remaining data and gets the same error again. Take a look
at HTMLParser.goahead() in Lib/HTMLParser.py if you are interested in
the details.
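Here is a rough sketch of that stuck-buffer behaviour and one possible
workaround: create a fresh parser for each record so that a failure in one
record cannot leak into the next. ToyParser, ParseError, parse_all and the
record key 'content' are all made up for illustration; ToyParser is not the
real HTMLParser, it only imitates a buffer that is never advanced past an
error.

```python
class ParseError(Exception):
    """Hypothetical stand-in for HTMLParseError."""


class ToyParser:
    """Toy parser that imitates the buffering described above.

    Not the real HTMLParser: it keeps unparsed input in a buffer and
    refuses to move past the marker "<bad>".
    """

    def __init__(self):
        self.rawdata = ""   # unparsed input accumulates here
        self.parsed = []    # chunks that were consumed successfully

    def feed(self, data):
        self.rawdata += data
        if "<bad>" in self.rawdata:
            # The buffer is left untouched, so the next feed() hits it again.
            raise ParseError("bad tag in buffer")
        self.parsed.append(self.rawdata)
        self.rawdata = ""


def parse_all(records, fresh_parser_per_record):
    """Return the page_ids of the records for which feed() raised."""
    failed = []
    parser = ToyParser()
    for record in records:
        if fresh_parser_per_record:
            parser = ToyParser()  # throw away any stale buffer
        try:
            parser.feed(record["content"])  # 'content' is an assumed key
        except ParseError:
            failed.append(record["page_id"])
    return failed


records = [
    {"page_id": 8831, "content": "<p>ok</p>"},
    {"page_id": 8832, "content": "<bad>"},
    {"page_id": 8833, "content": "<p>ok</p>"},
    {"page_id": 8834, "content": "<p>ok</p>"},
]

shared = parse_all(records, fresh_parser_per_record=False)
fresh = parse_all(records, fresh_parser_per_record=True)
print(shared)  # [8832, 8833, 8834]
print(fresh)   # [8832]
```

With one shared parser, every record after the bad one fails, as in the
output you posted; with a new parser per record only the bad record is
reported. The real HTMLParser also has a reset() method that discards
unprocessed data, which may work in place of creating a new instance.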
IIRC HTMLParser is not noted for handling badly formed HTML. Beautiful
Soup, ElementTidy, or HTML Scraper might be a better choice depending on
what you are trying to do.