Turning HTMLParser into an iterator
Stefan Behnel
stefan_ml at behnel.de
Mon Jun 1 02:38:52 EDT 2009
samwyse wrote:
> I'm processing some potentially large datasets stored as HTML. I've
> subclassed HTMLParser so that handle_endtag() accumulates data into a
> list, which I can then fetch when everything's done. I'd prefer,
> however, to have handle_endtag() somehow yield values while the input
> data is still streaming in. I'm sure someone's done something like
> this before, but I can't figure it out. Can anyone help? Thanks.
If you can afford to step away from HTMLParser, you could give lxml a try.
Its iterparse() function supports HTML parsing and hands you elements
incrementally while the input is still being read.
http://codespeak.net/lxml/parsing.html#iterparse-and-iterwalk
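
A minimal sketch of what that could look like, assuming lxml is installed
and that the values of interest live in <td> elements. The helper name
iter_cell_text, the choice of tag, and the sample markup are made up for
illustration and are not taken from the original question.

```python
import io

from lxml import etree


def iter_cell_text(source, tag="td"):
    """Yield the text of each matching element once its end tag is parsed.

    `source` is a filename or a binary file-like object. `tag` is an
    assumption about where the interesting data lives.
    """
    # html=True switches iterparse() to lxml's HTML parser.
    for _event, elem in etree.iterparse(
        source, events=("end",), tag=tag, html=True
    ):
        yield "".join(elem.itertext()).strip()
        # Drop the element's content so a large input does not pile up
        # in memory while we keep iterating.
        elem.clear()


# Hypothetical sample input, standing in for a large HTML file.
sample = io.BytesIO(
    b"<html><body><table><tr><td>spam</td><td>eggs</td></tr></table></body></html>"
)
cells = list(iter_cell_text(sample))

for value in cells:
    print(value)
```

Because iter_cell_text() is a generator, the caller can consume values in
a plain for loop as the parser advances, rather than collecting everything
in a list first.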
Stefan