How do I get to *all* of the groups of an re search?

Daniel Yoo dyoo at hkn.eecs.berkeley.edu
Fri Jan 10 16:55:54 EST 2003


Kyler Laird <Kyler at news.lairds.org> wrote:

:>    Get an HTML parser--then be ready to
:>    tweak it to accept all the junk that roams
:>    around in the wild.

: Exactly.  I think I've thrown up my hands most times I've
: attempted to use an HTML parser.  I considered it for this
: task but after thinking about it for a while I decided that an
: RE would be far more elegant.

Hi Kyler,

I know this isn't quite addressing your question, but have you seen
HTML-Tidy?

    http://www.w3.org/People/Raggett/tidy/

This utility can impose a consistent structure on even badly formed
HTML, so that you can more easily run 'sgmllib' or another HTML parser
over the tidied output.  Of course, it's not perfect, but it works
admirably well.
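
To make the "use a parser instead of an RE" idea concrete, here is a
minimal sketch of pulling every link out of some sloppy markup.  It is
only an illustration, under a few assumptions: it uses the standard
library's 'html.parser' module (the Python 3 stand-in for 'sgmllib',
which no longer ships with Python), it does not call HTML-Tidy at all,
and the names LinkCollector, collect_links and the sample string are
made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us, and
        # hands over attrs as a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(html_text):
    """Return the href values found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Deliberately messy input: an unclosed <p>, an uppercase tag, and an
# unquoted attribute value.
sample = '<p>See <a href="a.html">one<p><A HREF=b.html>two</a>'
print(collect_links(sample))
```

Every match ends up in the list, however many there are, which is the
part that is awkward to get from the groups of a single RE.  For really
broken pages, running them through Tidy first should still give the
parser an easier time.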

There is a Python interface to HTML-Tidy by the author of
mxTextTools:

    http://www.lemburg.com/files/python/mxTidy.html


Good luck to you!



