Help with regular expressions

John J. Lee jjl at pobox.com
Wed Aug 27 09:35:38 EDT 2003


dmbkiwi <dmbkiwi at yahoo.com> writes:
> On Tue, 26 Aug 2003 08:47:33 +0000, Sybren Stuvel wrote:
[...]
> > You seem to expect old HTML. Why not use XHTML only ('tidy' can
> > convert between them) and use a regular XML parser? Much, much, much
> > easier! And you won't have to be afraid of messing up your regular
> > expressions ;-)
> > 
> > Sybren
> 
> XML would be nice, but unfortunately I have no choice as to the markup
> language used by the site.  It's a website on the world wide web, not a
> site over which I have any control.  My regular expressions are at the
> mercy of the developers of that site.

You misunderstand.  You don't need any control over the site: run the
page you download through HTMLTidy (or its descendant, tidylib), which
reads ugly, non-conformant HTML and spits out clean, conformant XHTML
(or HTML).

uTidylib is a ctypes wrapper of tidylib.

 import tidy
 from cStringIO import StringIO
 # html is the raw page source, as a string
 tidydoc = tidy.parseString(html)
 s = StringIO()
 tidydoc.write(s)  # write the cleaned-up document into the buffer
 tidied_html = s.getvalue()


mxTidy is a wrapper of a shared-library-ized HTMLTidy.

 from mx.Tidy import tidy
 # tidy() returns a tuple; item 2 is the cleaned-up markup
 tidied_html = tidy(html)[2]
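
Once the markup is well-formed XHTML, an ordinary XML parser can take
over from the regular expressions, as Sybren suggested.  Here is a rough
sketch of that step, not a definitive recipe.  It assumes the tidied
output is XHTML in the usual XHTML namespace, and it uses
xml.etree.ElementTree from the standard library of current Python
versions.  SAMPLE_XHTML and extract_links are made-up names for
illustration; the sample string only stands in for tidied_html, and real
tidy output will look different.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for tidied_html; not real tidy output.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Example</title></head>
<body>
<p>See <a href="http://example.com/a">first</a> and
<a href="http://example.com/b">second</a>.</p>
</body>
</html>"""

# ElementTree spells namespaced tags as "{namespace-uri}localname".
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def extract_links(xhtml_text):
    """Return (href, link text) pairs for every <a> element with an href."""
    root = ET.fromstring(xhtml_text)
    links = []
    for anchor in root.iter(XHTML_NS + "a"):
        href = anchor.get("href")
        if href is not None:
            links.append((href, (anchor.text or "").strip()))
    return links


if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_XHTML):
        print(href, text)
```

If the real page uses named entities such as &nbsp;, the XML parser may
reject them; tidy's numeric-entities option is one way around that.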


John



