Help with regular expressions
John J. Lee
jjl at pobox.com
Wed Aug 27 09:35:38 EDT 2003
dmbkiwi <dmbkiwi at yahoo.com> writes:
> On Tue, 26 Aug 2003 08:47:33 +0000, Sybren Stuvel wrote:
[...]
> > You seem to expect old HTML. Why not use XHTML only ('tidy' can
> > convert between them) and use a regular XML parser? Much, much, much
> > easier! And you won't have to be afraid of messing up your regular
> > expressions ;-)
> >
> > Sybren
>
> XML would be nice, but unfortunately I have no choice as to the markup
> language used by the site. It's a website on the world wide web, not a
> site over which I have any control. My regular expressions are at the
> mercy of the developers of that site.

You misunderstand. HTMLTidy (or its descendant, tidylib) reads ugly,
non-conformant HTML and spits out clean, conformant XHTML (or HTML).

uTidylib is a ctypes wrapper of tidylib.

    import tidy
    from cStringIO import StringIO

    # html holds the raw page source as a string
    tidydoc = tidy.parseString(html)
    s = StringIO()
    tidydoc.write(s)  # write the cleaned-up document into the buffer
    tidied_html = s.getvalue()

mxTidy is a wrapper of a shared-library-ized HTMLTidy.

    from mx.Tidy import tidy

    # index 2 of the returned tuple is the tidied output
    tidied_html = tidy(html)[2]

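Once the page has been tidied into well-formed XHTML, an ordinary XML parser can take over from the regular expressions, which is what Sybren was suggesting. The following is only a rough sketch, assuming a modern Python with xml.etree.ElementTree in the standard library and assuming the tidied output contains no named entities the parser does not know (such as &nbsp;). The extract_links helper and the sample document are made up for illustration; in practice the input would be tidied_html.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace, so tag names must be qualified.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def extract_links(tidied_xhtml):
    """Return (href, text) pairs for every <a href="..."> in well-formed XHTML."""
    root = ET.fromstring(tidied_xhtml)
    links = []
    for a in root.iter(XHTML_NS + "a"):
        href = a.get("href")
        if href is not None:  # skip named anchors that have no href
            text = "".join(a.itertext()).strip()
            links.append((href, text))
    return links


# Hypothetical stand-in for the output of one of the tidy calls above.
sample = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Example</title></head>
<body><p><a href="/news">Latest <b>news</b></a> and <a name="top">anchor</a></p></body>
</html>"""

print(extract_links(sample))
```

Nested markup inside the link is handled by the parser, which is exactly the kind of case that tends to break a hand-written regular expression.
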
John