HTMLParser and Quotes
Richard West
rwest2 at opti.cgi.net
Thu Jan 2 16:06:57 EST 2003
Thank you! Fortunately my app is not high volume. This should do
nicely.
-Richard
On Thu, 02 Jan 2003 13:43:04 -0700, Andrew Dalke
<adalke at mindspring.com> wrote:
>Richard Brodie:
>> HTMLParser is a fairly straightforward parser: it mostly follows the SGML
>> syntax rules. That means that it is of little use for most of the HTML out on
>> the web. Whilst a DWIM parser might be useful, it could get out of hand,
>> and I'm fairly happy that the standard library one stops on the first error.
>> In a few years the XML ones will error anyway.
>
>In the meanwhile, you can use something like HTML Tidy
> http://tidy.sourceforge.net/
>and Marc-André Lemburg's Python interface to it, mxTidy
> http://www.lemburg.com/files/python/mxTidy.html
>to clean up input HTML, like this
>
> >>> from mx import Tidy
> >>> from HTMLParser import HTMLParser
> >>> text = """<html>
>... <body>
>... <font face=arial,helvetica>test</font>
>... </body>
>... </html>"""
> >>>
> >>> print Tidy.Tidy.tidy(text)[2]
><!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
><html>
><head>
><title></title>
></head>
><body>
><font face="arial,helvetica">test</font>
></body>
></html>
>
> >>>
> >>> x = HTMLParser()
>
> Andrew
> dalke at dalkescientific.com
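
The quoted session breaks off at `x = HTMLParser()`. As a rough sketch of the parsing step it seems to be leading to, the snippet below feeds the tidied markup (copied from the output above) to a parser subclass and records the attributes it sees. It assumes Python 3, where the module lives at `html.parser` rather than `HTMLParser`; the names `AttrCollector`, `collect_attrs` and `TIDIED` are made up for illustration, and the mxTidy call itself is left out.

```python
# Python 3 location of the parser; the post itself uses the Python 2 module "HTMLParser".
from html.parser import HTMLParser

# Tidied markup, copied from the Tidy output shown in the quoted message.
TIDIED = """<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title></title>
</head>
<body>
<font face="arial,helvetica">test</font>
</body>
</html>
"""


class AttrCollector(HTMLParser):
    """Hypothetical helper: records (tag, attrs) for each start tag that has attributes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already stripped.
        if attrs:
            self.found.append((tag, attrs))


def collect_attrs(markup):
    """Parse markup and return the list of (tag, attrs) pairs that were seen."""
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.found


if __name__ == "__main__":
    print(collect_attrs(TIDIED))
```

Only `handle_starttag` is overridden here; the DOCTYPE, text and end tags go to the base class's default handlers, which do nothing.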