Tidy HTML, was: "<!" in SGMLParser - an error ?

Walter Dörwald walter at livinglogic.de
Thu Nov 15 08:48:24 EST 2001


Hernan M. Foffani wrote:


> With Python it is soooo easy to grab and extract data from remote
> pages that it gets really annoying when such pages aren't valid HTML.
> 
> It's unfair to require that htmllib & co. parse invalid HTML, though.
> This problem can be solved with a simple routine that calls tidy
> through a pipe before calling the parser.


Better yet, use Marc-André Lemburg's mxTidy, which wraps tidy as a
Python extension so no external process or pipe is needed:
http://www.lemburg.com/files/python/mxTidy.html
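
For comparison, the pipe approach Hernan describes might look roughly
like the sketch below. It assumes a `tidy` executable is on the PATH
and uses the modern `subprocess` module; the function name `tidy_html`
and its `command` parameter are made up for illustration and are not
part of htmllib or mxTidy.

```python
import subprocess


def tidy_html(html, command=("tidy", "-q", "-asxhtml", "-utf8")):
    """Pipe an HTML string through an external filter and return its output.

    `command` defaults to an assumed tidy invocation; adjust the flags
    to whatever your tidy version supports.
    """
    result = subprocess.run(
        list(command),
        input=html.encode("utf-8"),   # sent to the child's stdin
        stdout=subprocess.PIPE,       # cleaned markup comes back here
        stderr=subprocess.PIPE,       # diagnostics are collected separately
    )
    # tidy exits with 0 (ok), 1 (warnings only) or 2 (errors),
    # so only a status above 1 is treated as a failure.
    if result.returncode > 1:
        raise RuntimeError(result.stderr.decode("utf-8", "replace"))
    return result.stdout.decode("utf-8")
```

The returned string can then be fed to the parser of your choice, e.g.
`parser.feed(tidy_html(page))`.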


Bye,

    Walter Dörwald




