Tidy HTML, was: "<!" in SGMLParser - an error ?
Hernan M. Foffani
hfoffani at yahoo.com
Thu Nov 15 12:20:23 CET 2001
"David Bolen" <db3l at fitlinxx.com> wrote in message
news:ubsi7zc8x.fsf at ctwd0143.fitlinxx.com...
> David Eppstein <eppstein at ics.uci.edu> writes:
> > Sure. But if you want to parse HTML that you don't control, you are
> > going to have to be ready to handle invalid input and do something reasonable
> > with it.
> Yep - although "reasonable" could be to declare it invalid, depending on
> the problem space :-)
> I do think in this case the right thing is actually happening - the
> document is generating an SGMLParseError due to bad syntax. But true,
> I expect the original poster needs to determine how best to handle the
> problem document since I expect just rejecting it is not desirable.
With Python it is soooo easy to grab and extract data from remote pages
that it is really annoying when such pages aren't valid HTML.
It's unfair to require that htmllib & co. parse invalid HTML, though.
This problem can be solved with a simple routine that calls tidy
through a pipe before calling the parser.
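
A rough sketch of such a routine, for illustration only. It assumes the
HTML Tidy command-line program is installed and on the PATH as "tidy", and
that the flags in TIDY_CMD suit the pages at hand. Since sgmllib and htmllib
were removed in Python 3, the standard html.parser module stands in for them
here. The names tidy_html, LinkCollector and extract_links are made up for
this example.

```python
import subprocess
from html.parser import HTMLParser

# Assumed tidy invocation: quiet, emit XHTML, UTF-8 in and out,
# and still produce output when tidy reports errors.
TIDY_CMD = ("tidy", "-q", "-asxhtml", "-utf8", "--force-output", "yes")


def tidy_html(html, cmd=TIDY_CMD):
    """Send html to cmd on stdin and return what it writes to stdout."""
    proc = subprocess.run(
        list(cmd),
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # tidy exits with 1 for warnings and 2 for errors, so a nonzero
    # status alone is not treated as a failure here.
    if not proc.stdout:
        raise RuntimeError(proc.stderr.decode("utf-8", "replace"))
    return proc.stdout.decode("utf-8", "replace")


class LinkCollector(HTMLParser):
    """Hypothetical parser that records every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, cmd=TIDY_CMD):
    """Clean html through the pipe first, then hand it to the parser."""
    parser = LinkCollector()
    parser.feed(tidy_html(html, cmd))
    parser.close()
    return parser.links
```

The cmd argument is there so that a different cleaner (or different tidy
options) can be swapped in without touching the parser code.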