Tidy HTML, was: "<!" in SGMLParser - an error ?

Thu Nov 15 06:20:23 EST 2001

"David Bolen" <db3l at fitlinxx.com> wrote in message
news:ubsi7zc8x.fsf at ctwd0143.fitlinxx.com...
> David Eppstein <eppstein at ics.uci.edu> writes:
>
> > Sure.  But if you want to parse HTML that you don't control, you are going
> > to have to be ready to handle invalid input and do something reasonable
> > with it.
>
> Yep - although "reasonable" could be to declare it invalid, depending on
> the problem space :-)
>
> I do think in this case the right thing is actually happening - the
> document is generating an SGMLParseError due to bad syntax.  But true,
> I expect the original poster needs to determine how best to handle the
> problem document since I expect just rejecting it is not desirable.

With Python it is soooo easy to grab and extract data from remote
pages that it is really annoying when such pages aren't valid HTML.

It's unfair to require that htmllib & co. parse invalid HTML, though.
This problem can be solved with a simple routine that calls tidy
through a pipe before calling the parser.
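
A minimal sketch of such a routine follows. It makes several assumptions
that are not from this thread: that HTML Tidy is installed and reachable
on the PATH as "tidy", that a current Python with subprocess and
html.parser is used in place of htmllib/sgmllib, and that extracting
links is the goal. The names TIDY_CMD, tidy_html, LinkCollector and
extract_links are hypothetical, chosen only for illustration.

```python
import subprocess
from html.parser import HTMLParser

# Assumption: HTML Tidy is installed and on PATH as "tidy".
# -q suppresses nonessential output, -utf8 sets input/output encoding.
TIDY_CMD = ["tidy", "-q", "-utf8"]


def tidy_html(raw, cmd=TIDY_CMD):
    """Pipe raw HTML through an external filter and return its output."""
    proc = subprocess.run(
        cmd,
        input=raw.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # tidy exits with 0 (clean), 1 (warnings) or 2 (errors).
    # Warnings are expected for sloppy pages, so only 2 and above fail.
    if proc.returncode >= 2:
        raise RuntimeError(proc.stderr.decode("utf-8", errors="replace"))
    return proc.stdout.decode("utf-8", errors="replace")


class LinkCollector(HTMLParser):
    """Hypothetical parser that records the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(raw, cmd=TIDY_CMD):
    """Clean the page first, then hand the result to the parser."""
    parser = LinkCollector()
    parser.feed(tidy_html(raw, cmd))
    parser.close()
    return parser.links
```

Something like extract_links(page_text) would then be called on the
text fetched from the remote page. The cmd parameter is there so that
the filter can be swapped out, for example when tidy lives elsewhere or
needs extra options.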

Regards,
-Hernán