"<!" in SGMLParser - an error ?

David Bolen db3l at fitlinxx.com
Tue Nov 13 02:32:46 CET 2001

David Eppstein <eppstein at ics.uci.edu> writes:

> Sure.  But if you want to parse HTML that you don't control, you are going 
> to have to be ready to handle invalid input and do something reasonable 
> with it.

Yep - although "reasonable" could be to declare it invalid, depending
on the problem space :-)

I do think in this case the right thing is actually happening - the
document is generating an SGMLParseError due to bad syntax.  But true,
the original poster still needs to determine how best to handle the
problem document, since I expect just rejecting it is not desirable.

The code that is parsing it will have to decide what to do (e.g.,
"guess" at what the document author meant to write since it wasn't
valid HTML), but I don't think that's a problem with the parser module
- it just places a bigger burden on the application using the module
if it wants to support problem documents.
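
As a rough sketch of what "the application decides" could look like,
here is a small self-contained example.  Nothing in it comes from the
parser module itself: StrictParseError, strict_check and load_document
are hypothetical names made up for illustration, the check only looks
at "<!" sequences (a simplification of the real SGML rules), and the
"escape it as literal text" repair is just one possible guess at what
the author meant.

```python
import re


class StrictParseError(Exception):
    """Hypothetical stand-in for the error a strict parser raises."""


# Accept "<!" only when it starts a comment ("<!--") or a named
# declaration such as "<!DOCTYPE ...>".  This is a simplification
# of the real SGML rules, used here only to trigger the error path.
_BAD_DECL = re.compile(r"<!(?!--|[A-Za-z])")


def strict_check(text):
    """Return text unchanged, or raise at the first malformed "<!"."""
    match = _BAD_DECL.search(text)
    if match:
        raise StrictParseError(
            "malformed declaration at offset %d" % match.start())
    return text


def load_document(text, lenient=False):
    """Apply the application's policy to documents that fail the check."""
    try:
        return strict_check(text)
    except StrictParseError:
        if not lenient:
            # Policy 1: declare the document invalid.
            raise
        # Policy 2: guess that the author meant literal text and
        # escape the offending "<" so a strict parser can continue.
        return _BAD_DECL.sub("&lt;!", text)
```

With lenient=False the error propagates and the caller rejects the
document; with lenient=True the application takes on the guessing
itself, which is exactly the extra burden described above.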

Personally, I think this is precisely the sort of subtle issue that
arises when browsers try to be "nice" and accept invalid documents, and
then web sites "work best with IE 5.x or greater" or whatever.  It's
not just that you use non-standard features, but you end up with
undocumented heuristics in a limited number of browsers, which in turn
permit the writing of invalid documents, which then hurts folks trying
to write other parsers and applications to deal with those documents.
You end up not only having to properly parse SGML, but also having to
guess which deviations to tolerate so that you behave like the common
browsers, since that's what most people will have tested against (in
lieu of validating their HTML against the DTD or something).

It's not HTML itself that leads to this state, though, but the trend
of not obeying the standards.

-- David
 \               David Bolen            \   E-mail: db3l at fitlinxx.com  /
  |             FitLinxx, Inc.            \  Phone: (203) 708-5192    |
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150     \
