Handling bad tags with SGMLParser

Ken Causey ken at ineffable.com
Thu Mar 7 12:30:26 EST 2002


On Thu, 2002-03-07 at 11:12, Sean 'Shaleh' Perry wrote:
> > 
> > The user of SGMLParser needs to be able to handle invalid tags.  This
> > handling may be complex or as simple as just ignoring it and asking
> > SGMLParser to skip this tag and move along.  As far as I can tell this
> > is not an option.
> > 
> > As a side note, the text of the error message thrown is particularly
> > uninformative as it simply includes the first letter of the tag, in
> > other words always '<'.
> > 
> 
> match = special.match(rawdata, i)
> if match:
>     if self.literal:
>         self.handle_data(rawdata[i])
>         i = i+1
>         continue
>     # This is some sort of declaration; in "HTML as
>     # deployed," this should only be the document type
>     # declaration ("<!DOCTYPE html...>").
>     k = self.parse_declaration(i)
>     if k < 0: break
>     i = k
>     continue
> 
> is the offending code.  'special' is defined as re.compile(r'<![^<>]*>').
> 
> I see two options:
> 
> 1) change the definition of special to a noop match.  Something that is
> relatively cheap but can never match.
> 
> 2) write your own parse_declaration() method.
> 

Yes, but both of these changes would violate the object model of
SGMLParser, at least as evidenced by the module documentation.  They
would also, I suspect, be non-portable across Python versions.  For
example, the behaviour of 1.5.2 is quite different here.
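
To make option 1 concrete, here is a minimal sketch of a pattern that is
cheap to evaluate but can never match, next to the 'special' pattern
quoted above.  It only exercises the two regular expressions; the name
'never' is made up for this illustration, and actually rebinding the
'special' name inside sgmllib is exactly the kind of reliance on private
details that I am wary of.

```python
import re

# The pattern quoted above: matches any '<!...>' declaration.
special = re.compile(r'<![^<>]*>')

# A pattern that cannot match anything: the negative lookahead forbids
# 'x' at the very position where the pattern then requires 'x'.
# ('never' is a hypothetical name used only in this sketch.)
never = re.compile(r'(?!x)x')

sample = 'some text <!bogus> more text'

# The original pattern finds the declaration; the replacement finds nothing.
found_by_special = special.search(sample)
found_by_never = never.search(sample)

print(found_by_special.group(0) if found_by_special else None)
print(found_by_never)
```

With such a pattern in place the declaration branch would never be
entered, so '<!...>' sequences would fall through to whatever the parser
does with an unrecognised '<'.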

I'm inclined at the moment to report this situation as a bug.  But I
wanted to get some opinions as I'm neither an SGML expert nor fully
confident of my understanding of the SGMLParser class.

The easiest fix I can come up with at the moment is to modify the place
where SGMLParseError is raised so that the exception carries the
position of the error in rawdata.  The data could then be resubmitted,
skipping the offending tag (reporting the end point of the bad tag would
be even handier).  Of course, this also involves modifying private
implementation details.
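
A rough sketch of that idea follows.  It does not use or patch sgmllib
at all: strict_feed() is a toy stand-in for the parser, and
PositionedParseError, tolerant_feed() and VALID_DECL are names invented
for this illustration.  What counts as a "valid" declaration here is
also just an assumption to have something to reject.

```python
import re

# Same shape as the 'special' pattern quoted earlier.
SPECIAL = re.compile(r'<![^<>]*>')
# Assumption for this sketch only: DOCTYPE and comment-like
# declarations are "valid", everything else is rejected.
VALID_DECL = re.compile(r'<!(DOCTYPE|--)', re.IGNORECASE)


class PositionedParseError(Exception):
    """Parse error that records where the offending tag starts and ends."""

    def __init__(self, rawdata, start, end):
        self.start = start
        self.end = end
        self.text = rawdata[start:end]
        # Report the whole tag, not just its first character.
        Exception.__init__(
            self, 'bad declaration %r at offset %d' % (self.text, start))


def strict_feed(rawdata, handle_data):
    """Toy parser: pass plain text to handle_data, raise on a bad tag."""
    i = 0
    while i < len(rawdata):
        match = SPECIAL.search(rawdata, i)
        if match is None:
            handle_data(rawdata[i:])
            return
        if match.start() > i:
            handle_data(rawdata[i:match.start()])
        if not VALID_DECL.match(rawdata, match.start()):
            raise PositionedParseError(rawdata, match.start(), match.end())
        i = match.end()  # valid declarations are silently consumed


def tolerant_feed(rawdata, handle_data):
    """Resubmit the data after each bad tag; return the tags skipped."""
    skipped = []
    pos = 0
    while True:
        try:
            strict_feed(rawdata[pos:], handle_data)
            return skipped
        except PositionedParseError as err:
            skipped.append(err.text)
            pos += err.end  # err.end is relative to the slice just fed


chunks = []
skipped = tolerant_feed('a<!DOCTYPE html>b<!bogus>c', chunks.append)
print(chunks)
print(skipped)
```

The caller decides what to do with each bad tag; here it is merely
recorded and parsing resumes at the end point carried by the exception.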

By the way, I'm nikos on #python if anyone cares to discuss this in a
more "live" context.

Ken




