Handling bad tags with SGMLParser

Fri Mar 8 11:48:41 EST 2002

On Thu, 2002-03-07 at 11:12, Sean 'Shaleh' Perry wrote:
> > 
> > The user of SGMLParser needs to be able to handle invalid tags.  This
> > handling may be complex or as simple as just ignoring it and asking
> > SGMLParser to skip this tag and move along.  As far as I can tell this
> > is not an option.
> > 
> > As a side note, the text of the error message thrown is particularly
> > uninformative, as it simply includes the first character of the tag,
> > in other words always '<'.
> > 
> 
> match = special.match(rawdata, i)
> if match:
>     if self.literal:
>         self.handle_data(rawdata[i])
>         i = i+1
>         continue
>     # This is some sort of declaration; in "HTML as
>     # deployed," this should only be the document type
>     # declaration ("<!DOCTYPE html...>").
>     k = self.parse_declaration(i)
>     if k < 0: break
>     i = k
>     continue
> 
> is the offending code.  'special' is defined as re.compile(r'<![^<>]*>').
> 
> I see two options:
> 
> 1) change the definition of special to a no-op match: something that is
> relatively cheap but can never match.
> 
> 2) write your own parse_declaration() method.
> 
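
A minimal sketch of option 1, assuming an empty negative lookahead is an
acceptable "can never match" pattern.  The name never_match is made up for
illustration; in sgmllib the pattern would replace the module-level
'special'.  This only shows the regex half of the idea, not a patched
parser.

```python
import re

# Hypothetical replacement pattern: an empty negative lookahead fails at
# every position, so match() and search() always return None.
never_match = re.compile(r'(?!)')

# The pattern quoted above, reproduced for comparison.
special = re.compile(r'<![^<>]*>')

rawdata = 'text <!bogus declaration> more text'
i = rawdata.index('<')

original_hit = special.match(rawdata, i)     # matches '<!bogus declaration>'
patched_hit = never_match.match(rawdata, i)  # None, so the branch is skipped
```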

Well, I upgraded to Python 2.2, which didn't exactly fix the problem, but
changed its complexion somewhat.  For now I decided to hack up sgmllib
so that any SGMLParseError on a tag starting with <! simply results in
the tag being skipped.  It's not pretty, but it gets me past this
problem.  SGMLParser really needs to supply some mechanism for handling
the case of a bad tag.
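
Roughly, the skip-on-error idea looks like the sketch below.  This is not
the actual patch to sgmllib: ParseError is a stand-in for SGMLParseError,
and skip_tag, parse_or_skip and always_fails are hypothetical names used
only to show the control flow.

```python
class ParseError(Exception):
    """Stand-in for sgmllib's SGMLParseError (hypothetical, for this sketch)."""


def skip_tag(rawdata, i):
    """Return the index just past the '>' closing the tag at i, or -1."""
    end = rawdata.find('>', i)
    return end + 1 if end >= 0 else -1


def parse_or_skip(parse_declaration, rawdata, i):
    """Call the declaration parser; if it raises, skip the '<!...>' tag."""
    try:
        return parse_declaration(i)
    except ParseError:
        return skip_tag(rawdata, i)


def always_fails(i):
    # Simulates a declaration parser choking on a malformed tag.
    raise ParseError('unexpected char in declaration')


rawdata = 'before <!bad tag!> after'
start = rawdata.index('<!')
resume = parse_or_skip(always_fails, rawdata, start)
remaining = rawdata[resume:]  # parsing would continue from here
```

A return value of -1 mirrors the "k < 0" convention in the quoted loop,
meaning the closing '>' has not arrived yet.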

Ken