Handling bad tags with SGMLParser
Ken Causey
ken at ineffable.com
Fri Mar 8 11:48:41 EST 2002
On Thu, 2002-03-07 at 11:12, Sean 'Shaleh' Perry wrote:
> >
> > The user of SGMLParser needs to be able to handle invalid tags. This
> > handling may be complex or as simple as just ignoring it and asking
> > SGMLParser to skip this tag and move along. As far as I can tell this
> > is not an option.
> >
> > As a side note, the text error message thrown is particularly
> > uninformative as it simply includes the first letter of the tag, in
> > other words always '<'.
> >
>
> match = special.match(rawdata, i)
> if match:
>     if self.literal:
>         self.handle_data(rawdata[i])
>         i = i+1
>         continue
>     # This is some sort of declaration; in "HTML as
>     # deployed," this should only be the document type
>     # declaration ("<!DOCTYPE html...>").
>     k = self.parse_declaration(i)
>     if k < 0: break
>     i = k
>     continue
>
> is the offending code. 'special' is defined as re.compile(r'<![^<>]*>').
>
> I see two options:
>
> 1) change the definition of special to a no-op pattern: something that
> is relatively cheap to evaluate but can never match.
>
> 2) write your own parse_declaration() method.
>
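For option 1, one cheap pattern that can never match is an empty
negative lookahead. This is only a sketch: NEVER_MATCH is a name made up
here, `special` below is a local copy of the pattern quoted above, and
the final assignment into sgmllib is left as a comment because it
assumes the module exposes `special` under that name.

```python
import re

# An empty negative lookahead fails at every position, so this pattern
# can never match anything.
NEVER_MATCH = re.compile(r'(?!)')

# Local copy of the pattern quoted above, for comparison only.
special = re.compile(r'<![^<>]*>')

sample = 'text <!bogus stuff> more'
i = sample.index('<')

print(special.match(sample, i) is not None)   # the original pattern matches here
print(NEVER_MATCH.match(sample, i) is None)   # the replacement does not

# Assumption: with the sgmllib quoted above, the swap would look like
#     import sgmllib
#     sgmllib.special = NEVER_MATCH
```

What the parser does with the '<' once `special` stops matching depends
on the rest of the loop, which is not quoted above, so this needs
testing against the sgmllib version actually in use.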
Well, I upgraded to Python 2.2, which didn't exactly fix the problem but
changed its complexion somewhat. For now I decided to hack up sgmllib
so that any SGMLParseError on a tag starting with <! simply results in
the tag being skipped. It's not pretty, but it gets me past this
problem. SGMLParser really needs to supply some mechanism for handling
the case of a bad tag.
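
A rough sketch of that kind of skip-on-error behaviour, written as a
parse_declaration() override rather than as the actual patch to sgmllib.
Everything here is a hypothetical stand-in: ParseError plays the role of
SGMLParseError, and StrictParser is a toy class whose
parse_declaration() follows the convention in the quoted code (return
the index just past the declaration, or a negative number if it is
incomplete).

```python
class ParseError(Exception):
    """Hypothetical stand-in for sgmllib's SGMLParseError."""


class StrictParser:
    """Toy parser (hypothetical): accepts only <!DOCTYPE ...> declarations."""

    def __init__(self, rawdata):
        self.rawdata = rawdata

    def parse_declaration(self, i):
        # Return the index just past the declaration, or -1 if the
        # closing '>' has not arrived yet.
        end = self.rawdata.find('>', i)
        if end < 0:
            return -1
        if not self.rawdata.startswith('<!DOCTYPE', i):
            # Report the whole tag, not just its first character.
            raise ParseError('unexpected declaration: %r'
                             % self.rawdata[i:end + 1])
        return end + 1


class LenientParser(StrictParser):
    """Skips any <! tag the strict parser rejects instead of raising."""

    def parse_declaration(self, i):
        try:
            return StrictParser.parse_declaration(self, i)
        except ParseError:
            # Bad tag: skip to just past the next '>' and carry on.
            end = self.rawdata.find('>', i)
            return end + 1 if end >= 0 else -1
```

A caller that wants to log or otherwise react to the bad tag could do so
in the except branch before skipping it.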
Ken