[Python-Dev] Patches: 1 for the price of 10.

Titus Brown titus at caltech.edu
Thu Dec 23 02:34:23 CET 2004


-> Jeremy Hylton wrote:
-> >I got started on these this morning, will likely finish them tomorrow.
-> > It would be perverse to apply your patch last, wouldn't it?
-> 
-> It turns out that Titus' patch might be more involved than he thought
-> it would be.

*shrug* that's life ;).  I stole my patch from the other HTMLParser &
thought that would be sufficient; now I'll have to fix both!

-> In any case, the review itself is a highly appreciated contribution.

It was very educational; just wish I could remember to always submit
context diffs!  <sigh>

The only patch that I think deserves some actual discussion -- here or on
c.l.p, not sure which -- is patch 755660, which deals with
HTMLParser.HTMLParser.  The goal of the original submitter was to allow
subclasses of HTMLParser to deal with bad HTML in a more robust way;
essentially this comes down to allowing returns from self.error() calls.

I have now come across the same problem in my work with PBP
(pbp.berlios.de): it turns out that many Web pages (such as the
SourceForge mailman admindb page...) contain errors that cause
HTMLParser to raise an exception.  It's simply not possible to reliably
change this behavior within either htmllib.HTMLParser or
HTMLParser.HTMLParser as they're currently written.  This is a big
problem for people basing packages on either HTMLParser class.

An additional problem is that both HTMLParser.HTMLParser and
htmllib.HTMLParser are based on other classes that call self.error(), so
those base classes would have to be altered to fit the new behavior.

What I proposed doing in my comment on patch 755660 was changing
HTMLParser.HTMLParser (and its base class markupbase, too) to call
_fail() when a hard failure was called for, and otherwise to call
error() and proceed parsing on an as-best-possible basis.  This wouldn't
change the *behavior* of the existing code, but would allow for it to be
overridden when necessary.
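
A minimal sketch of that split, using a toy tag scanner rather than the
real HTMLParser code.  ParseError, ToyParser and LenientParser are
hypothetical names made up for illustration; only error() and _fail()
come from the proposal above, and the recovery strategy (skip the bad
tag) is just an assumption:

```python
class ParseError(Exception):
    """Raised when parsing cannot or should not continue."""


class ToyParser:
    """Toy tag scanner; a stand-in for HTMLParser, not its real code."""

    def __init__(self):
        self.tags = []

    def error(self, message):
        # Recoverable problem.  Default behavior is unchanged: raise.
        # A subclass may override this and return to keep parsing.
        raise ParseError(message)

    def _fail(self, message):
        # Hard failure: there is no sensible way to continue.
        raise ParseError(message)

    def feed(self, data):
        i = 0
        while i < len(data):
            if data[i] != "<":
                i += 1
                continue
            end = data.find(">", i)
            if end == -1:
                self._fail("unterminated tag at offset %d" % i)
            name = data[i + 1:end].strip()
            if not name.replace("/", "").isalnum():
                self.error("malformed tag %r" % name)
                # error() returned: skip the bad tag and carry on,
                # leaving the parser in a consistent state.
                i = end + 1
                continue
            self.tags.append(name)
            i = end + 1


class LenientParser(ToyParser):
    """Records recoverable errors instead of raising."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def error(self, message):
        self.problems.append(message)
```

With this layout ToyParser().feed("<a><b!@><c>") still raises, as
before, while LenientParser collects the tags "a" and "c" and records
one problem.  An unterminated tag goes through _fail() and raises in
both classes.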

Right now the error() call is undocumented, and so it's probably
ok to change what happens upon return.  As is, it can leave the parser
in a borked state upon return, and that's the behavior I propose to
fix.

I'd of course be willing to do the work & submit the more involved
patch.

Your opinions are not only welcome but (as I understand it) necessary ;).

cheers,
--titus

