
-> Jeremy Hylton wrote:
-> >I got started on these this morning, will likely finish them tomorrow.
-> > It would be perverse to apply your patch last, wouldn't it?
->
-> It turns out that Titus' patch might be more involved than he thought
-> it would be.
*shrug* that's life ;). I stole my patch from the other HTMLParser & thought that would be sufficient; now I'll have to fix both!
-> In any case, the review itself is a highly appreciated contribution.
It was very educational; just wish I could remember to always submit context diffs! <sigh>
The only patch that I think deserves some actual discussion -- here or on c.l.p, not sure which -- is patch 755660, which deals with HTMLParser.HTMLParser. The goal of the original submitter was to allow subclasses of HTMLParser to deal with bad HTML in a more robust way; essentially this comes down to allowing returns from self.error() calls.
I have now come across the same problem in my work with PBP (pbp.berlios.de): it turns out that many Web pages (such as the SourceForge mailman admindb page...) contain errors that cause HTMLParser to raise an exception. It's simply not possible to reliably change this behavior within either htmllib.HTMLParser or HTMLParser.HTMLParser as they're currently written. This is a big problem for people basing packages on either HTMLParser class.
An additional problem is that both HTMLParser.HTMLParser and htmllib.HTMLParser are based on other classes that call self.error(), so those base classes would have to be altered to fit the new behavior.
What I proposed doing in my comment on patch 755660 was changing HTMLParser.HTMLParser (and its base class markupbase, too) to call _fail() when a hard failure was called for, and otherwise to call error() and proceed parsing on an as-best-possible basis. This wouldn't change the *behavior* of the existing code, but would allow for it to be overridden when necessary.
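To make the split concrete, here is a rough sketch of the pattern. It is a toy, not the actual HTMLParser or markupbase code: TinyParser, LenientParser and ParseFailure are made-up names for illustration, and only the error() / _fail() division corresponds to what I'm proposing.

```python
class ParseFailure(Exception):
    """Raised when parsing stops (hypothetical name, for illustration)."""


class TinyParser:
    """Toy stand-in for a markup parser; NOT the real HTMLParser."""

    def __init__(self):
        self.tags = []

    def error(self, message):
        # Recoverable problem. The default still raises, so existing
        # behavior is unchanged; a subclass may override and return.
        raise ParseFailure(message)

    def _fail(self, message):
        # Hard failure: there is no sensible way to keep parsing.
        raise ParseFailure(message)

    def feed(self, data):
        pos = 0
        while True:
            start = data.find("<", pos)
            if start == -1:
                return
            end = data.find(">", start)
            if end == -1:
                self._fail("unterminated tag at offset %d" % start)
            name = data[start + 1:end].strip()
            if name.replace("/", "").isalnum():
                self.tags.append(name)
            else:
                self.error("malformed tag %r" % name)
                # If error() returned, skip the bad tag and carry on
                # from a known-good position.
            pos = end + 1


class LenientParser(TinyParser):
    """Subclass that records recoverable errors instead of raising."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def error(self, message):
        self.problems.append(message)


lenient = LenientParser()
lenient.feed("<a><b!><c>")
print(lenient.tags, lenient.problems)
```

The point is that the parser, not the subclass, is responsible for being in a sane state after error() returns; anything it can't recover from goes through _fail() instead.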
Right now the error() call is undocumented, so it's probably ok to change what happens upon return. As is, it can leave the parser in a borked state upon return, and that's the behavior I propose to fix.
I'd of course be willing to do the work & submit the more involved patch.
Your opinions are not only welcome but (as I understand it) necessary ;).
cheers,
--titus