[Patches] [ python-Patches-755660 ] allow HTMLParser to continue after a parse error
SourceForge.net
noreply at sourceforge.net
Sun Dec 19 01:45:28 CET 2004
Patches item #755660, was opened at 2003-06-16 19:27
Message generated for change (Comment added) made by titus
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=755660&group_id=5470
Category: Library (Lib)
Group: Python 2.3
Status: Open
Resolution: None
Priority: 5
Submitted By: Steven Rosenthal (smroid)
Assigned to: Nobody/Anonymous (nobody)
Summary: allow HTMLParser to continue after a parse error
Initial Comment:
The HTMLParser.error method raises HTMLParseError,
terminating the parse upon detection of a parse error.
This patch is to allow HTMLParser to continue parsing
if the error() method is overridden to not throw an
exception.
Doc impact is on the error() method. The existing
test_htmlparser.py unit test is unaffected by the patch.
The base file is HTMLParser.py, revision 1.11.2.1
----------------------------------------------------------------------
Comment By: Titus Brown (titus)
Date: 2004-12-18 16:45
Message:
Logged In: YES
user_id=23486
This patch allows developers to override the behavior of HTMLParser
when parsing malformed HTML. Normally HTMLParser calls the function
self.error(), which raises an exception. This patch adds appropriate
return values for situations where self.error has been redefined in
subclasses to *not* raise an exception.
It does not change the default behavior of HTMLParser and so presents
no backwards compatibility issues.
The patch itself consists of an added comment and two added lines of
code that call 'return' with appropriate values after a self.error call.
Nothing wrong with 'em. I can't verify that the "junk characters" error
call will leave the parser in a good state, though, if execution returns
from error().
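To illustrate the pattern the patch relies on, here is a minimal sketch. It is
not the HTMLParser source: ToyParser, LenientParser, ParseError and parse_tag
are hypothetical names invented for this example. The point is only that a
'return' placed after the self.error() call gives the caller a usable value
when a subclass has redefined error() so that it does not raise.

```python
class ParseError(Exception):
    """Stand-in for HTMLParseError (hypothetical, for illustration only)."""


class ToyParser:
    """Hypothetical miniature parser; not the real HTMLParser code."""

    def __init__(self):
        self.tags = []

    def error(self, message):
        # Default behavior: abort the parse by raising.
        raise ParseError(message)

    def parse_tag(self, text, i):
        """Parse '<name>' at index i; return the index after it, or -1."""
        end = text.find(">", i)
        name = text[i + 1:end] if end != -1 else ""
        if end == -1 or not name.isalnum():
            self.error("malformed tag at offset %d" % i)
            # Reached only if error() was overridden so it does not raise.
            return -1
        self.tags.append(name)
        return end + 1

    def feed(self, text):
        i = 0
        while i < len(text):
            if text[i] == "<":
                j = self.parse_tag(text, i)
                # On failure, skip the '<' and carry on with the next char.
                i = j if j != -1 else i + 1
            else:
                i += 1


class LenientParser(ToyParser):
    """Subclass that records parse errors instead of raising."""

    def __init__(self):
        super().__init__()
        self.errors = []

    def error(self, message):
        self.errors.append(message)
```

With the default error(), feeding "<a><b c><d>" to ToyParser raises ParseError
at the second tag; LenientParser records the message and keeps going. Whether
the parser is left in a sensible state after returning is exactly the open
question raised above, and this toy does not settle it for the real code.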
The library documentation could be updated to reflect the ability to
override error() behavior; I've written a short patch, available at
http://issola.caltech.edu/~t/transfer/HTMLParser-doc-error.patch
More problems exist with markupbase.py, upon which HTMLParser is based.
markupbase calls error() as well, and has some stickier situations. See
comments in bug 917188 as well.
Comments in 683938 and 699079 suggest that raising an exception is the
correct response to the parse errors. I recommend application of the
patch anyway, because it (a) doesn't change any behavior by default
and (b) may solve some problems for people.
An alternative would be to distinguish between unrecoverable errors
and recoverable errors by having two different functions, e.g. error()
(for recoverable errors) and _fail() (for unrecoverable errors). By
default error() would call _fail(), and internal code could be changed
to call _fail() where recovery is impossible. This might alter behavior
in situations where subclasses override error(), but then again that's
not legitimate to do anyway, at least not at the moment -- error() isn't
in the docs ;).
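The error()/_fail() split described above might look roughly like the sketch
below. This is an assumption about how such a design could be laid out, not
code from HTMLParser or markupbase; TwoTierParser, Tolerant and ParseError are
hypothetical names, and the choice of which condition counts as recoverable is
made up for the example.

```python
class ParseError(Exception):
    """Stand-in for HTMLParseError (hypothetical, for illustration only)."""


class TwoTierParser:
    """Hypothetical parser separating recoverable from fatal errors."""

    def _fail(self, message):
        # Unrecoverable: always raises. Not meant to be overridden.
        raise ParseError(message)

    def error(self, message):
        # Recoverable: delegates to _fail() by default, so behavior is
        # unchanged unless a subclass overrides this method.
        self._fail(message)

    def feed(self, text):
        names = []
        i = 0
        while i < len(text):
            if text[i] != "<":
                i += 1
                continue
            end = text.find(">", i)
            if end == -1:
                # No way to resynchronize, so this is unrecoverable.
                self._fail("unterminated tag at offset %d" % i)
            name = text[i + 1:end]
            if name.isalnum():
                names.append(name)
            else:
                # Tag boundaries are known, so skipping the tag is safe.
                self.error("bad tag name %r" % name)
            i = end + 1
        return names


class Tolerant(TwoTierParser):
    def error(self, message):
        pass  # ignore recoverable errors and keep parsing
```

Here Tolerant().feed("<a><b c><d>") skips the bad tag and returns the other
two names, while an unterminated tag still raises because it goes through
_fail() directly.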
If nothing is done, at least close patch 755660 and bug 736428 with a
comment saying that this behavior will not be addressed ;).
----------------------------------------------------------------------
Comment By: Steven Rosenthal (smroid)
Date: 2003-06-17 20:13
Message:
Logged In: YES
user_id=159908
this fixes bug #736428 (submitted by me earlier)
----------------------------------------------------------------------