Re: [lxml-dev] Some HTML target processing issues

Hi, please keep the list involved.
Max Ivanov wrote:
Then how could I add tolerance for unknown tags to HTMLParser?
You can't change the parser. It already parses with the "recover" option, so it tries to keep going as long as possible. The problem here is that when you use a target parser, it currently raises an exception at the end if errors occurred during the parsing. It *might* be better to disable that based on the recover option, but I'll have to look into that.
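For readers unfamiliar with target parsers, the following is a minimal sketch of the interface being discussed, not code from this thread. It assumes lxml is installed (the import is guarded so the snippet still loads without it); the names CollectingTarget and collect_events are hypothetical, chosen only for illustration. It feeds well-formed HTML, so it does not reproduce the exception described above.

```python
try:
    from lxml import etree
except ImportError:  # assumption: lxml may be missing in some environments
    etree = None


class CollectingTarget:
    """Hypothetical target object that records the parser's callbacks."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        self.events.append(("start", tag))

    def end(self, tag):
        self.events.append(("end", tag))

    def data(self, text):
        self.events.append(("data", text))

    def close(self):
        # Whatever close() returns becomes the result of parser.close().
        return self.events


def collect_events(html_text):
    """Parse html_text through a target parser and return the recorded events."""
    parser = etree.HTMLParser(target=CollectingTarget(), recover=True)
    parser.feed(html_text)
    return parser.close()


if etree is not None:
    for event in collect_events("<p>hi</p>"):
        print(event)
```

The point of the sketch is that the target's close() is the only place where the caller gets a result back, which is why it matters whether close() is still called after a parse error.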
Can you come up with a patch with a couple of simple test cases for src/lxml/tests/test_htmlparser.py that show the three problems you describe? That usually makes them easier (read: faster) to fix. There are some target parser test cases in test_etree.py and test_elementtree.py that you can look at for inspiration.
Thx, I'll try to write tests, but I've never done it before. Writing them looks quite clear, but I have no idea how to run the tests themselves.
It's pretty easy. Each test has a method in the test case class that will be called by the test runner. Reading a few of the existing test methods should get you going. There is a script "test.py" in the root directory that you can call to run the tests ("make test" does that, for example). It will walk through the directory hierarchy and collect all test classes it finds into a unit test suite (based on the unittest module), and then run them. Try "python test.py -vv" to get some verbose output.
Stefan
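The mechanism described above (test methods on a test case class, collected into a suite and run) can be sketched with the standard unittest module alone. This is not lxml's test.py, whose internals are not shown in this thread; the class ExampleTargetTest and the helper run_suite are hypothetical names used only for illustration.

```python
import io
import unittest


class ExampleTargetTest(unittest.TestCase):
    """Hypothetical test case: each test_* method is one test."""

    def test_close_is_recorded(self):
        result = []
        result.append(True)  # stands in for a target's close() being called
        self.assertEqual(result[-1], True)


def run_suite():
    """Collect the test methods of the class into a suite and run it."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTargetTest)
    stream = io.StringIO()  # capture the verbose report instead of printing it
    runner = unittest.TextTestRunner(stream=stream, verbosity=2)
    return runner.run(suite)


outcome = run_suite()
print(outcome.testsRun, outcome.wasSuccessful())
```

A new test for test_htmlparser.py follows the same shape: add a method whose name starts with "test" to an existing test case class and the runner will pick it up.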

Here is one test for the problem of the target's close() method not being called when XMLSyntaxError is raised during SAX-like parsing, even with recover=True. This is an addition to test_htmlparser.py:

    def test_module_target_on_raise_stop(self):
        class Target(object):
            def __init__(self, res):
                self.res = res
            def start(self, tag, attrib):
                pass
            def end(self, tag):
                pass
            def close(self):
                self.res.append(True)

        result = []
        parser = self.etree.HTMLParser(target=Target(result), recover=True)
        parse = self.etree.parse
        f = BytesIO(self.broken_html_str)
        self.assertRaises(self.etree.XMLSyntaxError, parse, f, parser)
        self.assertEqual(result[-1], True)

Hi,
Max Ivanov wrote:
Here is one test for problem with not calling targets' close() method when XMLSyntaxError is raised during SAX-like parsing even with recover=True. This is addition to test_htmlparser.py:

    def test_module_target_on_raise_stop(self):
        class Target(object):
            def __init__(self, res):
                self.res = res
            def start(self, tag, attrib):
                pass
            def end(self, tag):
                pass
            def close(self):
                self.res.append(True)

        result = []
        parser = self.etree.HTMLParser(target=Target(result), recover=True)
        parse = self.etree.parse
        f = BytesIO(self.broken_html_str)
        self.assertRaises(self.etree.XMLSyntaxError, parse, f, parser)
        self.assertEqual(result[-1],True)

Thanks, could you file a bug report in the Launchpad bug tracker so that this doesn't get lost?
Stefan
participants (2)
- Max Ivanov
- Stefan Behnel