[lxml-dev] Bug? lxml.html produces bogus results if HTML contains control characters
Hi,

I'm a big fan of lxml.html, but I think I've just found a bug in it. Here's a test case:

    html = """
    <html><body>
    <table>
    <tr>
    <td>one</td>
    <td>\x05two</td>
    <td>three</td>
    </tr>
    </table>
    </body></html>
    """

    import lxml.html

    tree = lxml.html.fromstring(html)
    xpath = "/descendant::table"
    cells = tree.xpath(xpath)[0].getchildren()[0].getchildren()
    print [cell.text_content() for cell in cells]
    # prints ['one', '']

    tree = lxml.html.fromstring(html.replace("\x05",""))
    cells = tree.xpath(xpath)[0].getchildren()[0].getchildren()
    print [cell.text_content() for cell in cells]
    # prints ['one', 'two', 'three']

The apparent bug is that lxml.html fails to parse the above HTML properly when it contains a control character (\x05, a.k.a. ^E). Obviously, well-formed HTML will not contain such characters, but real-world HTML often does. The library does not report an error; it simply truncates the row at the control character, which made this behavior tricky for me to track down.

I'm running the following versions:

    lxml.etree:      (2, 2, 2, 0)
    libxml used:     (2, 7, 6)
    libxml compiled: (2, 7, 5)
    libxslt used:    (1, 1, 24)
    libxslt compiled: (1, 1, 26)

I must admit that I do not know if this is a bug in lxml or in libxml, but I am certainly willing to help investigate.

Best,
Joe
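The test case above works around the problem by removing the single \x05 character with html.replace(). As a rough sketch only, the same idea can be generalized to strip every control character before the markup reaches the parser. The helper name strip_control_chars and the exact set of characters removed (the C0 controls other than tab, newline and carriage return, plus DEL) are assumptions made for illustration; they are not part of lxml or libxml.

```python
import re

# Hypothetical helper, not part of lxml: matches C0 control characters
# except tab (\x09), newline (\x0a) and carriage return (\x0d), plus DEL.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def strip_control_chars(text):
    """Return text with the control characters matched above removed."""
    return _CONTROL_CHARS.sub("", text)


# Same row as in the test case, including the stray \x05.
html = "<table><tr><td>one</td><td>\x05two</td><td>three</td></tr></table>"
cleaned = strip_control_chars(html)
```

The cleaned string can then be passed to lxml.html.fromstring() in place of the raw input. This only sidesteps the truncation; it does not answer whether the underlying behavior is a bug in lxml or in libxml.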
participants (1)
Joseph Barillari