Mailman 3 [lxml-dev] Bug? lxml.html produces bogus results if HTML contains control characters - lxml - The Python XML Toolkit

Nov. 9, 2009

Hi,

I'm a big fan of lxml.html, but I think I've just found a bug in it:

Here's a test case.

html = """
<html><body>

<table>
<tr>
<td>one</td>
<td>\x05two</td>
<td>three</td>
</tr>
</table>
</body></html>
"""

import lxml.html

# Parse the HTML that contains the control character.
tree = lxml.html.fromstring(html)
xpath = "/descendant::table"
# The table's first child is the row; the row's children are the cells.
cells = tree.xpath(xpath)[0].getchildren()[0].getchildren()
print([cell.text_content() for cell in cells])
# prints ['one', '']
# Parse the same HTML with the control character removed.
tree = lxml.html.fromstring(html.replace("\x05", ""))
cells = tree.xpath(xpath)[0].getchildren()[0].getchildren()
print([cell.text_content() for cell in cells])
# prints ['one', 'two', 'three']

The apparent bug is that lxml.html fails to parse the above HTML
properly when it contains a control character (\x05, a.k.a. ^E).
Obviously, well-formed HTML will not contain such characters, but
real-world HTML often does. The library does not report an error, but
silently truncates the row at the control character, which made this
behavior tricky for me to track down.
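
As a stopgap, the replace() call in the test case could be generalized
to strip every control character before parsing. The following is only
a sketch: the helper name strip_control_chars and the choice of which
characters to drop (C0 controls other than tab, newline and carriage
return, plus DEL) are my own assumptions, not something lxml provides.

```python
import re

# Assumption: treat C0 control characters (except tab, newline and
# carriage return) and DEL as junk to be removed before parsing.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def strip_control_chars(text):
    """Return text with the control characters above removed.

    Hypothetical helper for illustration; it is not part of lxml.
    """
    return _CONTROL_CHARS.sub("", text)


cleaned = strip_control_chars("<td>\x05two</td>")
```

The cleaned string can then be passed to lxml.html.fromstring() in place
of the original, just like html.replace("\x05", "") in the test case.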

I'm running the following versions:

lxml.etree:        (2, 2, 2, 0)
libxml used:       (2, 7, 6)
libxml compiled:   (2, 7, 5)
libxslt used:      (1, 1, 24)
libxslt compiled:  (1, 1, 26)

I must admit that I do not know if this is a bug in lxml or in libxml,
but I am certainly willing to help investigate.

Best,

Joe

Joseph Barillari
