htmlparser handling '<' inside valid tags - lxml - The Python XML Toolkit

Oct. 27, 2011

Hi all,

Thanks for a great library.
While using lxml for webscraping I discovered the following behavior:

        import lxml.html
        ...
        from StringIO import StringIO
        parser = lxml.html.HTMLParser()
        to_parse = "<html><a><<back</a></html>"
        tree = lxml.html.parse(StringIO(to_parse), parser)
        print lxml.html.tostring(tree, pretty_print=False, method="html")
This results in the text inside the 'a' tag being discarded:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><a></a></body></html>

Using "<back" inside the <a></a> instead results in the creation of a
non-HTML tag 'back':

<html><body><a><back></back></a></body></html>

In both cases this is not what you would want when scraping. Is this a known
issue or bug? Is there a way around this?

Nico de Groot

Python              : 2.7.1
lxml.etree          : (2, 3, 0, 0)
libxml used         : (2, 7, 7)
libxml compiled     : (2, 7, 7)
libxslt used        : (1, 1, 26)
libxslt compiled    : (1, 1, 26)
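
One possible way around the behavior described above is to escape stray '<' characters before the markup reaches the parser. The following is only a minimal sketch under stated assumptions, not something lxml itself provides: the helper name escape_stray_lt and the allow-list KNOWN_TAGS are hypothetical, and the approach assumes that any '<' not starting one of the listed tag names is literal text. It does not handle '<' inside script blocks or attribute values.

```python
import re

# Hypothetical allow-list; extend it with the tags your pages actually use.
KNOWN_TAGS = {"html", "head", "body", "a", "p", "div", "span", "b", "i", "br"}

# A '<' (not starting a comment, doctype or processing instruction),
# an optional '/', then an optional run of letters/digits (a candidate tag name).
_LT = re.compile(r"<(?![!?])(/?)([A-Za-z][A-Za-z0-9]*)?")


def escape_stray_lt(markup, known_tags=KNOWN_TAGS):
    """Replace every '<' that does not start a known tag with '&lt;'."""
    def fix(match):
        name = match.group(2)
        if name and name.lower() in known_tags:
            return match.group(0)  # looks like a real tag: keep it unchanged
        return "&lt;" + match.group(0)[1:]  # stray '<': escape it
    return _LT.sub(fix, markup)


cleaned = escape_stray_lt("<html><a><<back</a></html>")
```

The cleaned string can then be passed to the parser in the same way as to_parse in the snippet above. Whether an allow-list is acceptable depends on how well the set of tags in the scraped pages is known in advance.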
