htmlparser handling '<' inside valid tags
Hi all,

Thanks for a great library. While using lxml for web scraping I discovered the following behavior:

    import lxml.html  # a bare "import lxml" does not load the lxml.html submodule
    from StringIO import StringIO

    parser = lxml.html.HTMLParser()
    to_parse = "<html><a><<back</a></html>"
    tree = lxml.html.parse(StringIO(to_parse), parser)
    print lxml.html.tostring(tree, pretty_print=False, method="html")

results in discarding the text inside the tag 'a':

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html><body><a></a></body></html>

and using "<back" inside the <a></a> results in the creation of a non-HTML tag 'back':

    <html><body><a><back></back></a></body></html>

In both cases this is not what you would want when scraping. Is this a known issue or bug? Is there a way around this?

Nico de Groot

Python : 2.7.1
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 7)
libxml compiled : (2, 7, 7)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
Hi Nico,

"<html><a><<back</a></html>" is not valid HTML; "<html><a>&lt;&lt;back</a></html>" is :) A bare "<back" is treated as the start of an HTML tag, and that is normal behavior.

You might want to read about the ElementSoup parser: http://lxml.de/elementsoup.html

Regards,
Piotr

2011/10/27 Nico de Groot <ndegroot0@gmail.com>:
> Is this a known issue or bug? Is there a way around this?
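
The escaping point in the reply can be sketched in a few lines. This is only an illustration under stated assumptions: it is written for Python 3 (the thread itself uses Python 2.7), it uses the standard library's html.parser as a stand-in for lxml so that it runs without extra packages, and the names wrap_in_anchor, TextCollector and text_of are hypothetical helpers, not part of lxml. It only helps when you control how the markup is generated; it does not repair pages that already contain a bare '<'.

```python
from html import escape
from html.parser import HTMLParser


def wrap_in_anchor(text):
    """Return markup whose <a> element carries `text` as literal text."""
    # escape() turns '<', '>' and '&' into entities, so a parser cannot
    # mistake the text for the start of a tag.
    return "<html><a>{}</a></html>".format(escape(text, quote=False))


class TextCollector(HTMLParser):
    """Hypothetical helper: collects the character data seen between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Entities are already decoded here (convert_charrefs defaults to True).
        self.chunks.append(data)


def text_of(markup):
    """Parse `markup` and return all text content joined together."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)


markup = wrap_in_anchor("<<back")
print(markup)           # <html><a>&lt;&lt;back</a></html>
print(text_of(markup))  # <<back
```

Feeding the escaped string to lxml.html instead should likewise keep "<<back" as the text of the 'a' element, since the entities are decoded only after tag boundaries have been recognized.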
_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml@lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml
Participants (2): Nico de Groot, Piotr Owcarz