htmlparser handling '<' inside valid tags
Hi all,

Thanks for a great library. While using lxml for web scraping I discovered the following behavior:

import lxml

results in discarding the text inside the tag 'a':

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><a></a></body></html>

and using "<back" inside the <a></a> results in the creation of a non-HTML tag 'back':

<html><body><a><back></back></a></body></html>

In both cases this is not what you would want when scraping. Is this a known issue or a bug? Is there a way around this?

Nico de Groot

Python : 2.7.1
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 7)
libxml compiled : (2, 7, 7)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
Hi Nico,

"<html><a><<back</a></html>" is not valid HTML; "<html><a>&lt;&lt;back</a></html>" is :)

The "<back" is treated as an HTML tag, and that is normal behavior. You might want to read about the ElementSoup parser: http://lxml.de/elementsoup.html

Regards,
Piotr

2011/10/27 Nico de Groot <ndegroot0@gmail.com>
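
To illustrate the point about escaping, here is a minimal sketch, assuming Python 3 (for html.escape) and that you generate the markup yourself. The helper names build_link_markup and link_text are made up for this example and are not part of lxml. For scraped pages you do not control, a more lenient parser such as the one linked above is the route to look at.

```python
from html import escape

try:
    from lxml import html as lxml_html
except ImportError:  # lxml not installed; only the escaping step will run
    lxml_html = None


def build_link_markup(text):
    """Wrap `text` in an <a> element, escaping '<', '>' and '&' first."""
    return "<html><body><a>{}</a></body></html>".format(escape(text))


def link_text(markup):
    """Return the text of the first <a> element, or None if lxml is missing."""
    if lxml_html is None:
        return None
    root = lxml_html.fromstring(markup)
    return root.find(".//a").text


# A literal '<' in text content has to be written as the entity &lt;
markup = build_link_markup("<<back")
print(markup)

# With the entities in place the parser sees text, not the start of a tag
print(link_text(markup))
```

With the text escaped, the parser has no reason to read "<back" as the start of a tag, so the content of the 'a' element should come back as plain text.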
participants (2)
- Nico de Groot
- Piotr Owcarz