Lxml fails to parse httpbin.org example utf-8 page
Hello all,

I was running some tests with lxml and decided to try it on the test response pages of httpbin.org. lxml fails to parse the example UTF-8 page supplied by httpbin.org 'correctly'. The page can be found here: http://httpbin.org/encoding/utf8.

Here is a reproduction of the case:

> import requests
> r = requests.get("http://httpbin.org/encoding/utf8")
> html = r.text
> print(html)
[...]
> from lxml import etree
> etree_parser = etree.HTMLParser(encoding='utf-8')
> tree = etree.fromstring(html, parser=etree_parser)
> new_html = etree.tostring(tree, method='html', encoding='utf-8')
> print(new_html)
[...]

The new_html output is truncated after a `<` character inside the `pre` tag of the original response. I presume this is because lxml tries to interpret the `<` character as the start of an HTML tag.

Does lxml have any heuristics for deciding whether to interpret a lone `<` character as text rather than as the start of an HTML tag?

Cheers
Austin
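To make the question about a lone `<` concrete, here is a minimal sketch of one possible heuristic. It is not what lxml or libxml2 does internally; it follows the HTML Standard's tokenizer rule that a `<` can only open markup when the next character is an ASCII letter, `/`, `!` or `?`. The helper name `escape_lone_lt` and the sample string are hypothetical, introduced only for illustration.

```python
import re

# Assumption: a `<` opens markup only when followed by an ASCII letter,
# `/`, `!` or `?` (the HTML Standard's "tag open" rule). Any other `<`,
# including one at the very end of the input, is treated as plain text.
_LONE_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_lone_lt(html: str) -> str:
    """Replace each `<` that cannot open markup with the entity `&lt;`.

    Hypothetical pre-processing helper. It does not special-case
    <script>, <style> or comments, where `<` must be left alone.
    """
    return _LONE_LT.sub("&lt;", html)


sample = "<pre>a < b, x <= y, ∀x<∞</pre>"
print(escape_lone_lt(sample))
# prints: <pre>a &lt; b, x &lt;= y, ∀x&lt;∞</pre>
```

The escaped string could then be passed to `etree.fromstring` with the same parser as in the reproduction. Whether that avoids the truncation on the httpbin.org page is untested here.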