On Monday, January 25, 2016 18:11:12 Austin Platt wrote:
> Hello,
> Thanks for your response!
> Unfortunately the problem still remains with your modified code. I also tried passing the un-decoded bytes of the requests response (r.content) and the problem persisted.
> My suspicion is that it's to do with how libxml tries to parse the non-entitied `<` character as the start of a tag. Any other ideas?
Sorry, I pretty much misread your mail - you don't run into any parser exception but the parsed and re-serialized content isn't what you expect. I agree that the "non-entitied" '<' character seems to be the problem (which probably means the source document is actually broken HTML). Looks like you could still make it work with the help of BeautifulSoup:
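    import requests
    from lxml import etree
    import lxml.html.soupparser

    # Fetch the page; resp.text is the decoded (unicode) body.
    resp = requests.get("http://httpbin.org/encoding/utf8")
    # Let BeautifulSoup do the lenient parsing (here with Python's built-in
    # html.parser backend) and build an lxml tree from the result.
    root = lxml.html.soupparser.fromstring(resp.text, features='html.parser')
    print(etree.tostring(root, encoding='utf-8'))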
At a glance, the output looks pretty much like the original apart from some HTML sanitizing, namely the use of character entities and proper (root) elements (I haven't compared it character by character).
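To illustrate what "using character entities" means for such a comparison, here is a minimal standard-library sketch. It is independent of lxml and BeautifulSoup, and the sample string is made up for illustration. It only shows that the entitied and non-entitied forms of '<' carry the same text, so a difference of that kind in the serialized output is not a change in content.

```python
import html

# Hypothetical sample text with a bare '<' that is not the start of a tag.
raw = "if a < b: swap & continue"

# "Entitied" form: markup-significant characters become character entities.
escaped = html.escape(raw, quote=False)
print(escaped)

# Decoding the entities gives back the original text, so the two forms
# differ only in how they are serialized.
roundtrip = html.unescape(escaped)
print(roundtrip == raw)
```

When comparing the re-serialized document with the source, decoding entities on both sides first (for example with html.unescape) should make differences of this kind drop out.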
Parsing will probably be way slower than through libxml2's HTML parser, though.

Holger