Lxml fails to parse httpbin.org example utf-8 page
Hello all,

I was doing some tests with lxml and decided to try it out on the test response pages of httpbin.org. lxml fails to 'correctly' parse the UTF-8 example page supplied by httpbin.org. The page can be found here: http://httpbin.org/encoding/utf8.

Here is a reproduction of the case:

    > import requests
    > r = requests.get("http://httpbin.org/encoding/utf8")
    > html = r.text
    > print(html)
    [...]

    > from lxml import etree
    > etree_parser = etree.HTMLParser(encoding='utf-8')
    > tree = etree.fromstring(html, parser=etree_parser)
    > new_html = etree.tostring(tree, method='html', encoding='utf-8')
    > print(new_html)
    [...]

The new_html is truncated after a `<` character in the `pre` tag of the original response. I presume this is because lxml attempts to interpret the `<` character as the start of an HTML tag.

Does lxml have any heuristics for deciding whether to interpret a lone `<` character as a text character as opposed to an HTML tag initiator?

Cheers,
Austin
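To make the symptom easier to reproduce without a network call, here is a minimal sketch that round-trips two tiny documents through the same parser setup. The sample strings and the `roundtrip` helper are made up for illustration and are not taken from the httpbin page. Only the escaped variant has a predictable result; what happens to the bare `<` depends on the libxml2 version behind lxml, so the sketch just prints it.

```python
from lxml import etree

# Hypothetical sample documents: one with a bare '<', one with it escaped.
raw = b"<html><body><pre>a < b</pre></body></html>"
escaped = b"<html><body><pre>a &lt; b</pre></body></html>"


def roundtrip(data):
    """Parse bytes with lxml's HTML parser and re-serialize the <pre> element."""
    parser = etree.HTMLParser(encoding="utf-8")
    root = etree.fromstring(data, parser=parser)
    pre = root.find(".//pre")
    serialized = etree.tostring(pre, method="html", encoding="unicode",
                                with_tail=False)
    return pre.text, serialized


# The escaped form keeps the character as text and serializes it as an entity.
escaped_text, escaped_html = roundtrip(escaped)
print(repr(escaped_text), escaped_html)

# The bare form is parser-dependent, so no particular result is claimed here.
raw_text, raw_html = roundtrip(raw)
print(repr(raw_text), raw_html)
```

If the second line of output has lost part of the text, the parser treated the bare `<` as the start of a tag.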
I suspect you may suffer from requests decoding the HTML resource text to unicode, which you then feed into the lxml parser; see http://lxml.de/FAQ.html#why-can-t-lxml-parse-my-xml-from-unicode-strings

So (untested) you should probably rather do

    from lxml import etree
    html_parser = etree.HTMLParser()
    tree = etree.parse("http://httpbin.org/encoding/utf8", parser=html_parser)

Maybe you'd even want to use the lxml.html module, which also has a parse function. I guess it'll handle the parser stuff behind the scenes.

Best regards,
Holger
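The FAQ entry linked above concerns decoded strings that still carry an encoding declaration. As a rough illustration of that point only (it is not a fix for the truncation reported in this thread), the sketch below feeds the same made-up XML snippet to lxml once as bytes and once as a decoded string. The snippet and variable names are assumptions for the example.

```python
from lxml import etree

# Hypothetical document with an encoding declaration, held as UTF-8 bytes.
xml_bytes = '<?xml version="1.0" encoding="utf-8"?><root>\u00e9</root>'.encode("utf-8")

# Bytes input: the parser reads the declaration and decodes the data itself.
root = etree.fromstring(xml_bytes)
print(root.text)

# Decoded input: the declaration no longer matches the in-memory string,
# which is the situation the FAQ entry describes. Record any error
# instead of assuming one is raised.
unicode_error = None
try:
    etree.fromstring(xml_bytes.decode("utf-8"))
except ValueError as exc:
    unicode_error = exc
print("error for decoded input:", unicode_error)
```

With requests, the equivalent of the bytes case would be passing `r.content` instead of `r.text`.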
Hello,

Thanks for your response!

Unfortunately the problem still remains with your modified code. I also tried passing the un-decoded bytes of the requests response (r.content) and the problem persisted.

My suspicion is that it's to do with how libxml tries to parse the non-entitied `<` character as the start of a tag. Any other ideas?

Thanks!

On Mon, 25 Jan 2016 at 15:38 Holger Joukl <Holger.Joukl@lbbw.de> wrote:
> [...]
_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml@lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml
On Monday, January 25, 2016 18:11:12 Austin Platt wrote:
> Hello,
>
> Thanks for your response!
>
> Unfortunately the problem still remains with your modified code. I also tried passing the un-decoded bytes of the requests response (r.content) and the problem persisted.
>
> My suspicion is that it's to do with how libxml tries to parse the non-entitied `<` character as the start of a tag. Any other ideas?
Sorry, I pretty much misread your mail: you don't run into any parser exception, but the parsed and re-serialized content isn't what you expect.

I agree that the "non-entitied" '<' character seems to be the problem (which probably means the source document is actually broken HTML).

Looks like you could still make it work with the help of BeautifulSoup:
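    import requests
    from lxml import etree
    import lxml.html.soupparser

    resp = requests.get("http://httpbin.org/encoding/utf8")
    # Let BeautifulSoup do the lenient parsing, then build an lxml tree from it
    root = lxml.html.soupparser.fromstring(resp.text, features='html.parser')
    # print() call form works on both Python 2 and Python 3
    print(etree.tostring(root, encoding='utf-8'))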
From glancing at it this looks pretty much like the original apart from some HTML-sanitizing, namely using character entities and proper (root) elements (I haven't properly compared characters).
Parsing will probably be way slower than through libxml2's HTML parser, though.

Holger
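If the BeautifulSoup route is too slow, another option (not proposed in this thread, so treat it as an assumption) is to escape stray `<` characters before the data reaches libxml2. The sketch below uses a made-up regular expression, `LONE_LT`, that treats a `<` as text unless it is followed by a letter, `/`, `!` or `?`. It is a blunt heuristic: it would also rewrite a `<` inside script, style or comment content, and the sample document is invented for the example.

```python
import re

from lxml import etree

# Hypothetical heuristic: a '<' that is not followed by a letter, '/', '!'
# or '?' cannot open a tag, end tag, comment/doctype or processing
# instruction, so treat it as literal text.
LONE_LT = re.compile(rb"<(?![A-Za-z/!?])")


def escape_lone_lt(html_bytes):
    """Replace stray '<' characters with the &lt; entity (bytes in, bytes out)."""
    return LONE_LT.sub(b"&lt;", html_bytes)


# Made-up sample with a bare '<' inside a <pre> element.
sample = "<html><body><pre>\u22a5 < a \u2260 b</pre></body></html>".encode("utf-8")

fixed = escape_lone_lt(sample)
root = etree.fromstring(fixed, parser=etree.HTMLParser(encoding="utf-8"))
pre_text = root.find(".//pre").text
print(pre_text)
```

Working on bytes keeps the pre-pass independent of the document's encoding, as long as that encoding is ASCII-compatible like UTF-8.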
participants (3)
- Austin Platt
- Holger Joukl
- Holger Joukl