Lxml fails to parse httpbin.org example utf-8 page - lxml - The Python XML Toolkit

25 Jan 2016

Hello all,

I was doing some tests with lxml and decided to try it out on the test
response pages of httpbin.org.

Lxml fails to 'correctly' parse the example UTF-8 page supplied by
httpbin.org. The page can be found here: http://httpbin.org/encoding/utf8.

Here is a reproduction of the case:

    > import requests
    > r = requests.get("http://httpbin.org/encoding/utf8")
    > html = r.text
    > print(html)
    [...]

    > from lxml import etree
    > etree_parser = etree.HTMLParser(encoding='utf-8')
    > tree = etree.fromstring(html, parser=etree_parser)
    > new_html = etree.tostring(tree, method='html', encoding='utf-8')
    > print(new_html)
    [...]

The resulting new_html is truncated after a `<` character inside the `pre`
tag of the original response. I presume this is because lxml attempts to
interpret the `<` character as the start of an HTML tag.

Does lxml have any heuristics for deciding whether to interpret a lone `<`
character as text rather than as the start of an HTML tag?
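To make concrete the kind of heuristic I have in mind, here is a minimal sketch that escapes stray `<` characters before the markup is handed to a parser. It assumes a `<` can only open markup when it is followed by a letter, `/`, `!` or `?`. The helper name `escape_lone_lt` is hypothetical and not part of lxml, and I am not claiming lxml or libxml2 works this way internally.

```python
import re

# Assumption: a "<" only starts markup when followed by a letter (start tag),
# "/" (end tag), "!" (comment/doctype) or "?" (processing instruction).
# Anything else is treated as literal text.
_LONE_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_lone_lt(markup):
    """Hypothetical helper: replace stray '<' characters with '&lt;'."""
    return _LONE_LT.sub("&lt;", markup)


sample = "<pre>1 < 2 <= 3</pre><p>ok</p>"
escaped = escape_lone_lt(sample)
print(escaped)
```

The escaped string could then be passed to `etree.fromstring` as in the reproduction above. One limitation of this sketch: a `<` directly followed by a letter (for example `a<b`) still looks like a tag and is left untouched.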

Cheers
Austin

Austin Platt
