Re: [lxml-dev] html entities and lxml.html.ElementSoup

14 May 2008

Hi Viksit,

What you typed was correct, except that
lxml.html.soupparser.convert_tree(soup) returns a *list* of root
elements, so you can't call lxml.etree.tostring() on the list directly.
Depending on your HTML, choosing the first element will probably work.
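
A minimal sketch of that work-around, assuming a current lxml with BeautifulSoup 4 installed (the thread itself predates both, and bs4 no longer accepts convertEntities because it always decodes entities). The names first_root and soup_to_string are hypothetical helpers for illustration, not part of lxml:

```python
def first_root(roots):
    """Return the first root element from a convert_tree() result."""
    if not roots:
        raise ValueError("convert_tree() returned no root elements")
    return roots[0]


def soup_to_string(doc):
    """Parse doc with BeautifulSoup and serialise the first root via lxml."""
    # Imported lazily so first_root() stays usable without these packages.
    from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4, not version 3
    import lxml.etree
    from lxml.html import soupparser

    # bs4 decodes entity references into characters while parsing.
    soup = BeautifulSoup(doc, "html.parser")
    # convert_tree() returns a list of root elements, not a single element.
    roots = soupparser.convert_tree(soup)
    return lxml.etree.tostring(first_root(roots), encoding="unicode")
```

Calling soup_to_string() on a small HTML fragment should give back the serialised first root. For soup with several top-level elements, iterating over the whole list is likely the safer choice.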

I have moved to the trunk now, so I am working happily with the new
lxml.html.soupparser.  But if you're stuck on that branch, then that
work-around worked for me.  Hope it works for you!
Cheers,
-Roger

2008/5/14 Viksit Gaur <viksit@aya.yale.edu>:
...
Hi there,
...
Roger Patterson wrote:
...
I'm getting an interesting situation.  When using the very cool
ElementSoup add-on to lxml.html with certain source HTML files that
already encode entities (e.g. £), ElementSoup.parse() messes up
the entities.

I'm running into the same problem.
...
It looks like it's not the parse(), but rather the serialisation. What
happens is that the entity references end up in the /text/ content,
which is clearly wrong as it leads to re-escaping of the references on
the way out.
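
The re-escaping described here can be reproduced with the standard library alone. The sketch below is only an illustration: it uses xml.etree.ElementTree and html.unescape rather than lxml or ElementSoup, and serialise_paragraph is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from html import unescape


def serialise_paragraph(text):
    """Build <p>text</p> and serialise it to a string."""
    p = ET.Element("p")
    p.text = text
    return ET.tostring(p, encoding="unicode")


# Entity reference left in the text content: the '&' is escaped again on output.
escaped_twice = serialise_paragraph("&pound;5")

# Entity decoded to a character before it enters the tree: nothing to re-escape.
decoded_first = serialise_paragraph(unescape("&pound;5"))
```

Here escaped_twice contains &amp;pound; while decoded_first carries the pound sign as a plain character, which is why decoding entities before building the tree avoids the problem.
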
...
...
What I'm currently doing to solve this is first parsing it with
BeautifulSoup(html, convertEntities="html"), then calling
ElementSoup.convert_tree(soup).  This work-around works fine, but I
thought I'd bring it to your attention.
Did you mean something like this:
soup = BeautifulSoup(doc, convertEntities="html")
root = lxml.html.soupparser.convert_tree(soup)
Because I get an error of the form:
File "lxml.etree.pyx", line 2491, in lxml.etree.tostring
(src/lxml/lxml.etree.c:21792)
TypeError: Type 'list' cannot be serialized.
...
ElementSoup should do that for you. I fixed it on the trunk.
...
Stefan
Unfortunately, I can't switch to lxml trunk. Would it be possible for you to
point me to the code change in lxml so I can patch it myself?
Thanks and Cheers,
Viksit
