Re: [lxml-dev] html entities and lxml.html.ElementSoup
Hi Viksit,

What you typed was correct, except you have to note that lxml.html.soupparser.convert_tree(soup) returns a *list* of root elements, so you can't just call lxml.etree.tostring() on the list. Depending on your HTML, choosing the first element will probably work.

I have moved to the trunk now, so am working well with the new lxml.html.soupparser. But if you're stuck on that branch, then that work-around worked for me.

Hope it works for you!

cheers
-Roger

2008/5/14 Viksit Gaur <viksit@aya.yale.edu>:
Hi there,
Roger Patterson wrote:
I'm getting an interesting situation. When using the very cool ElementSoup add-on to lxml.html with certain source-html files that already encode entities (e.g. &pound;), using the ElementSoup.parse() messes up the entities.
I'm running into the same problem.
It looks like it's not the parse(), but rather the serialisation. What
happens is that the entity references end up in the /text/ content, which is clearly wrong as it leads to re-escaping of the references on the way out.
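To make the re-escaping concrete, here is a minimal sketch using only the Python standard library. It is an illustration under assumptions, not what ElementSoup does internally: xml.etree.ElementTree stands in for lxml.etree, and html.unescape stands in for the entity conversion that BeautifulSoup performs with convertEntities="html".

```python
import html
import xml.etree.ElementTree as ET  # stand-in for lxml.etree (assumption)

raw = "&pound;5"  # an entity reference as it appears in the source HTML

# Case 1: the unresolved reference ends up verbatim in the text content.
# The serializer must escape the "&", so the reference is escaped twice.
broken = ET.Element("p")
broken.text = raw
broken_out = ET.tostring(broken, encoding="unicode")

# Case 2: the reference is resolved to the real character before it is
# stored, so the serializer has nothing left to re-escape.
fixed = ET.Element("p")
fixed.text = html.unescape(raw)
fixed_out = ET.tostring(fixed, encoding="unicode")

print(broken_out)  # <p>&amp;pound;5</p>
print(fixed_out)   # <p>£5</p>
```

The first output is the symptom described in this thread; the second is roughly what the work-around of converting entities before building the tree achieves.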
What I'm currently doing to solve this is first parsing it with BeautifulSoup(html, convertEntities="html"), then calling ElementSoup.convert_tree(soup). This work-around works fine, but I thought I'd bring it to your attention.
Did you mean something of the sort,
    soup = BeautifulSoup(doc, convertEntities="html")
    root = lxml.html.soupparser.convert_tree(soup)
Because I get an error of the form:
    File "lxml.etree.pyx", line 2491, in lxml.etree.tostring (src/lxml/lxml.etree.c:21792)
    TypeError: Type 'list' cannot be serialized.
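As Roger's reply at the top of the thread explains, the error comes from passing the whole list returned by convert_tree() to tostring(). A minimal sketch of the two obvious ways around it follows. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml.etree, and a hand-built list in place of the real convert_tree() result; serialize_roots is a hypothetical helper name, not part of lxml.

```python
import xml.etree.ElementTree as ET  # stand-in for lxml.etree (assumption)


def serialize_roots(roots):
    """Serialize every root element in the list and join the results.

    Hypothetical helper: with lxml you would call lxml.etree.tostring()
    on each element in the same way.
    """
    return "".join(ET.tostring(root, encoding="unicode") for root in roots)


# Stand-in for the list of root elements that convert_tree() returns.
roots = [ET.fromstring("<p>one</p>"), ET.fromstring("<p>two</p>")]

# Option 1: serialize only the first root, as suggested in the reply.
first = ET.tostring(roots[0], encoding="unicode")

# Option 2: serialize all roots, in case the document has several.
everything = serialize_roots(roots)

print(first)       # <p>one</p>
print(everything)  # <p>one</p><p>two</p>
```

Taking the first element is enough when the HTML has a single top-level element; joining all of them avoids silently dropping content otherwise.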
ElementSoup should do that for you. I fixed it on the trunk.
Stefan
Unfortunately, I can't switch to lxml trunk. Would it be possible for you to point me to the code change in lxml so I can patch it myself?
Thanks and Cheers, Viksit