Re: [lxml] Parsing HTML files with HTML entities

On 20 Apr 2011, at 18:51, Stefan Behnel wrote:
Henrik, 20.04.2011 16:05:
Hello list,
I've searched around but can't find an answer on this.
The problem is that if I parse some HTML in which certain characters have been converted to HTML entities, e.g. &ouml;, they are stripped away.
E.g. <h1>Bj&ouml;rn</h1> becomes <h1>Bjrn</h1>
I'm using lxml 2.3 on Mac OS X 10.6
The parser is set up like this:
parser = html.XHTMLParser(recover=True, ns_clean=True, remove_blank_text=True, resolve_entities=False)
Using "recover=True" with an XML parser has a clear smell to me. Fix your data instead.
Regarding XHTML entity references, consider reading the documentation:
http://lxml.de/api/lxml.html.XHTMLParser-class.html
Stefan
I can see the problem. I'm using lxml to manipulate my HTML document, so validation is not so important. It would therefore be great if lxml did not automatically strip all HTML entities when no DTD is loaded. But I can see that the solution for now is to specify a DTD that defines all the HTML entities.

But how can I specify a DTD and use lxml.html.fragment_fromstring()? When I parse a fragment, the fragment does not have a doctype declared; only my full HTML document has that. And I can't see a way of specifying a DTD in the parser options.

Suggestions?

Thanks for all your help!
//Henrik
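
One possible workaround, which nobody in this thread proposed, is to avoid the DTD question altogether: rewrite the HTML named entities as numeric character references before the fragment reaches an XML parser, since numeric references need no DTD. The sketch below is only an illustration under that assumption. The helper name named_to_numeric is hypothetical, the entity table comes from Python's standard html.entities module, and the standard-library ElementTree parser stands in for an lxml XML parser. It does not special-case CDATA sections or script content.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Entities that every XML parser understands without a DTD; leave them alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches a named entity reference such as &ouml; or &nbsp;
_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(markup):
    """Replace HTML named entities (&ouml;) with numeric references (&#246;).

    Hypothetical helper for illustration; not part of lxml.
    """
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Keep XML-native and unknown entities exactly as written.
            return match.group(0)
        return "&#%d;" % name2codepoint[name]

    return _ENTITY_RE.sub(replace, markup)


fragment = "<h1>Bj&ouml;rn &amp; co</h1>"
converted = named_to_numeric(fragment)
print(converted)

# A plain XML parser can now read the fragment without any DTD.
heading = ET.fromstring(converted)
print(heading.text)
```

Because only the entity references are rewritten, the rest of the fragment is passed through unchanged, and the five XML predefined entities keep their escaped form.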