Mailman 3 Re: [lxml] Parsing HTML files with HTML entities - lxml - The Python XML Toolkit

April 20, 2011

Henrik, 20.04.2011 16:05:
> Hello list,
> I've searched around but can't find an answer on this.
> The problem is that if I parse some HTML in which certain characters have been converted to HTML entities, e.g. &ouml;, they are stripped away.
> E.g. <h1>Bj&ouml;rn</h1> becomes <h1>Bjrn</h1>
> I'm using lxml 2.3 on Mac OS X 10.6.
> The parser is set up like this:
> parser = html.XHTMLParser(recover=True, ns_clean=True, remove_blank_text=True, resolve_entities=False)

Using "recover=True" with an XML parser has a clear smell to me. Fix your
data instead.

Regarding XHTML entity references, consider reading the documentation:

http://lxml.de/api/lxml.html.XHTMLParser-class.html

Stefan

Stefan Behnel
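
The class documentation linked in the reply notes that XHTMLParser only knows the HTML entities when it loads a DTD that declares them; it suggests creating the parser with load_dtd=True and having the XHTML DTDs installed in the XML catalog. Where that is not practical, one possible workaround (not something proposed in this thread) is to turn named HTML entities into numeric character references before parsing, since any XML parser understands those without a DTD. The sketch below is a minimal illustration under that assumption. The helper name `named_to_numeric` is hypothetical, and the standard library's `xml.etree.ElementTree` stands in for the lxml XML parser so that the example has no third-party dependencies.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities that XML itself predefines; these must stay untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches named entity references such as &ouml; (not numeric ones like &#246;).
_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(markup):
    """Replace named HTML entities (e.g. &ouml;) with numeric references."""

    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Leave XML-native and unknown entities exactly as written.
            return match.group(0)
        return "&#%d;" % name2codepoint[name]

    return _ENTITY_RE.sub(replace, markup)


source = "<h1>Bj&ouml;rn</h1>"

# Without a DTD, a strict XML parser does not know what &ouml; means.
try:
    ET.fromstring(source)
    undefined_entity_rejected = False
except ET.ParseError:
    undefined_entity_rejected = True

converted = named_to_numeric(source)
heading = ET.fromstring(converted)

print(undefined_entity_rejected)
print(converted)
print(heading.text)
```

The converted string is ordinary XML, so it should be usable with other XML parsers as well. Note that the regular expression does not distinguish CDATA sections or comments from regular text, so this is only suitable for input where entity-like sequences in those places do not matter.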
