Re: [lxml] Parsing HTML files with HTML entities
April 20, 2011
10:49 p.m.
Henrik, 20.04.2011 16:05:
Hello list,
I've searched around but can't find an answer on this.
The problem is that if I parse some HTML in which certain characters have been converted to HTML entities, e.g. &ouml;, they are stripped away.
For example, <h1>Bj&ouml;rn</h1> becomes <h1>Bjrn</h1>
I'm using lxml 2.3 on Mac OS X 10.6.
The parser is set up like this:
parser = html.XHTMLParser(recover=True, ns_clean=True, remove_blank_text=True, resolve_entities=False)
Using "recover=True" with an XML parser has a clear smell to me. Fix your data instead.

Regarding XHTML entity references, consider reading the documentation:

http://lxml.de/api/lxml.html.XHTMLParser-class.html

Stefan
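
For illustration, here is a minimal sketch of the underlying point: an XML parser only knows named entities such as &ouml; if a DTD declares them, while the HTML parser has them built in. The snippet string and the inline DOCTYPE are made up for this example and are not taken from Henrik's data. The linked documentation describes XHTMLParser(load_dtd=True) for the XHTML case, which needs the XHTML DTDs installed in your catalogs, so that variant is left out of the sketch.

```python
from lxml import etree, html

# Hypothetical input, modelled on the heading from the question.
snippet = "<h1>Bj&ouml;rn</h1>"

# The HTML parser knows the HTML named entities without any DTD.
h1 = html.fromstring(snippet)
html_text = h1.text

# An XML parser needs the entity to be declared. Here it is declared in an
# internal DTD subset (an assumption for this example; real XHTML would
# normally rely on the XHTML DTDs instead).
xml_doc = '<!DOCTYPE h1 [<!ENTITY ouml "&#246;">]>' + snippet
root = etree.fromstring(xml_doc)
xml_text = root.text

print(html_text)
print(xml_text)
```

Both calls should yield the heading text with the umlaut intact. If the input is really HTML rather than well-formed XHTML, using the HTML parser is probably the simpler route.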
Stefan Behnel