[lxml] Parsing HTML files with HTML entities

April 20, 2011

Hello list,

I've searched around but can't find an answer on this.

The problem is that when I parse HTML in which certain characters have been converted to HTML entities, e.g. ö, those characters are stripped away.

For example, <h1>Björn</h1> becomes <h1>Bjrn</h1>

I'm using lxml 2.3 on Mac OS X 10.6.

The parser is set up like this:

parser = html.XHTMLParser(recover=True, ns_clean=True, remove_blank_text=True, resolve_entities=False)
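One possible workaround, as a rough sketch only: this assumes the input carries named entities such as &ouml;, and that the XML-based parser drops them because it does not know them without a DTD. Under that assumption, the named entities could be rewritten as numeric character references before parsing, since those are valid in plain XML. The helper name named_entities_to_numeric is hypothetical, not part of lxml, and it only uses the Python 3 standard library.

```python
import re
from html.entities import name2codepoint

# Entities that XML already defines; these are left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches named entity references such as &ouml; (not numeric ones like &#246;).
_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_entities_to_numeric(markup):
    """Replace named HTML entities with numeric character references.

    Hypothetical helper: unknown names and the XML predefined
    entities are returned unchanged.
    """
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return "&#%d;" % name2codepoint[name]

    return _ENTITY_RE.sub(repl, markup)


print(named_entities_to_numeric("<h1>Bj&ouml;rn</h1>"))
# prints: <h1>Bj&#246;rn</h1>
```

The rewritten string could then be handed to the parser above in place of the raw input. Whether this actually avoids the stripping in lxml 2.3 is untested here.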

//Henrik