April 20, 2011
7:05 a.m.
Hello list,

I've searched around but can't find an answer to this. The problem is that if I parse some HTML in which certain characters have been converted to HTML entities, e.g. &ouml;, those characters are stripped away.

E.g. <h1>Bj&ouml;rn</h1> becomes <h1>Bjrn</h1>

I'm using lxml 2.3 on Mac OS X 10.6.

The parser is set up like this:

parser = html.XHTMLParser(recover=True, ns_clean=True, remove_blank_text=True, resolve_entities=False)

//Henrik
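
For context, XML itself predefines only five entities (lt, gt, amp, quot, apos), so a named HTML entity is undefined to an XML parser unless a DTD declares it. That may be related to what is described above, though the post does not confirm the cause. Below is a minimal sketch of one possible workaround, assuming the character was written as the named entity &ouml; and assuming Python 3: rewrite named HTML entities as numeric character references before parsing. The helper name named_to_numeric and the sample input are hypothetical, and the standard library's xml.etree.ElementTree stands in for lxml here so the sketch does not depend on lxml's behaviour.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities XML predefines; these are left untouched.
XML_PREDEFINED = {"lt", "gt", "amp", "quot", "apos"}

# Matches named entity references such as &ouml; (not numeric ones like &#246;).
_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(markup):
    """Replace named HTML entities with numeric character references.

    Hypothetical helper: names that XML predefines, or that are not
    known HTML entities, are returned unchanged.
    """
    def _replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return "&#%d;" % name2codepoint[name]

    return _ENTITY_RE.sub(_replace, markup)


# Hypothetical sample input, assuming the entity form &ouml;.
sample = "<h1>Bj&ouml;rn &amp; co</h1>"
converted = named_to_numeric(sample)

# A plain XML parser accepts numeric references without a DTD.
heading = ET.fromstring(converted)
print(converted)
print(heading.text)
```

The same pre-processing step could be applied to the input before it is handed to the parser configured above; whether that parser then keeps the character is untested here.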