Parsing HTML files with HTML entities
Hello list,

I've searched around but can't find an answer to this. The problem is that if I parse some HTML in which certain characters have been converted to HTML entities, e.g. &ouml;, they are stripped away.

For example, <h1>Bj&ouml;rn</h1> becomes <h1>Bjrn</h1>.

I'm using lxml 2.3 on Mac OS X 10.6.

The parser is set up like this:

parser = html.XHTMLParser(recover=True, ns_clean=True, remove_blank_text=True, resolve_entities=False)

//Henrik
Hi,
> The problem is that if I parse some HTML in which certain characters have been converted to HTML entities, e.g. &ouml;, they are stripped away.
> For example, <h1>Bj&ouml;rn</h1> becomes <h1>Bjrn</h1>.
> I'm using lxml 2.3 on Mac OS X 10.6.
> The parser is set up like this:
> parser = html.XHTMLParser(recover=True, ns_clean=True, remove_blank_text=True, resolve_entities=False)
I'd say your document is not valid XHTML (= XML), which you can see if you switch off the parser's recover option:
>>> from StringIO import StringIO
>>> from lxml import html
>>> htmldoc = "<h1>Bj&ouml;rn</h1>"
>>> parser = html.XHTMLParser(recover=False, ns_clean=True, remove_blank_text=True, resolve_entities=False)
>>> html.parse(StringIO(htmldoc), parser=parser)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "build/bdist.solaris-2.8-sun4u/sunpkg/lib/python2.4/site-packages/lxml/html/__init__.py", line 661, in parse
  File "lxml.etree.pyx", line 2706, in lxml.etree.parse (src/lxml/lxml.etree.c:49958)
  File "parser.pxi", line 1517, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71973)
  File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:72245)
  File "parser.pxi", line 1424, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:71106)
  File "parser.pxi", line 938, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67875)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:64257)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:65178)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64521)
lxml.etree.XMLSyntaxError: Entity 'ouml' not defined, line 1, column 13
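
The rule behind that error is not specific to lxml: XML predefines only five named entities (amp, lt, gt, apos, quot), so &ouml; is undefined unless a DTD declares it. As a minimal sketch, assuming Python 3 and using the standard library's xml.etree.ElementTree instead of lxml (a swapped-in parser, chosen only so the snippet runs without extra packages), the same markup is rejected while the equivalent numeric character reference is accepted. The helper name parse_h1 is made up for this illustration.

```python
import xml.etree.ElementTree as ET


def parse_h1(markup):
    """Return the root element's text, or None if the markup is not well-formed XML.

    parse_h1 is a hypothetical helper for this example, not part of any library.
    """
    try:
        return ET.fromstring(markup).text
    except ET.ParseError:
        # Raised here for the undefined named entity &ouml;
        return None


# Named HTML entity: not defined in plain XML, so strict parsing fails.
named = parse_h1("<h1>Bj&ouml;rn</h1>")

# Numeric character reference for the same character (U+00F6): always valid XML.
numeric = parse_h1("<h1>Bj&#246;rn</h1>")

print(repr(named), repr(numeric))
```

For real-world HTML that uses named entities, lxml's HTML parser (for example lxml.html.fromstring) is usually the better fit, since it recognises the HTML entity set rather than only the XML one.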
Holger