You might be bitten by the behaviour described in this bug report:
https://bugs.launchpad.net/lxml/+bug/1002581
Maybe the workarounds sketched there are of some help for you.
[...]
I need to use text_content() besides just 'text'. But text_content() does not exist in etree. What is the substitute for text_content() in etree?
$ cat main.py
#!/usr/bin/env python
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:

import sys
from lxml import etree
tree = etree.parse(sys.stdin, parser=etree.HTMLParser(encoding='utf-8'))
print(tree.xpath('//div')[0].text)
print(tree.xpath('//div')[0].text_content())
$ cat main.sh
#!/usr/bin/env bash
# vim: set noexpandtab tabstop=2:

./main.py <
I suspect that you can't simply use the XML parser in the more general HTML case unless you can be sure the HTML is also well-formed XML (or can make sure of that by cleaning it up first). It really depends on your data.

Have you tried the workarounds described in the bug report above? Namely:

"
[...]
Note that you can work around this by either:
- Having <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> in the HTML document, or
- Using lxml.etree with an lxml.etree.HTMLParser object, passing encoding='utf-8' to the HTMLParser constructor.
[...]
"

This allows you to do something like this:

import sys
from lxml import html
parser = html.HTMLParser(encoding='utf-8')
doc = html.parse(sys.stdin, parser=parser)
print(doc.xpath('//div')[0].text)
print(doc.xpath('//div')[0].text_content())

Holger
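
If you would rather stay with plain lxml.etree, here is a minimal sketch of two ways to get roughly what text_content() gives you. It is not taken from the bug report. The sample markup and the helper name text_content below are made up for illustration. For ordinary markup, joining itertext() and evaluating the XPath expression string() should both return all text inside the element, while .text only returns the text before the first child.

```python
import io
from lxml import etree

# Hypothetical sample input: UTF-8 bytes without any charset declaration.
HTML_BYTES = '<html><body><div>Grüße <b>aus</b> Köln</div></body></html>'.encode('utf-8')


def text_content(element):
    """Illustrative helper: join all text nodes found inside element."""
    return ''.join(element.itertext())


# Pass the encoding explicitly so the parser does not have to guess it.
parser = etree.HTMLParser(encoding='utf-8')
tree = etree.parse(io.BytesIO(HTML_BYTES), parser=parser)
div = tree.xpath('//div')[0]

print(div.text)               # only the text before the first child element
print(text_content(div))      # all text inside the div, children included
print(div.xpath('string()'))  # XPath string value of the div
```

The XPath variant needs no helper function; the itertext() variant is handy if you want to filter or strip the individual text pieces before joining them.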