Hi,

Re. text() vs text_content(): have you investigated including text() as part of the XPath expression?

Re. python3 vs python2: sorry it didn't work out; it was just a suggestion of a path to follow.

Best,
/PA

On 14 February 2018 at 15:40, Peng Yu <pengyu.ut@gmail.com> wrote:
You might be bitten by the behaviour described in this bug report:
https://bugs.launchpad.net/lxml/+bug/1002581
Maybe the workarounds sketched there are of some help for you.
It looks like libxml2 handles encodings differently for XML and HTML parsing, e.g. it makes different default encoding assumptions (which also depend on iconv support in your environment).
You can see this if you try etree.parse() instead of html.parse(), which works for this simple example as the HTML happens to be well-formed XML:
$ cat main_etree.py
import sys
from lxml import html, etree
doc = etree.parse(sys.stdin)
print doc.xpath('//div')[0].text
$ python2.7 main_etree.py < main.html
NT-PGC-1α
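To make the XML-versus-HTML difference concrete, here is a minimal sketch in Python 3 (the session above uses Python 2.7). The sample bytes and variable names are illustrative assumptions, not taken from the thread's main.html. It parses the same UTF-8 bytes once with the XML parser and once with an HTML parser that is told the encoding explicitly, which sidesteps whatever default the HTML parser would otherwise assume.

```python
from lxml import etree

# Hypothetical sample input: UTF-8 bytes with no encoding declaration.
data = "<html><body><div>NT-PGC-1α</div></body></html>".encode("utf-8")

# XML parser: input without an encoding declaration is treated as UTF-8.
xml_root = etree.fromstring(data)
xml_text = xml_root.xpath("//div")[0].text

# HTML parser with the encoding passed explicitly, so nothing is guessed.
html_parser = etree.HTMLParser(encoding="utf-8")
html_root = etree.fromstring(data, parser=html_parser)
html_text = html_root.xpath("//div")[0].text

print(xml_text)
print(html_text)
```

If the HTML parser is created without `encoding=`, the result may differ depending on the libxml2 build, which is the behaviour the bug report discusses.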
I need to use text_content() besides just 'text'. But text_content() does not exist in etree. What is the substitute for text_content() in etree?
$ cat main.py
#!/usr/bin/env python
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:

import sys
from lxml import etree
tree = etree.parse(sys.stdin, parser=etree.HTMLParser(encoding='utf-8'))
print(tree.xpath('//div')[0].text)
print(tree.xpath('//div')[0].text_content())

$ cat main.sh
#!/usr/bin/env bash
# vim: set noexpandtab tabstop=2:

./main.py <<EOF
<html><body><div>α</div></body></html>
EOF
$ ./main.sh
α
Traceback (most recent call last):
  File "./main.py", line 8, in <module>
    print(tree.xpath('//div')[0].text_content())
AttributeError: 'lxml.etree._Element' object has no attribute 'text_content'
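For reference, a few possible substitutes for text_content() on a plain etree element are sketched below in Python 3. The sample markup and variable names are assumptions made for illustration. `itertext()` and the XPath `string()` function are standard lxml/XPath features, and for ordinary elements they should yield the same concatenated text that text_content() gives on lxml.html elements. The last option follows the suggestion of putting text() into the XPath expression itself.

```python
from lxml import etree

# Hypothetical sample input with nested markup inside the <div>.
data = "<html><body><div>α <b>beta</b> tail</div></body></html>".encode("utf-8")
root = etree.fromstring(data, parser=etree.HTMLParser(encoding="utf-8"))
div = root.xpath("//div")[0]

# .text only covers the text before the first child element.
first_text = div.text

# Option 1: join every text node below the element.
joined = "".join(div.itertext())

# Option 2: the XPath string() function returns the element's string value.
string_value = div.xpath("string()")

# Option 3: select the descendant text nodes in the XPath expression itself.
text_nodes = root.xpath("//div//text()")

print(repr(first_text))
print(joined)
print(string_value)
print(text_nodes)
```

Options 1 and 2 return a single string; option 3 returns a list of text nodes, which is convenient when the pieces need to be stripped or filtered before joining.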
--
Regards,
Peng
--
Questions are not there to be answered; questions are there to be asked.
Georg Kreisler