On 14 February 2018 at 15:40, Peng Yu <pengyu.ut@gmail.com> wrote:

> You might be bitten by the behaviour described in this bug report:
>
> https://bugs.launchpad.net/lxml/+bug/1002581
>
> Maybe the workarounds sketched there are of some help for you.
>
> It looks like libxml2 handles encodings differently for XML and HTML
> parsing, e.g. different default encoding assumptions
> (also depending on iconv support in your environment).
>
> You can see this if you try etree.parse() instead of html.parse(),
> which works for this simple example as the HTML happens to be well-formed
> XML:
>
> $ cat main_etree.py
> import sys
> from lxml import html, etree
> doc = etree.parse(sys.stdin)
> print doc.xpath('//div')[0].text
> $ python2.7 main_etree.py < main.html
> NT-PGC-1α

I need text_content() in addition to the plain 'text' attribute, but
text_content() does not exist on etree elements. What is the substitute
for text_content() in etree?

$ cat main.py
#!/usr/bin/env python
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:

import sys
from lxml import etree
tree = etree.parse(sys.stdin, parser=etree.HTMLParser(encoding='utf-8'))
print(tree.xpath('//div')[0].text)
print(tree.xpath('//div')[0].text_content())

$ cat main.sh
#!/usr/bin/env bash
# vim: set noexpandtab tabstop=2:

./main.py <<EOF
<html><body><div>α</div></body></html>
EOF
$ ./main.sh
α
Traceback (most recent call last):
  File "./main.py", line 8, in <module>
    print(tree.xpath('//div')[0].text_content())
AttributeError: 'lxml.etree._Element' object has no attribute 'text_content'
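
One possible substitute, sketched here under assumptions rather than taken from the thread: plain etree elements support XPath, and the XPath string() function returns the concatenated text of an element and all of its descendants. The helper name text_content() and the sample markup with a nested <b> element are made up for illustration, and the snippet assumes Python 3.

```python
from lxml import etree


def text_content(element):
    """Return the text of element and all its descendants as one string."""
    # XPath string() gives the string-value of the node: every descendant
    # text node joined in document order.
    return element.xpath('string()')


# Hypothetical sample input; parsed from bytes so that the parser's
# encoding='utf-8' setting is what decides how the input is decoded.
markup = '<html><body><div>NT-<b>PGC</b>-1α</div></body></html>'.encode('utf-8')
root = etree.fromstring(markup, parser=etree.HTMLParser(encoding='utf-8'))
div = root.xpath('//div')[0]

print(div.text)                 # only the text before the first child element
print(text_content(div))        # text of the whole subtree
print(''.join(div.itertext()))  # same result for this input, via itertext()
```

Another route would be to parse with lxml.html and its own HTMLParser(encoding='utf-8'), since the elements it returns do have text_content().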

--
Regards,
Peng

_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml@lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml
