Hi,

The following example shows that utf-8 characters are not maintained. (α becomes α)

Does anybody know how to fix the problem? Thanks.

$ cat main.py
#!/usr/bin/env python
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:

import sys
from lxml import html
doc = html.parse(sys.stdin)
print doc.xpath('//div')[0].text
print doc.xpath('//div')[0].text_content()

$ cat main.html
<html>
<body>
<div>NT-PGC-1α</div>
</body>
</html>

$ file main.html
main.html: HTML document text, UTF-8 Unicode text

$ ./main.py < main.html
α
α

--
Regards,
Peng
Have you tried with Python 3? lxml seems to have better UTF-8 support with Python 3. Also make sure that you call the script with LC_ALL=C. That should make the script run...

Best, /PA

On 14 February 2018 at 01:50, Peng Yu <pengyu.ut@gmail.com> wrote:
> The following example shows that utf-8 characters are not maintained. [...]

--
Questions are not there to be answered; questions are there to be asked.
Georg Kreisler

_________________________________________________________________
Mailing list for the lxml Python XML toolkit - http://lxml.de/
lxml@lxml.de
https://mailman-mail5.webfaction.com/listinfo/lxml
It does not work.

$ cat main.py
#!/usr/bin/env python3
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:

import sys
from lxml import html
doc = html.parse(sys.stdin)
print(doc.xpath('//div')[0].text)
print(doc.xpath('//div')[0].text_content())

$ LC_ALL=C python3 ./main.py < main.html
Traceback (most recent call last):
  File "./main.py", line 6, in <module>
    doc = html.parse(sys.stdin)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/lxml/html/__init__.py", line 940, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "src/lxml/etree.pyx", line 3444, in lxml.etree.parse (src/lxml/etree.c:83185)
  File "src/lxml/parser.pxi", line 1855, in lxml.etree._parseDocument (src/lxml/etree.c:121025)
  File "src/lxml/parser.pxi", line 1875, in lxml.etree._parseFilelikeDocument (src/lxml/etree.c:121308)
  File "src/lxml/parser.pxi", line 1770, in lxml.etree._parseDocFromFilelike (src/lxml/etree.c:120092)
  File "src/lxml/parser.pxi", line 1185, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/etree.c:114820)
  File "src/lxml/parser.pxi", line 598, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/etree.c:107738)
  File "src/lxml/parser.pxi", line 705, in lxml.etree._handleParseResult (src/lxml/etree.c:109406)
  File "src/lxml/etree.pyx", line 326, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/etree.c:13259)
  File "src/lxml/parser.pxi", line 380, in lxml.etree._FileReaderContext.copyToBuffer (src/lxml/etree.c:105164)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 25-26: surrogates not allowed

--
Regards,
Peng
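The traceback above is consistent with how Python 3 can treat undecodable bytes read from stdin. The following is a minimal sketch, not a diagnosis confirmed anywhere in the thread: it assumes that under LC_ALL=C Python 3.6 decodes stdin as ASCII with the surrogateescape error handler, so each non-ASCII byte becomes a lone surrogate that cannot later be encoded as UTF-8.

```python
# Sketch only. Assumption (not stated in the thread): with LC_ALL=C,
# Python 3.6 decodes stdin as ASCII using errors='surrogateescape'.

raw = 'α'.encode('utf-8')  # the two bytes b'\xce\xb1'

# Each undecodable byte is mapped to a lone surrogate code point.
decoded = raw.decode('ascii', errors='surrogateescape')

# Encoding those surrogates as UTF-8 fails, as in the traceback.
try:
    decoded.encode('utf-8')
    error_reason = None
except UnicodeEncodeError as exc:
    error_reason = exc.reason

# The original bytes are still recoverable with the same error handler.
roundtrip = decoded.encode('ascii', errors='surrogateescape')

print(repr(decoded), error_reason, roundtrip)
```

If that assumption holds, handing the parser bytes (for example sys.stdin.buffer rather than sys.stdin) would sidestep the text decoding step entirely.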
On 14/02/18 17:27, Peng Yu wrote:
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 25-26: surrogates not allowed
I haven't read the rest of the thread, but this error specifically is not lxml's fault. The error message is clear: surrogates are not allowed. So you need to strip them before feeding the text to lxml.

https://stackoverflow.com/a/3158428

Here's what I came up with:

XML10_RE = re.compile(u'[^\u0009\n\u0020-\ud7ff\U00010000-\U0010FFFF]', flags=re.UNICODE)

Some additional XML-related regexps, if you need them:

https://github.com/arskom/spyne/blob/9ce69afe4fa7139fb1d0c968e66150e1ee19b99...

Hth,
Burak
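As a rough illustration of the stripping approach, the pattern above could be applied as follows. This is a sketch for Python 3; the helper name strip_invalid_xml_chars is hypothetical and does not come from the thread or from lxml.

```python
import re

# Pattern taken from the message above, written as a Python 3 str literal.
# Note: as written it also removes '\r' and U+E000..U+FFFD, which the
# XML 1.0 character range itself permits.
XML10_RE = re.compile('[^\u0009\n\u0020-\ud7ff\U00010000-\U0010FFFF]')


def strip_invalid_xml_chars(text):
    """Hypothetical helper: drop every character the pattern rejects."""
    return XML10_RE.sub('', text)


# Lone surrogates (as produced by surrogateescape decoding) are removed,
# while ordinary non-ASCII characters such as 'α' are kept.
cleaned = strip_invalid_xml_chars('NT-PGC-1α\udcce\udcb1')
print(cleaned)
```

Stripping discards the bytes that the surrogates stood for, so it hides the symptom rather than recovering the original character.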
You might be bitten by the behaviour described in this bug report:

https://bugs.launchpad.net/lxml/+bug/1002581

Maybe the workarounds sketched there are of some help for you.

It looks like libxml2 does different things for XML vs. HTML parsing with respect to encodings, e.g. different default encoding assumptions (also depending on iconv support in your environment).

You can see this if you try etree.parse() instead of html.parse(), which works for this simple example as the HTML happens to be well-formed XML:

$ cat main_etree.py
import sys
from lxml import html, etree
doc = etree.parse(sys.stdin)
print doc.xpath('//div')[0].text

$ python2.7 main_etree.py < main.html
NT-PGC-1α

Holger

Landesbank Baden-Wuerttemberg
Institution under public law
Head offices: Stuttgart, Karlsruhe, Mannheim, Mainz
HRA 12704
Amtsgericht Stuttgart
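To illustrate what a different default encoding assumption does to the data, here is a small standard-library sketch. It assumes, purely for illustration, that the HTML parser fell back to a Latin-1 style single-byte encoding; the thread does not state which encoding was actually used.

```python
# Sketch only: the Latin-1 fallback is an assumption for illustration.

raw = 'NT-PGC-1α'.encode('utf-8')  # 'α' is stored as the bytes CE B1

as_utf8 = raw.decode('utf-8')      # the intended reading
as_latin1 = raw.decode('latin-1')  # each byte read as its own character

print(as_utf8)
print(as_latin1)
```

Under that assumption the single character α turns into two characters, which is the usual shape of this kind of corruption.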
I need to use text_content() besides just 'text'. But text_content() does not exist in etree. What is the substitute for text_content() in etree?

$ cat main.py
#!/usr/bin/env python
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:

import sys
from lxml import etree
tree = etree.parse(sys.stdin, parser=etree.HTMLParser(encoding='utf-8'))
print(tree.xpath('//div')[0].text)
print(tree.xpath('//div')[0].text_content())

$ cat main.sh
#!/usr/bin/env bash
# vim: set noexpandtab tabstop=2:

./main.py <<EOF
<html><body><div>α</div></body></html>
EOF

$ ./main.sh
α
Traceback (most recent call last):
  File "./main.py", line 8, in <module>
    print(tree.xpath('//div')[0].text_content())
AttributeError: 'lxml.etree._Element' object has no attribute 'text_content'

--
Regards,
Peng
Hi,

Regarding text() vs. text_content(): have you investigated including text() as part of the XPath expression?

Regarding Python 3 vs. Python 2: sorry it didn't work out; it was just a suggestion of a path to follow.

Best, /PA

On 14 February 2018 at 15:40, Peng Yu <pengyu.ut@gmail.com> wrote:
> I need to use text_content() besides just 'text'. [...]
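For plain lxml.etree elements, two ways of collecting the text of an element together with its descendants are the XPath string() function and Element.itertext(). The sketch below uses a made-up sample document with a nested <b> element to show the difference from .text; it is an illustration, not the approach confirmed by anyone in the thread.

```python
from lxml import etree

# Made-up sample input, passed as UTF-8 bytes.
HTML = '<html><body><div>NT-PGC-1<b>α</b> protein</div></body></html>'.encode('utf-8')

parser = etree.HTMLParser(encoding='utf-8')
root = etree.fromstring(HTML, parser)
div = root.xpath('//div')[0]

# .text only covers the text before the first child element.
direct_text = div.text

# XPath string() returns the concatenated text of the element and
# all of its descendants.
via_xpath = div.xpath('string()')

# itertext() yields the same text pieces in document order.
via_itertext = ''.join(div.itertext())

print(direct_text)
print(via_xpath)
print(via_itertext)
```

An expression such as //div//text() would instead return a list of separate text nodes that still need to be joined.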
I suspect that you can't simply use the XML parser in the more general HTML case, unless you can be sure the HTML is also well-formed XML (or ensure this somehow by cleaning it up first). It really depends on your data.

Have you tried the workarounds described in the bug report above? Namely:

"
[...]
Note that you can work around this by either:
- Having <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> in the HTML document, or
- Using lxml.etree with an lxml.etree.HTMLParser object, passing encoding='utf-8' to the HTMLParser constructor.
[...]
"

Which allows you to do something like this:

import sys
from lxml import html
parser = html.HTMLParser(encoding='utf-8')
doc = html.parse(sys.stdin, parser=parser)
print doc.xpath('//div')[0].text
print doc.xpath('//div')[0].text_content()

Holger
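A Python 3 variant of the same idea might look like the sketch below. It is an illustration under assumptions: the input is held in memory as UTF-8 bytes (io.BytesIO stands in for a byte stream such as sys.stdin.buffer), and the sample document is the one from the original post.

```python
import io

from lxml import html

# Sample document from the thread, as UTF-8 bytes. In a script this
# would be a byte stream such as sys.stdin.buffer (an assumption here).
data = '<html><body><div>NT-PGC-1α</div></body></html>'.encode('utf-8')

# Tell the HTML parser explicitly which encoding the bytes use.
parser = html.HTMLParser(encoding='utf-8')
doc = html.parse(io.BytesIO(data), parser=parser)

div = doc.xpath('//div')[0]
text = div.text
content = div.text_content()  # available because lxml.html elements are used

print(text)
print(content)
```

Because the parser comes from lxml.html, the resulting elements keep text_content(), which plain lxml.etree elements do not provide.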
participants (4)
- Burak Arslan
- Holger Joukl
- Pedro Andres Aranda Gutierrez
- Peng Yu