The following example shows that UTF-8 characters are not preserved (α comes out garbled).
Does anybody know how to fix the problem? Thanks.

$ cat main.py
#!/usr/bin/env python
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:

import sys
from lxml import html
doc = html.parse(sys.stdin)
print doc.xpath('//div')[0].text
print doc.xpath('//div')[0].text_content()
$ cat main.html
<html>
<body>
<div>NT-PGC-1α</div>
</body>
</html>
$ file main.html
main.html: HTML document text, UTF-8 Unicode text
$ ./main.py < main.html
α
α

You might be bitten by the behaviour described in this bug report:

https://bugs.launchpad.net/lxml/+bug/1002581

Maybe the workarounds sketched there are of some help to you.

It looks like libxml2 handles encodings differently for XML and HTML parsing, e.g. it makes different default encoding assumptions (which also depend on iconv support in your environment). You can see this if you try etree.parse() instead of html.parse(), which works for this simple example because the HTML happens to be well-formed XML:

$ cat main_etree.py
import sys
from lxml import html, etree
doc = etree.parse(sys.stdin)
print doc.xpath('//div')[0].text
$ python2.7 main_etree.py < main.html
NT-PGC-1α

Holger
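
To make the default-encoding point concrete, here is a minimal sketch in Python 3 syntax (the scripts in this thread are Python 2). It assumes the HTML parser fell back to ISO-8859-1 for the undeclared input, which the thread does not confirm. The first part uses only the standard library to show how the UTF-8 bytes of α turn into two characters under that assumption. The helper `div_text` is a hypothetical name introduced here; it passes an explicit `html.HTMLParser(encoding="utf-8")` to `html.parse()` so the parser does not have to guess.

```python
import io

# The document from the thread, as the UTF-8 bytes that would arrive on stdin.
RAW = "<html><body><div>NT-PGC-1\u03b1</div></body></html>".encode("utf-8")

# Assumed mechanism: alpha is two bytes in UTF-8 (0xCE 0xB1). Decoding those
# bytes as ISO-8859-1 yields two separate characters instead of one.
garbled = RAW.decode("iso-8859-1")

# The damage is reversible as long as no bytes were dropped.
repaired = garbled.encode("iso-8859-1").decode("utf-8")


def div_text(raw_bytes):
    """Return the text of the first <div>, parsing the bytes as UTF-8.

    Hypothetical helper for illustration; lxml is a third-party package and
    is imported here so the standard-library part above runs without it.
    """
    from lxml import html

    # Tell the HTML parser the encoding instead of relying on its default.
    parser = html.HTMLParser(encoding="utf-8")
    doc = html.parse(io.BytesIO(raw_bytes), parser=parser)
    return doc.xpath("//div")[0].text
```

In a script like main.py the same parser object could be passed as `html.parse(sys.stdin, parser=parser)`. On Python 3, reading from `sys.stdin.buffer` instead hands the parser raw bytes. Declaring the charset in the HTML itself with a `<meta>` tag is another route, if the input can be changed.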