Re: [lxml] utf-8 unicode in lxml

14 Feb 2018

...
The following example shows that utf-8 characters are not maintained.
(α becomes Î±)
Does anybody know how to fix the problem? Thanks.
$ cat main.py
#!/usr/bin/env python
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:
import sys
from lxml import html
doc = html.parse(sys.stdin)
print doc.xpath('//div')[0].text
print doc.xpath('//div')[0].text_content()
$ cat main.html
<html>
  <body>
    <div>NT-PGC-1α</div>
  </body>
</html>
$ file main.html
main.html: HTML document text, UTF-8 Unicode text
$ ./main.py < main.html
Î±
Î±
You might be bitten by the behaviour described in this bug report:

	https://bugs.launchpad.net/lxml/+bug/1002581

Maybe the workarounds sketched there are of some help for you.
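
One workaround of that kind, as a minimal sketch (I am not claiming it is
literally the one from the bug report): give the HTML parser an explicit
encoding instead of letting it guess. The `raw` byte string is only a
stand-in for main.html so that the snippet is self-contained.

```python
import io
from lxml import html

# Stand-in for main.html (assumption for illustration): UTF-8 encoded bytes
# without any <meta charset=...> declaration, as in the original report.
raw = u'<html><body><div>NT-PGC-1\u03b1</div></body></html>'.encode('utf-8')

# State the input encoding explicitly rather than relying on the parser default.
parser = html.HTMLParser(encoding='utf-8')
doc = html.parse(io.BytesIO(raw), parser=parser)

text = doc.xpath('//div')[0].text
print(repr(text))  # repr avoids depending on the terminal's output encoding
```

In main.py the equivalent change would be html.parse(sys.stdin, parser=parser).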

It looks like libxml2 does different things for XML vs. HTML parsing
with respect to encodings, e.g. different default encoding assumptions
(also depending on iconv support in your environment).
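
To illustrate how a wrong default produces exactly the Î± seen above, here
is a small sketch in plain Python. It assumes the HTML side falls back to
ISO-8859-1 (Latin-1) when the document declares no charset; that fallback is
my assumption here, not something the output proves.

```python
alpha = u'\u03b1'  # GREEK SMALL LETTER ALPHA

# UTF-8 encodes alpha as the two bytes 0xCE 0xB1.
utf8_bytes = alpha.encode('utf-8')

# Assumption: the parser reads those bytes as Latin-1, one character per byte.
# 0xCE is "I with circumflex" and 0xB1 is the plus-minus sign.
misread = utf8_bytes.decode('latin-1')
print(repr(misread))

# Reversing the wrong step recovers the original character.
repaired = misread.encode('latin-1').decode('utf-8')
print(repaired == alpha)
```

The round trip at the end only helps after the fact; telling the parser the
right encoding up front is the cleaner fix.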

You can see this if you try etree.parse() instead of html.parse(),
which works for this simple example as the HTML happens to be well-formed
XML:

$ cat main_etree.py
import sys
from lxml import html, etree
doc = etree.parse(sys.stdin)
print doc.xpath('//div')[0].text
$ python2.7 main_etree.py < main.html
NT-PGC-1α

Holger

Holger Joukl