Have you tried with Python 3? lxml seems to have better UTF-8 support with Python 3. Also make sure that you call the script with LC_ALL=C.
That should make the script run...
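
In case it helps, here is a minimal Python 3 sketch of what I mean. It assumes the input really is UTF-8 and tells the parser so explicitly, since main.html carries no charset declaration for lxml to pick up. The function name first_div_text and the SAMPLE constant are only for illustration, they are not part of your script or of lxml.

```python
from lxml import html

# Same markup as main.html, written as raw UTF-8 bytes (\xce\xb1 is the alpha).
SAMPLE = b"<html>\n  <body>\n    <div>NT-PGC-1\xce\xb1</div>\n  </body>\n</html>\n"


def first_div_text(raw: bytes) -> str:
    """Return the text of the first <div>, decoding the input as UTF-8.

    Assumption: the caller knows the bytes are UTF-8. Without an explicit
    encoding (or a <meta charset> in the document) the HTML parser has to
    guess, and a wrong guess is what turns multi-byte characters into garbage.
    """
    parser = html.HTMLParser(encoding="utf-8")
    doc = html.fromstring(raw, parser=parser)
    return doc.xpath("//div")[0].text_content()
```

To use it on a pipe as in your example, read bytes rather than text, e.g. first_div_text(sys.stdin.buffer.read()), so that lxml does the decoding instead of the locale.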

Best, /PA

On 14 February 2018 at 01:50, Peng Yu <pengyu.ut@gmail.com> wrote:
Hi,

The following example shows that utf-8 characters are not maintained.
(α becomes Î±)

Does anybody know how to fix the problem? Thanks.

$ cat main.py
#!/usr/bin/env python
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:

import sys
from lxml import html
doc = html.parse(sys.stdin)
print doc.xpath('//div')[0].text
print doc.xpath('//div')[0].text_content()
$ cat main.html
<html>
  <body>
    <div>NT-PGC-1α</div>
  </body>
</html>
$ file main.html
main.html: HTML document text, UTF-8 Unicode text
$ ./main.py < main.html
α
α

--
Regards,
Peng

--
Questions are not there to be answered,
questions are there to be asked
Georg Kreisler