Hi,
The following example shows that UTF-8 characters are not preserved when I parse HTML from stdin with lxml.
(The α in the input comes out garbled in the output.)
Does anybody know how to fix the problem? Thanks.
$ cat main.py
#!/usr/bin/env python
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:
import sys
from lxml import html
doc = html.parse(sys.stdin)
print doc.xpath('//div')[0].text
print doc.xpath('//div')[0].text_content()
$ cat main.html
<html>
<body>
<div>NT-PGC-1α</div>
</body>
</html>
$ file main.html
main.html: HTML document text, UTF-8 Unicode text
$ ./main.py < main.html
α
α
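
My guess, and it is only a guess, is that the parser falls back to Latin-1 because main.html declares no charset. The sketch below does not use lxml at all; it only shows what the UTF-8 bytes of the div text look like when decoded as Latin-1 versus UTF-8. The variable names are mine, for illustration.

```python
# -*- coding: utf-8 -*-
# Sketch under an assumption: the garbling comes from UTF-8 bytes
# being decoded as Latin-1. This reproduces that effect without lxml.
raw = u'NT-PGC-1\u03b1'.encode('utf-8')  # the div text as UTF-8 bytes
garbled = raw.decode('latin-1')          # each byte becomes one character
intact = raw.decode('utf-8')             # the two-byte sequence becomes one alpha

print(repr(garbled))
print(repr(intact))
```

If that assumption holds, passing an explicit parser such as html.HTMLParser(encoding='utf-8') to html.parse(), or adding a meta charset declaration to main.html, might be enough, but I have not confirmed either.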
--
Regards,
Peng