Mailman 3 utf-8 unicode in lxml - lxml - The Python XML Toolkit

13 Feb 2018

Hi,

The following example shows that non-ASCII UTF-8 characters are not
preserved when an HTML file is parsed with lxml (α becomes Î±).

Does anybody know how to fix the problem? Thanks.

$ cat main.py
#!/usr/bin/env python
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:

import sys
from lxml import html
doc = html.parse(sys.stdin)
print doc.xpath('//div')[0].text
print doc.xpath('//div')[0].text_content()
$ cat main.html
<html>
  <body>
    <div>NT-PGC-1α</div>
  </body>
</html>
$ file main.html
main.html: HTML document text, UTF-8 Unicode text
$ ./main.py < main.html
Î±
Î±

-- 
Regards,
Peng
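
A possible explanation, offered as a hedged sketch and not as a confirmed
answer from this thread: main.html declares no charset, so the HTML parser
may fall back to a Latin-1 style default and read the two UTF-8 bytes of α
as two separate characters. The snippet below first reproduces that
mechanism with plain string methods, then tries one commonly suggested
remedy, passing an explicit encoding to html.HTMLParser. It assumes
Python 3 (the script above is Python 2), and first_div_text is a
hypothetical helper name used only for illustration.

```python
import io

# Mechanism: the UTF-8 bytes for "α" (0xCE 0xB1), read as Latin-1,
# come out as the two characters "Î±".
utf8_bytes = "α".encode("utf-8")
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)

try:
    from lxml import html
except ImportError:
    html = None  # lxml is not installed; the lxml part is skipped


def first_div_text(raw_bytes):
    """Parse HTML bytes as UTF-8 and return the text of the first <div>.

    Hypothetical helper: it assumes the input really is UTF-8 encoded.
    """
    parser = html.HTMLParser(encoding="utf-8")
    doc = html.parse(io.BytesIO(raw_bytes), parser=parser)
    return doc.xpath("//div")[0].text


if html is not None:
    page = "<html><body><div>NT-PGC-1α</div></body></html>".encode("utf-8")
    print(first_div_text(page))
```

In a script that reads standard input under Python 3, the same parser
could be given sys.stdin.buffer so that it receives raw bytes. Declaring
the charset inside the HTML file itself (a meta charset tag) is another
option that may avoid the problem without touching the script.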
