Have you tried with Python 3? lxml seems to have better UTF-8 support with Python 3. Also make sure that you call the script with LC_ALL=C. That should make the script run...
It does not work.

$ cat main.py
#!/usr/bin/env python3
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:

import sys
from lxml import html
doc = html.parse(sys.stdin)
print(doc.xpath('//div')[0].text)
print(doc.xpath('//div')[0].text_content())

$ LC_ALL=C python3 ./main.py < main.html
Traceback (most recent call last):
  File "./main.py", line 6, in <module>
    doc = html.parse(sys.stdin)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/lxml/html/__init__.py", line 940, in parse
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
  File "src/lxml/etree.pyx", line 3444, in lxml.etree.parse (src/lxml/etree.c:83185)
  File "src/lxml/parser.pxi", line 1855, in lxml.etree._parseDocument (src/lxml/etree.c:121025)
  File "src/lxml/parser.pxi", line 1875, in lxml.etree._parseFilelikeDocument (src/lxml/etree.c:121308)
  File "src/lxml/parser.pxi", line 1770, in lxml.etree._parseDocFromFilelike (src/lxml/etree.c:120092)
  File "src/lxml/parser.pxi", line 1185, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/etree.c:114820)
  File "src/lxml/parser.pxi", line 598, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/etree.c:107738)
  File "src/lxml/parser.pxi", line 705, in lxml.etree._handleParseResult (src/lxml/etree.c:109406)
  File "src/lxml/etree.pyx", line 326, in lxml.etree._ExceptionContext._raise_if_stored (src/lxml/etree.c:13259)
  File "src/lxml/parser.pxi", line 380, in lxml.etree._FileReaderContext.copyToBuffer (src/lxml/etree.c:105164)
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 25-26: surrogates not allowed

-- 
Regards,
Peng
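
A possible explanation for the traceback above, offered as a sketch rather than a confirmed diagnosis: with LC_ALL=C, Python 3.6 most likely decodes sys.stdin as ASCII with the surrogateescape error handler, so each non-ASCII byte in main.html turns into a lone surrogate, and lxml then fails when it re-encodes that text as UTF-8. The snippet below reproduces that mechanism with the standard library only. The sample character "é" and the helper names reencode_fails and parse_stdin_as_bytes are made up for illustration; the last function assumes lxml is installed and that handing it the raw byte stream (sys.stdin.buffer) avoids the decode step altogether.

```python
import sys

# UTF-8 bytes for "é", standing in for the non-ASCII content of main.html.
raw = "é".encode("utf-8")

# Assumed behaviour of sys.stdin under LC_ALL=C on Python 3.6:
# ASCII decoding with surrogateescape maps each non-ASCII byte to a lone surrogate.
text = raw.decode("ascii", errors="surrogateescape")


def reencode_fails(s):
    """Return True if encoding s as strict UTF-8 raises UnicodeEncodeError."""
    try:
        s.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


def parse_stdin_as_bytes():
    """Hypothetical workaround: give lxml the undecoded byte stream."""
    from lxml import html  # imported lazily; needs lxml installed
    return html.parse(sys.stdin.buffer)
```

If this is indeed the cause, replacing html.parse(sys.stdin) with html.parse(sys.stdin.buffer) in main.py should let lxml detect the document encoding itself, independent of the locale.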