Hi all,

Unfortunately, I'm running into an error that I thought I had licked before. I'm running lxml 2.1.2 on OS X with Python 2.5. I have a 'str' object that contains HTML as UTF-8 bytes, with a UTF-8 encoding specified by the XML declaration. To my understanding this should be handled properly, but it is not:

douglas$ python
Python 2.5.1 (r251:54863, Jul 23 2008, 11:00:16)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml.html
>>> lxml.html.parse(u'<?xml version="1.0" encoding="utf-8"?><html><body><p>\xa9</p></body></html>'.encode('utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/private/var/folders/Qk/QkUDmW61GouWAJY+hKgwL++++TI/-Tmp-/tmp0Dd1ub/lxml-2.1.2-py2.5-macosx-10.5-i386.egg/lxml/html/__init__.py", line 651, in parse
  File "lxml.etree.pyx", line 2578, in lxml.etree.parse (src/lxml/lxml.etree.c:25269)
  File "parser.pxi", line 1466, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:63768)
  File "parser.pxi", line 1495, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:64012)
  File "parser.pxi", line 1395, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:63169)
  File "parser.pxi", line 969, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:60461)
  File "parser.pxi", line 538, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:56751)
  File "parser.pxi", line 624, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:57595)
  File "parser.pxi", line 558, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:56936)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 53: ordinal not in range(128)
Why is the ASCII codec being used at all? The encoding is properly declared in the string, and the character is valid (in this case a copyright symbol). What can I do?
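As a sanity check on the data itself, here is a minimal sketch that does not involve lxml at all. It assumes a current Python 3 interpreter rather than the 2.5 session above, and the names markup, data, offset, round_trips and ascii_error are purely illustrative. It only tests whether the byte string is well-formed UTF-8 and where an ASCII decode would fail; it does not reproduce the lxml call.

```python
# Sanity check without lxml (Python 3 syntax; all names are illustrative).
# Same markup as in the session above, encoded to UTF-8 bytes.
markup = (u'<?xml version="1.0" encoding="utf-8"?>'
          u'<html><body><p>\xa9</p></body></html>')
data = markup.encode('utf-8')

# The copyright sign U+00A9 is encoded in UTF-8 as the two bytes 0xc2 0xa9.
# Find the byte offset where that pair starts.
offset = data.index(b'\xc2\xa9')

# Decoding as UTF-8 round-trips, so the byte string itself is well formed.
round_trips = data.decode('utf-8') == markup

# Decoding as ASCII fails on the first byte outside the 0-127 range.
try:
    data.decode('ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

print(offset)
print(round_trips)
print(ascii_error)
```

If this behaves as I expect, the bytes themselves are fine and an ASCII decode of the whole byte string fails at the same offset the traceback reports. That would suggest something is decoding my input as ASCII instead of honoring the declared encoding. The traceback also passes through _parseDocumentFromURL and _parseDocFromFile, so the argument may be getting treated as a file name or URL rather than as markup, but that is only a guess on my part.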