[lxml-dev] LXML utf-8 problem...

20 Feb 2009

Hi all,
	Unfortunately, I'm running into an error that I thought I had licked
before.  I'm running lxml 2.1.2 on OS X with Python 2.5.  I have a
'str' object that contains HTML with utf-8 bytes and a utf-8 encoding
specified in the XML declaration, which should be handled properly, to my
understanding, but is not:

douglas$ python
Python 2.5.1 (r251:54863, Jul 23 2008, 11:00:16)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml.html
>>> lxml.html.parse(u'<?xml version="1.0" encoding="utf-8"?><html><body><p>\xa9</p></body></html>'.encode('utf-8'))
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/private/var/folders/Qk/QkUDmW61GouWAJY+hKgwL++++TI/-Tmp-/tmp0Dd1ub/lxml-2.1.2-py2.5-macosx-10.5-i386.egg/lxml/html/__init__.py", line 651, in parse
   File "lxml.etree.pyx", line 2578, in lxml.etree.parse (src/lxml/lxml.etree.c:25269)
   File "parser.pxi", line 1466, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:63768)
   File "parser.pxi", line 1495, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:64012)
   File "parser.pxi", line 1395, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:63169)
   File "parser.pxi", line 969, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:60461)
   File "parser.pxi", line 538, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:56751)
   File "parser.pxi", line 624, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:57595)
   File "parser.pxi", line 558, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:56936)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 53: ordinal not in range(128)

Why is ASCII being used as the codec?  The encoding is properly declared
in the string, and the character is valid (in this case a copyright
symbol).  What can I do?
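
A minimal sketch of where the offset in the error message comes from, written for Python 3 rather than the Python 2.5 used in the session. It assumes the document string is the single unwrapped line passed to lxml.html.parse, with the XML declaration closed by "?>"; try_decode is a hypothetical helper introduced for illustration, and nothing here calls lxml.

```python
# Python 3 sketch; the original session used Python 2.5.
# Assumption: the document is one unwrapped line, declaration closed with "?>".
doc = '<?xml version="1.0" encoding="utf-8"?><html><body><p>\xa9</p></body></html>'
data = doc.encode('utf-8')

# The copyright sign U+00A9 becomes the two bytes 0xC2 0xA9 in UTF-8.
offset = data.index(b'\xc2')


def try_decode(raw, codec):
    """Hypothetical helper: return (text, None) on success, (None, error) on failure."""
    try:
        return raw.decode(codec), None
    except UnicodeDecodeError as exc:
        return None, exc


ascii_text, ascii_error = try_decode(data, 'ascii')
utf8_text, utf8_error = try_decode(data, 'utf-8')

print(offset)            # byte offset of the first non-ASCII byte
print(ascii_error)       # the ASCII decode failure, if any
print(utf8_text == doc)  # whether the UTF-8 round trip restores the text
```

The offset of the first non-ASCII byte matches the position reported in the traceback, so the failure looks like an ASCII decode of the whole byte string rather than a problem with the character itself. The traceback also passes through _parseDocumentFromURL, which suggests the argument was treated as a file name or URL rather than as document content; that reading is an inference from the traceback, not something confirmed in this message.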

Douglas Mayle