[lxml-dev] LXML utf-8 problem...
Hi all,

Unfortunately, I'm running into an error that I thought I had licked before. I'm running lxml 2.1.2 on OS X and Python 2.5. I have a 'str' object that contains HTML with UTF-8 bytes and a UTF-8 encoding specified by the directive, which should be properly handled, to my understanding, but is not:

    douglas$ python
    Python 2.5.1 (r251:54863, Jul 23 2008, 11:00:16)
    [GCC 4.0.1 (Apple Inc. build 5465)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import lxml.html
    >>> lxml.html.parse(u'<?xml version="1.0" encoding="utf-8"? <html><body><p>\xa9</p></body></html>'.encode('utf-8'))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/private/var/folders/Qk/QkUDmW61GouWAJY+hKgwL++++TI/-Tmp-/tmp0Dd1ub/lxml-2.1.2-py2.5-macosx-10.5-i386.egg/lxml/html/__init__.py", line 651, in parse
      File "lxml.etree.pyx", line 2578, in lxml.etree.parse (src/lxml/lxml.etree.c:25269)
      File "parser.pxi", line 1466, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:63768)
      File "parser.pxi", line 1495, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:64012)
      File "parser.pxi", line 1395, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:63169)
      File "parser.pxi", line 969, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:60461)
      File "parser.pxi", line 538, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:56751)
      File "parser.pxi", line 624, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:57595)
      File "parser.pxi", line 558, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:56936)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 53: ordinal not in range(128)
Why is ascii being used as a codec? It's properly identified in the string. It's a valid character (in this case a copyright symbol). What can I do?
Hi, Douglas Mayle wrote:
> Unfortunately, I'm running into an error that I thought I had licked before. I'm running lxml 2.1.2 on OS X and Python 2.5. I have a 'str' object that contains HTML with UTF-8 bytes and a UTF-8 encoding specified by the directive, which should be properly handled, to my understanding, but is not:
>
>     douglas$ python
>     Python 2.5.1 (r251:54863, Jul 23 2008, 11:00:16)
>     [GCC 4.0.1 (Apple Inc. build 5465)] on darwin
>     Type "help", "copyright", "credits" or "license" for more information.
>     >>> import lxml.html
>     >>> lxml.html.parse(u'<?xml version="1.0" encoding="utf-8"? <html><body><p>\xa9</p></body></html>'.encode('utf-8'))
>     Traceback (most recent call last):
>     [...]
>     UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 53: ordinal not in range(128)
:) The error message is a bit misleading here. parse() takes a file name as its argument, which in your case is a UTF-8 encoded byte sequence. When lxml.etree tries to parse, it fails to find the file and thus tries to raise an error. It then fails because it cannot format the error message.

I haven't tried, but it should work with 2.2.

Stefan
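The masking effect Stefan describes can be reproduced in a few lines. The following is only a sketch, written for Python 3, where the ASCII decode that Python 2 performed implicitly has to be spelled out. Both helper functions and the wording of their error messages are hypothetical and are not lxml's actual code; only the byte string is taken from the report.

```python
# Python 3 sketch of one error masking another. lxml 2.1.2 ran on Python 2,
# where the ASCII decode below happened implicitly during string formatting.

# The byte string from the report, which parse() treated as a file name.
FILENAME = (
    u'<?xml version="1.0" encoding="utf-8"? '
    u'<html><body><p>\xa9</p></body></html>'
).encode('utf-8')


def open_with_fragile_message(name):
    """Hypothetical helper: builds its error message via a strict ASCII decode."""
    try:
        return open(name, 'rb')
    except OSError:
        # Stand-in for Python 2's implicit str -> unicode conversion.
        # It raises UnicodeDecodeError before the real error is reported.
        raise IOError('Error reading file %s' % name.decode('ascii'))


def open_with_safe_message(name):
    """Hypothetical helper: escapes undecodable bytes instead of failing."""
    try:
        return open(name, 'rb')
    except OSError:
        readable = name.decode('ascii', 'backslashreplace')
        raise IOError('Error reading file %s' % readable)
```

Calling `open_with_fragile_message(FILENAME)` ends in a `UnicodeDecodeError` for byte 0xc2, so the caller never learns that the file was simply not found. `open_with_safe_message(FILENAME)` raises the intended `IOError` with the non-ASCII bytes escaped.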
On Fri, 2009-02-20 at 15:10 -0500, Douglas Mayle wrote:
> Hi all,
>
> Unfortunately, I'm running into an error that I thought I had licked before. I'm running lxml 2.1.2 on OS X and Python 2.5. I have a 'str' object that contains HTML with UTF-8 bytes and a UTF-8 encoding specified by the directive, which should be properly handled, to my understanding, but is not:
>
>     >>> import lxml.html
>     >>> lxml.html.parse(u'<?xml version="1.0" encoding="utf-8"? <html><body><p>\xa9</p></body></html>'.encode('utf-8'))
>     Traceback (most recent call last):
>     [...]
>     UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 53: ordinal not in range(128)
>
> Why is ascii being used as a codec? It's properly identified in the string. It's a valid character (in this case a copyright symbol). What can I do?
I think this could be a problem with Python itself! Take this code:

    import urllib
    content = urllib.urlopen(url).read(-1)
    content = content.decode('cp1252')
    print content

With a page encoded as windows-1252, printing to stdout works and the output looks fine, but if I send it through a pipe, like:

    python getcontent.py | grep something

it gives the error that you mention. Don't ask me why, but adding .encode('utf-8'):

    content = content.decode('cp1252').encode('utf-8')

fixes the problem.

Hope that helps. Regards,
-- Sérgio M. B.
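A likely explanation for the pipe behaviour: in Python 2, `sys.stdout.encoding` is `None` when stdout is not a terminal, so `print` falls back to ASCII when given a unicode object. That would normally surface as a `UnicodeEncodeError`, which is related to but not the same as the `UnicodeDecodeError` in the original report. The sketch below imitates the situation in Python 3 with an in-memory stream; the helper `write_text` and the sample text are made up for illustration.

```python
import io

# cp1252 byte 0xA9 is the copyright sign; decoding yields a text string.
text = b'\xa9 2009'.decode('cp1252')


def write_text(text, encoding):
    """Write text through a stream with the given encoding; return the bytes."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)   # the stream encodes the text using `encoding`
    stream.flush()
    return raw.getvalue()


# An ASCII-only stream (like a Python 2 pipe) cannot represent the character.
try:
    write_text(text, 'ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Choosing UTF-8 explicitly, as the .encode('utf-8') workaround does, succeeds.
utf8_bytes = write_text(text, 'utf-8')
```

Encoding explicitly before printing sidesteps the fallback because the program then hands plain bytes to the stream and no implicit codec is involved.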
Actually, after digging further, I found out that it's a problem with the error reporting mechanism in lxml. If you have encoded Unicode data inside a 'str' object (which is normal for many HTML and XML documents), the lxml error reporting incorrectly decodes the string while trying to emit an error, which causes a new error that masks the original one. As mentioned earlier in this thread, it should be fixed in the newest version of lxml.

In any case, I copied code from elsewhere in my program and forgot to switch from parse (which takes a filename or URL) to fromstring (which takes text data). parse was raising an error because it didn't receive a filename, and that error message included the incorrectly decoded "filename", which caused a new error...

Doug

On Feb 21, 2009, at 12:56 AM, Sergio Monteiro Basto wrote:
> I think this could be a problem with Python itself! Take this code:
>
>     content = urllib.urlopen(url).read(-1)
>     content = content.decode('cp1252')
>     print content
>
> With a page encoded as windows-1252, printing to stdout works and the output looks fine, but if I send it through a pipe, like: python getcontent.py | grep something, it gives the error that you mention.
>
> Don't ask me why, but adding .encode('utf-8'):
>
>     content = content.decode('cp1252').encode('utf-8')
>
> fixes the problem.
>
> Hope that helps. Regards,
> -- Sérgio M. B.
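The distinction Doug ran into, parse for a file name or URL versus fromstring for the markup itself, can be sketched as follows. To keep the example runnable without lxml, it uses the standard library's `xml.etree.ElementTree` as a stand-in, on the assumption that it draws the same line between the two functions as the thread describes for lxml. The markup is the one from the report, with the XML declaration closed by `?>` so that it is well-formed.

```python
import io
import xml.etree.ElementTree as ET

# In-memory document as UTF-8 bytes, declaring its own encoding.
DATA = (
    u'<?xml version="1.0" encoding="utf-8"?>'
    u'<html><body><p>\xa9</p></body></html>'
).encode('utf-8')

# fromstring() takes the markup itself and returns the root element.
root = ET.fromstring(DATA)

# parse() takes a file name or file object, so in-memory data must be wrapped.
tree = ET.parse(io.BytesIO(DATA))

# Passing the markup straight to parse() makes it look for a file by that name.
try:
    ET.parse(DATA)
    parse_error = None
except OSError as exc:
    parse_error = exc
```

With the stand-in, the mistaken call fails cleanly with a file-not-found style error, which is the error that the decoding problem hid in the lxml 2.1.2 report.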
participants (3)
- Douglas Mayle
- Sergio Monteiro Basto
- Stefan Behnel