[lxml-dev] Error (?) with UTF-8 document and Python unicode repr.

Hi. First of all, thanks for a great XML/HTML library! The API is really well thought out. I'm coming here with a problem with an HTML document encoded as UTF-8:

$ cat test_doc.html
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Łąka</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  </head>
  <body>
    <h1>Gdańsk</h1>
  </body>
</html>

(The "title" and "h1" contents are UTF-8 strings, decodable to latin2.) From raw Python everything looks as expected:
>>> sdata = open('test_doc.html').read()
>>> sdata[219:240]
'<title>\xc5\x81\xc4\x85ka</title>'
>>> udata = unicode(sdata, 'utf-8')
>>> udata[219:240]
u'<title>\u0141\u0105ka</title>\n '
>>> print udata[219:240].encode('latin2')
<title>Łąka</title>
The last statement prints as expected on my console, which uses the latin2 charset. But something strange happens when using lxml:
>>> from lxml import etree
>>> t = etree.parse(open('test_doc.html'), etree.HTMLParser())
Now getting title element text:
>>> t.getroot()[0][0].text
u'\xc5\x81\xc4\x85ka'
>>> t.getroot()[0][0].text.encode('latin2')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/share/python2.5/encodings/iso8859_2.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xc5' in position 0: character maps to <undefined>

This is strange: the value is a unicode string (as indicated by the leading "u"), but its printed representation is identical to the raw bytes in 'sdata'. I would expect it to equal the contents of 'udata'. As a consequence, converting to latin2 fails with the UnicodeEncodeError above. If this isn't a bug, please tell me. For now I cannot find any reasonable workaround. I'm using the latest lxml, 1.3.6.

Thanks for looking at this problem,
Regards,
Artur
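For readers hitting the same symptom: the damage is often reversible after the fact, because each code point in the mis-decoded string is really one raw byte of the original UTF-8 sequence. Encoding back to Latin-1 recovers those bytes, which then decode correctly as UTF-8. A minimal sketch (shown in Python 3 syntax; the thread itself uses Python 2):

```python
# lxml returned the UTF-8 bytes decoded as Latin-1 ("mojibake").
mojibake = '\xc5\x81\xc4\x85ka'   # what .text gave for the <title>

# Latin-1 maps every code point below 256 back to the same byte value,
# so encode('latin-1') recovers the original UTF-8 byte sequence.
repaired = mojibake.encode('latin-1').decode('utf-8')
print(repaired)  # Łąka
```

This only works when the parser fell back to a Latin-1-style single-byte decoding, which is what libxml2 appears to do here; it is a recovery trick, not a substitute for declaring the encoding up front.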

Hi! On Thursday, 29.11.2007 at 00:19 +0100, Artur Siekielski wrote:
But when using lxml something strange happens:
>>> from lxml import etree
>>> t = etree.parse(open('test_doc.html'), etree.HTMLParser())
Now getting title element text:
>>> t.getroot()[0][0].text
u'\xc5\x81\xc4\x85ka'
Did you try it with the h1 element? Does it have the same problem? I remember some discussion on the list about a similar problem. As far as I remember, libxml2 can have trouble decoding the title properly because the charset hint comes after the title has already been parsed. I don't currently know a good workaround, though. Maybe somebody else does, or you could have a look at the list archive.

Cheers,
Frederik

Frederik Elwert wrote:
Hi!
On Thursday, 29.11.2007 at 00:19 +0100, Artur Siekielski wrote:
But when using lxml something strange happens:
>>> from lxml import etree
>>> t = etree.parse(open('test_doc.html'), etree.HTMLParser())
Now getting title element text:
>>> t.getroot()[0][0].text
u'\xc5\x81\xc4\x85ka'
Did you try it with the h1 element? Does it have the same problem?
Yes, h1 gives the same error. But I noticed that when I move the meta tag with the charset declaration before <title>, parsing goes fine, including the h1 tag. So is this a libxml2 bug/limitation (I tried the latest libxml2 from trunk and it behaves the same)? I'm parsing third-party HTML, so I have to find a workaround. Would this be a good solution: parse the HTML, change the element order in <head>, serialize the document, and parse it again?

Regards,
Artur
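[Editorial note: since the root cause is only the position of the charset declaration in the byte stream, the reordering does not strictly need a parse/serialize round trip; a byte-level rewrite that hoists the meta tag to right after <head> before parsing can achieve the same effect. A rough sketch under that assumption, using only the standard library; hoist_charset_meta is a hypothetical helper name, not an lxml API:]

```python
import re

def hoist_charset_meta(raw: bytes) -> bytes:
    """Move a <meta ... charset=...> tag to right after <head>, so the
    parser learns the encoding before it sees any document text."""
    meta_re = re.compile(rb'<meta[^>]*charset[^>]*/?>', re.IGNORECASE)
    m = meta_re.search(raw)
    if m is None:
        return raw  # no charset declaration found; leave the bytes alone
    meta = m.group(0)
    without = raw[:m.start()] + raw[m.end():]
    # re-insert the meta tag immediately after the opening <head> tag
    return re.sub(rb'(<head[^>]*>)',
                  lambda mm: mm.group(1) + meta, without, count=1)

html = (b'<html><head><title>\xc5\x81\xc4\x85ka</title>'
        b'<meta http-equiv="Content-Type" '
        b'content="text/html; charset=utf-8" /></head>'
        b'<body><h1>Gda\xc5\x84sk</h1></body></html>')
fixed = hoist_charset_meta(html)
```

The regex is deliberately crude and would need hardening for arbitrary third-party HTML (comments, attribute quoting, missing <head>), but it avoids the double parse.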

On Thursday, 29.11.2007 at 18:21 +0100, Artur Siekielski wrote:
Yes, h1 gives the same error. But I noticed that when I move the meta tag with the charset declaration before <title>, parsing goes fine, including the h1 tag. So is this a libxml2 bug/limitation (I tried the latest libxml2 from trunk and it behaves the same)?
I'm parsing third-party HTML, so I have to find a workaround. Would this be a good solution: parse the HTML, change the element order in <head>, serialize the document, and parse it again?
No, I think the better way would be to parse it, look for the encoding (either via <tree>.docinfo.encoding or by finding the meta tag with find()), and then reparse the unaltered document using the "encoding" keyword. This is what Stefan suggests: http://article.gmane.org/gmane.comp.python.lxml.devel/3001/

Cheers,
Frederik
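[Editorial note: the two-pass idea can also be approximated without lxml 2.0's "encoding" keyword: a first lossy pass sniffs the declared charset, then the raw bytes are decoded properly before being handed to whatever parser you use. A stdlib sketch under that assumption, with html.parser standing in for lxml and CharsetSniffer a hypothetical helper name:]

```python
from html.parser import HTMLParser  # stdlib stand-in; the thread uses lxml

class CharsetSniffer(HTMLParser):
    """First pass: scan only for a charset declared in a meta tag."""
    def __init__(self):
        super().__init__()
        self.encoding = None

    def handle_starttag(self, tag, attrs):
        if tag != 'meta' or self.encoding:
            return
        d = dict(attrs)
        content = d.get('content', '')
        if 'charset=' in content:                 # http-equiv style
            self.encoding = content.split('charset=')[1].strip()
        elif 'charset' in d:                      # HTML5 <meta charset="...">
            self.encoding = d['charset']

raw = (b'<html><head><title>\xc5\x81\xc4\x85ka</title>'
       b'<meta http-equiv="Content-Type" '
       b'content="text/html; charset=utf-8" /></head></html>')

sniffer = CharsetSniffer()
sniffer.feed(raw.decode('ascii', errors='replace'))  # lossy, but tags survive
text = raw.decode(sniffer.encoding or 'utf-8')       # second pass: real decode
```

With the document decoded up front, the charset position inside <head> no longer matters to the parser.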

Frederik Elwert wrote:
No, I think the better way would be to parse it, look for the encoding (either via <tree>.docinfo.encoding or by finding the meta tag with find()), and then reparse the unaltered document using the "encoding" keyword. This is what Stefan suggests: http://article.gmane.org/gmane.comp.python.lxml.devel/3001/
Hi, thanks for the suggestion. But how can I pass the "encoding" keyword? Neither etree.parse nor etree.HTMLParser supports it.

On Thursday, 29.11.2007 at 19:41 +0100, Artur Siekielski wrote:
No, I think the better way would be to parse it, look for the encoding (either via <tree>.docinfo.encoding or by finding the meta tag with find()), and then reparse the unaltered document using the "encoding" keyword. This is what Stefan suggests: http://article.gmane.org/gmane.comp.python.lxml.devel/3001/
Hi, thanks for the suggestion. But how can I pass the "encoding" keyword? Neither etree.parse nor etree.HTMLParser supports it.
Oh, I'm sorry. That is only supported by the alpha of lxml 2.0; I simply overlooked that. So for the time being, serialisation and reparsing might be the best option, though I haven't tried it.

Cheers,
Frederik

Frederik Elwert wrote:
On Thursday, 29.11.2007 at 19:41 +0100, Artur Siekielski wrote:
No, I think the better way would be to parse it, look for the encoding (either via <tree>.docinfo.encoding or by finding the meta tag with find()), and then reparse the unaltered document using the "encoding" keyword. This is what Stefan suggests: http://article.gmane.org/gmane.comp.python.lxml.devel/3001/
Hi, thanks for the suggestion. But how can I pass the "encoding" keyword? Neither etree.parse nor etree.HTMLParser supports it.
Oh, I'm sorry. That is only supported by the alpha of lxml 2.0; I simply overlooked that. So for the time being, serialisation and reparsing might be the best option, though I haven't tried it.
How stable is 2.0 alpha? I'm using lxml for parsing HTML and traversing parsed tree with etree API and XPath.

Am Donnerstag, den 29.11.2007, 21:05 +0100 schrieb Artur Siekielski:
How stable is 2.0 alpha? I'm using lxml for parsing HTML and traversing parsed tree with etree API and XPath.
I haven't used it myself yet, but it is reported to be fairly stable. There's going to be a beta soon, which should freeze the API so that you won't have to change your code later (although I think the API is already quite stable).

Cheers,
Frederik
participants (2)
- Artur Siekielski
- Frederik Elwert