[lxml-dev] Error (?) with UTF-8 document and Python unicode repr.

Hi. First of all, thanks for a great XML/HTML library! The API is really well thought out. I'm coming here with a problem with an HTML document encoded as UTF-8:

$ cat test_doc.html
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Łąka</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<h1>Gdańsk</h1>
</body>
</html>

(The "title" and "h1" contents are UTF-8 strings, decodable to latin2.) From raw Python everything seems to be as expected:
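(Roughly, with 'sdata' as the raw file bytes and 'udata' as the decoded unicode string.)

>>> sdata = open('test_doc.html').read()      # raw UTF-8 bytes
>>> udata = sdata.decode('utf-8')             # decodes cleanly to a unicode string
>>> print udata.encode('iso-8859-2')          # re-encode to latin2 for the console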
The last statement prints as expected on my console with the latin2 charset. But when using lxml something strange happens:
>>> from lxml import etree
>>> t = etree.parse(open('test_doc.html'), etree.HTMLParser())
Now, getting the title element's text:
>>> t.getroot()[0][0].text
u'\xc5\x81\xc4\x85ka'
This is strange, because this is a unicode string (as indicated by the leading "u"), but its representation printed to the console is the same as the raw bytes in the 'sdata' variable. I would expect it to be equal to the contents of the 'udata' variable. As a consequence, converting to latin2 doesn't work; the encode fails with "position 0: character maps to <undefined>". If this is not a bug, please let me know. For now I cannot even find any reasonable workaround. I'm using the latest lxml, 1.3.6. Thanks for looking at this problem. Regards, Artur
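(A minimal sketch of the failure; the latin-1 round trip at the end is only a possible byte-level recovery, assuming the mis-decoded text simply mirrors the raw UTF-8 bytes.)

>>> title = t.getroot()[0][0].text
>>> title.encode('iso-8859-2')               # fails: u'\xc5' has no latin2 mapping
UnicodeEncodeError: ... position 0: character maps to <undefined>
>>> title.encode('latin-1').decode('utf-8')  # re-interpret the code points as UTF-8 bytes
u'\u0141\u0105ka'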

Hi! On Thursday, 29.11.2007, 00:19 +0100, Artur Siekielski wrote:
Did you try it with the h1 element? Does it have the same problem? I remember some discussions on the list about a similar problem. As far as I remember, libxml might have problems decoding the title properly, because the charset hint comes after the title has already been parsed. But I don't currently know of any good workarounds. Maybe somebody else does, or you could have a look at the list archive. Cheers, Frederik

Frederik Elwert wrote:
Yes, with h1 there is the same error. But I noticed that when I moved the meta tag with the charset declaration before <title>, all parsing goes OK, including the h1 tag. So is it a libxml2 bug/limitation (I tried the latest libxml2 from trunk and it's the same)? I'm parsing third-party HTML, so I must find some workaround. Is this a good solution: parse the HTML, change the element order in <head>, serialize the document and parse it again? Regards, Artur

On Thursday, 29.11.2007, 18:21 +0100, Artur Siekielski wrote:
No, I think the better way would be to parse it, look for the encoding (either by looking at <tree>.docinfo.encoding or by looking for the meta tag with find()), and then reparse the unaltered document, now using the "encoding" keyword. This is what Stefan suggests: http://article.gmane.org/gmane.comp.python.lxml.devel/3001/ Cheers, Frederik
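A minimal sketch of that two-pass approach, assuming lxml 2.0 where HTMLParser accepts an "encoding" keyword (see the note below):

from lxml import etree

# First pass: parse only to detect the declared charset.
tree = etree.parse('test_doc.html', etree.HTMLParser())
enc = tree.docinfo.encoding   # or locate the <meta http-equiv="Content-Type"> tag with find()
                              # (enc may be None if nothing was declared)

# Second pass: reparse the unaltered document with the detected encoding.
tree = etree.parse('test_doc.html', etree.HTMLParser(encoding=enc))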

On Thursday, 29.11.2007, 19:41 +0100, Artur Siekielski wrote:
Oh, I'm sorry. This is only supported by the alpha of lxml 2.0. I simply overlooked that. So for the time being, serialisation and reparsing might be the best option, but I haven't tried that. Cheers, Frederik

On Thursday, 29.11.2007, 21:05 +0100, Artur Siekielski wrote:
How stable is the 2.0 alpha? I'm using lxml for parsing HTML and traversing the parsed tree with the etree API and XPath.
I haven't used it yet. But it is reported to be fairly stable. There's going to be a beta soon, and that should freeze the API so that you won't have to change your code later (although I think the API is already quite stable). Cheers, Frederik

participants (2)
- Artur Siekielski
- Frederik Elwert