[lxml-dev] Error (?) with UTF-8 document and Python unicode repr.
Hi.

First of all, thanks for a great XML/HTML library! The API is really well thought out. I'm coming here with a problem with an HTML doc encoded in UTF-8:

    $ cat test_doc.html
    <?xml version="1.0" encoding="UTF-8" ?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
      <title>Łąka</title>
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    </head>
    <body>
      <h1>Gdańsk</h1>
    </body>
    </html>

(The "title" and "h1" contents are UTF-8 strings that can also be encoded as latin2.) From raw Python everything seems to be as expected:
    >>> sdata = open('test_doc.html').read()
    >>> sdata[219:240]
    '<title>\xc5\x81\xc4\x85ka</title>'
    >>> udata = unicode(sdata, 'utf-8')
    >>> udata[219:240]
    u'<title>\u0141\u0105ka</title>\n '
    >>> print udata[219:240].encode('latin2')
    <title>Łąka</title>
The last statement prints as expected on my console with latin2 charset. But when using lxml something strange happens:
    >>> from lxml import etree
    >>> t = etree.parse(open('test_doc.html'), etree.HTMLParser())
Now getting title element text:
    >>> t.getroot()[0][0].text
    u'\xc5\x81\xc4\x85ka'
This is strange, because this is a unicode string (as indicated by the leading "u"), but its representation printed to the console is the same as the raw bytes from the 'sdata' variable. I would expect it to be equal to the contents of the 'udata' variable. As a consequence, converting to latin2 doesn't work:

    >>> t.getroot()[0][0].text.encode('latin2')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/share/python2.5/encodings/iso8859_2.py", line 12, in encode
        return codecs.charmap_encode(input,errors,encoding_table)
    UnicodeEncodeError: 'charmap' codec can't encode character u'\xc5' in position 0: character maps to <undefined>

If it's not an error, please tell me. For now I cannot find any reasonable workaround. I'm using the latest lxml, 1.3.6.

Thanks for looking at this problem,
Regards,
Artur
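The symptom above can be reproduced without lxml. What follows is a minimal sketch, assuming the title's UTF-8 bytes were decoded one byte per character as Latin-1 somewhere along the way. That cause is an assumption; nothing in this message confirms it. The sketch uses only the standard library, is written in Python 3 syntax (unlike the Python 2.5 session above), and the helper names simulate_misdecoding and repair_mojibake are hypothetical.

```python
def simulate_misdecoding(raw_bytes):
    # Assumption: each UTF-8 byte was turned into one character,
    # which is exactly what a Latin-1 decode does.
    return raw_bytes.decode("latin-1")


def repair_mojibake(text):
    # Possible workaround: map each character back to its byte value,
    # then decode those bytes as the UTF-8 they really are.
    return text.encode("latin-1").decode("utf-8")


raw = b"\xc5\x81\xc4\x85ka"          # the UTF-8 bytes of the title text
broken = simulate_misdecoding(raw)   # same value lxml returned above
fixed = repair_mojibake(broken)      # the expected text, U+0141 U+0105 k a

print(ascii(broken))
print(ascii(fixed))

# Encoding the repaired string to latin2 works; the broken one cannot be
# encoded because U+00C5 has no mapping in ISO 8859-2.
print(fixed.encode("iso8859_2"))
```

Under the same assumption, the repair applies to a value such as t.getroot()[0][0].text, but only as long as every character in it is below U+0100; text that was decoded correctly would be damaged or rejected by the same round trip.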
participants (2)
- Artur Siekielski
- Frederik Elwert