[lxml-dev] Error (?) with UTF-8 document and Python unicode repr.

Hi. First of all, thanks for a great XML/HTML library! The API is really well thought out. I'm coming here with a problem with an HTML document encoded as UTF-8:

$ cat test_doc.html
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Łąka</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  </head>
  <body>
    <h1>Gdańsk</h1>
  </body>
</html>

(The "title" and "h1" contents are UTF-8 strings, decodable to latin2.) From raw Python everything looks as expected:
>>> sdata = open('test_doc.html').read()
>>> sdata[219:240]
'<title>\xc5\x81\xc4\x85ka</title>'
>>> udata = unicode(sdata, 'utf-8')
>>> udata[219:240]
u'<title>\u0141\u0105ka</title>\n '
>>> print udata[219:240].encode('latin2')
<title>Łąka</title>
The last statement prints as expected on my console, which uses the latin2 charset. But something strange happens when using lxml:
>>> from lxml import etree
>>> t = etree.parse(open('test_doc.html'), etree.HTMLParser())
Now getting title element text:
>>> t.getroot()[0][0].text
u'\xc5\x81\xc4\x85ka'
>>> t.getroot()[0][0].text.encode('latin2')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/share/python2.5/encodings/iso8859_2.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xc5' in position 0: character maps to <undefined>

This is strange: the value is a unicode string (as indicated by the leading "u"), but its printed representation is identical to the raw bytes in 'sdata'. I would expect it to equal the contents of 'udata'. As a consequence, converting to latin2 fails with the UnicodeEncodeError above. If this isn't a bug, please tell me. For now I cannot find any reasonable workaround. I'm using the latest lxml, 1.3.6.

Thanks for looking at this problem,
Regards,
Artur
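For readers hitting the same symptom: the damage is often reversible after the fact, because each code point in the mis-decoded string is really one raw byte of the original UTF-8 sequence. Encoding back to Latin-1 recovers those bytes, which then decode correctly as UTF-8. A minimal sketch (shown in Python 3 syntax; the thread itself uses Python 2):

```python
# lxml returned the UTF-8 bytes decoded as Latin-1 ("mojibake").
mojibake = '\xc5\x81\xc4\x85ka'   # what .text gave for the <title>

# Latin-1 maps every code point below 256 back to the same byte value,
# so encode('latin-1') recovers the original UTF-8 byte sequence.
repaired = mojibake.encode('latin-1').decode('utf-8')
print(repaired)  # Łąka
```

This only works when the parser fell back to a Latin-1-style single-byte decoding, which is what libxml2 appears to do here; it is a recovery trick, not a substitute for declaring the encoding up front.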

Hi! On Thursday, 29.11.2007 at 00:19 +0100, Artur Siekielski wrote:
But when using lxml something strange happens:
>>> from lxml import etree
>>> t = etree.parse(open('test_doc.html'), etree.HTMLParser())
Now getting title element text:
>>> t.getroot()[0][0].text
u'\xc5\x81\xc4\x85ka'
Did you try it with the h1 element? Does it have the same problem? I remember some discussion on the list about a similar problem. As far as I remember, libxml2 can have trouble decoding the title properly because the charset hint comes after the title has already been parsed. I don't currently know a good workaround, though. Maybe somebody else does, or you could have a look at the list archive.

Cheers,
Frederik

Frederik Elwert wrote:
Hi!
On Thursday, 29.11.2007 at 00:19 +0100, Artur Siekielski wrote:
But when using lxml something strange happens:
>>> from lxml import etree
>>> t = etree.parse(open('test_doc.html'), etree.HTMLParser())
Now getting title element text:
>>> t.getroot()[0][0].text
u'\xc5\x81\xc4\x85ka'
Did you try it with the h1 element? Does it have the same problem?
Yes, h1 gives the same error. But I noticed that when I move the meta tag with the charset declaration before <title>, parsing goes fine, including the h1 tag. So is this a libxml2 bug/limitation (I tried the latest libxml2 from trunk and it behaves the same)? I'm parsing third-party HTML, so I have to find a workaround. Would this be a good solution: parse the HTML, change the element order in <head>, serialize the document, and parse it again?

Regards,
Artur
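[Editorial note: since the root cause is only the position of the charset declaration in the byte stream, the reordering does not strictly need a parse/serialize round trip; a byte-level rewrite that hoists the meta tag to right after <head> before parsing can achieve the same effect. A rough sketch under that assumption, using only the standard library; hoist_charset_meta is a hypothetical helper name, not an lxml API:]

```python
import re

def hoist_charset_meta(raw: bytes) -> bytes:
    """Move a <meta ... charset=...> tag to right after <head>, so the
    parser learns the encoding before it sees any document text."""
    meta_re = re.compile(rb'<meta[^>]*charset[^>]*/?>', re.IGNORECASE)
    m = meta_re.search(raw)
    if m is None:
        return raw  # no charset declaration found; leave the bytes alone
    meta = m.group(0)
    without = raw[:m.start()] + raw[m.end():]
    # re-insert the meta tag immediately after the opening <head> tag
    return re.sub(rb'(<head[^>]*>)',
                  lambda mm: mm.group(1) + meta, without, count=1)

html = (b'<html><head><title>\xc5\x81\xc4\x85ka</title>'
        b'<meta http-equiv="Content-Type" '
        b'content="text/html; charset=utf-8" /></head>'
        b'<body><h1>Gda\xc5\x84sk</h1></body></html>')
fixed = hoist_charset_meta(html)
```

The regex is deliberately crude and would need hardening for arbitrary third-party HTML (comments, attribute quoting, missing <head>), but it avoids the double parse.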

On Thursday, 29.11.2007 at 18:21 +0100, Artur Siekielski wrote:
Yes, h1 gives the same error. But I noticed that when I move the meta tag with the charset declaration before <title>, parsing goes fine, including the h1 tag. So is this a libxml2 bug/limitation (I tried the latest libxml2 from trunk and it behaves the same)?
I'm parsing third-party HTML, so I have to find a workaround. Would this be a good solution: parse the HTML, change the element order in <head>, serialize the document, and parse it again?
No, I think the better way would be to parse it, look for the encoding (either via <tree>.docinfo.encoding or by finding the meta tag with find()), and then reparse the unaltered document using the "encoding" keyword. This is what Stefan suggests: http://article.gmane.org/gmane.comp.python.lxml.devel/3001/

Cheers,
Frederik
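[Editorial note: the two-pass idea can also be approximated without lxml 2.0's "encoding" keyword: a first lossy pass sniffs the declared charset, then the raw bytes are decoded properly before being handed to whatever parser you use. A stdlib sketch under that assumption, with html.parser standing in for lxml and CharsetSniffer a hypothetical helper name:]

```python
from html.parser import HTMLParser  # stdlib stand-in; the thread uses lxml

class CharsetSniffer(HTMLParser):
    """First pass: scan only for a charset declared in a meta tag."""
    def __init__(self):
        super().__init__()
        self.encoding = None

    def handle_starttag(self, tag, attrs):
        if tag != 'meta' or self.encoding:
            return
        d = dict(attrs)
        content = d.get('content', '')
        if 'charset=' in content:                 # http-equiv style
            self.encoding = content.split('charset=')[1].strip()
        elif 'charset' in d:                      # HTML5 <meta charset="...">
            self.encoding = d['charset']

raw = (b'<html><head><title>\xc5\x81\xc4\x85ka</title>'
       b'<meta http-equiv="Content-Type" '
       b'content="text/html; charset=utf-8" /></head></html>')

sniffer = CharsetSniffer()
sniffer.feed(raw.decode('ascii', errors='replace'))  # lossy, but tags survive
text = raw.decode(sniffer.encoding or 'utf-8')       # second pass: real decode
```

With the document decoded up front, the charset position inside <head> no longer matters to the parser.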

Frederik Elwert wrote:
No, I think the better way would be to parse it, look for the encoding (either via <tree>.docinfo.encoding or by finding the meta tag with find()), and then reparse the unaltered document using the "encoding" keyword. This is what Stefan suggests: http://article.gmane.org/gmane.comp.python.lxml.devel/3001/
Hi, thanks for the suggestion. But how can I pass the "encoding" keyword? Neither etree.parse nor etree.HTMLParser supports it.

On Thursday, 29.11.2007 at 19:41 +0100, Artur Siekielski wrote:
No, I think the better way would be to parse it, look for the encoding (either via <tree>.docinfo.encoding or by finding the meta tag with find()), and then reparse the unaltered document using the "encoding" keyword. This is what Stefan suggests: http://article.gmane.org/gmane.comp.python.lxml.devel/3001/
Hi, thanks for the suggestion. But how can I pass the "encoding" keyword? Neither etree.parse nor etree.HTMLParser supports it.
Oh, I'm sorry. That is only supported by the alpha of lxml 2.0; I simply overlooked that. So for the time being, serialisation and reparsing might be the best option, though I haven't tried it.

Cheers,
Frederik

Frederik Elwert wrote:
On Thursday, 29.11.2007 at 19:41 +0100, Artur Siekielski wrote:
No, I think the better way would be to parse it, look for the encoding (either via <tree>.docinfo.encoding or by finding the meta tag with find()), and then reparse the unaltered document using the "encoding" keyword. This is what Stefan suggests: http://article.gmane.org/gmane.comp.python.lxml.devel/3001/
Hi, thanks for the suggestion. But how can I pass the "encoding" keyword? Neither etree.parse nor etree.HTMLParser supports it.
Oh, I'm sorry. That is only supported by the alpha of lxml 2.0; I simply overlooked that. So for the time being, serialisation and reparsing might be the best option, though I haven't tried it.
How stable is 2.0 alpha? I'm using lxml for parsing HTML and traversing parsed tree with etree API and XPath.

Am Donnerstag, den 29.11.2007, 21:05 +0100 schrieb Artur Siekielski:
How stable is 2.0 alpha? I'm using lxml for parsing HTML and traversing parsed tree with etree API and XPath.
I haven't used it myself yet, but it is reported to be fairly stable. There's going to be a beta soon, which should freeze the API so that you won't have to change your code later (although I think the API is already quite stable).

Cheers,
Frederik
participants (2)
- Artur Siekielski
- Frederik Elwert