[lxml-dev] Error (?) with UTF-8 document and Python unicode repr.

Hi. First of all, thanks for a great XML/HTML library! The API is really well thought out. I'm coming here with a problem with an HTML document encoded as UTF-8:

$ cat test_doc.html
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Łąka</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<h1>Gdańsk</h1>
</body>
</html>

(The "title" and "h1" contents are UTF-8 strings, decodable to latin2.) From raw Python everything seems to be as expected:
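(Roughly, with 'sdata' as the raw file bytes and 'udata' as the decoded unicode string.)

>>> sdata = open('test_doc.html').read()      # raw UTF-8 bytes
>>> udata = sdata.decode('utf-8')             # decodes cleanly to a unicode string
>>> print udata.encode('iso-8859-2')          # re-encode to latin2 for the console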
The last statement prints as expected on my console with the latin2 charset. But when using lxml something strange happens:
>>> from lxml import etree
>>> t = etree.parse(open('test_doc.html'), etree.HTMLParser())
Now, getting the title element's text:
>>> t.getroot()[0][0].text
u'\xc5\x81\xc4\x85ka'
This is strange, because this is a unicode string (as indicated by the leading "u"), but its representation printed to the console is the same as the raw bytes in the 'sdata' variable. I would expect it to be equal to the contents of the 'udata' variable. As a consequence, converting to latin2 doesn't work; the encode fails with "position 0: character maps to <undefined>". If this is not a bug, please let me know. For now I cannot even find any reasonable workaround. I'm using the latest lxml, 1.3.6. Thanks for looking at this problem. Regards, Artur
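(A minimal sketch of the failure; the latin-1 round trip at the end is only a possible byte-level recovery, assuming the mis-decoded text simply mirrors the raw UTF-8 bytes.)

>>> title = t.getroot()[0][0].text
>>> title.encode('iso-8859-2')               # fails: u'\xc5' has no latin2 mapping
UnicodeEncodeError: ... position 0: character maps to <undefined>
>>> title.encode('latin-1').decode('utf-8')  # re-interpret the code points as UTF-8 bytes
u'\u0141\u0105ka'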

Hi! On Thursday, 29.11.2007, 00:19 +0100, Artur Siekielski wrote:
Did you try it with the h1 element? Does it have the same problem? I remember some discussions on the list about a similar problem. As far as I remember, libxml might have problems decoding the title properly, because the charset hint comes after the title has already been parsed. But I don't currently know of any good workarounds. Maybe somebody else does, or you could have a look at the list archive. Cheers, Frederik

Frederik Elwert wrote:
Yes, with h1 there is the same error. But I noticed that when I moved the meta tag with the charset declaration before <title>, all parsing goes OK, including the h1 tag. So is it a libxml2 bug/limitation (I tried the latest libxml2 from trunk and it's the same)? I'm parsing third-party HTML, so I must find some workaround. Is this a good solution: parse the HTML, change the element order in <head>, serialize the document and parse it again? Regards, Artur

On Thursday, 29.11.2007, 18:21 +0100, Artur Siekielski wrote:
No, I think the better way would be to parse it, look for the encoding (either by looking at <tree>.docinfo.encoding or by looking for the meta tag with find()), and then reparse the unaltered document, now using the "encoding" keyword. This is what Stefan suggests: http://article.gmane.org/gmane.comp.python.lxml.devel/3001/ Cheers, Frederik
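A minimal sketch of that two-pass approach, assuming lxml 2.0 where HTMLParser accepts an "encoding" keyword (see the note below):

from lxml import etree

# First pass: parse only to detect the declared charset.
tree = etree.parse('test_doc.html', etree.HTMLParser())
enc = tree.docinfo.encoding   # or locate the <meta http-equiv="Content-Type"> tag with find()
                              # (enc may be None if nothing was declared)

# Second pass: reparse the unaltered document with the detected encoding.
tree = etree.parse('test_doc.html', etree.HTMLParser(encoding=enc))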

On Thursday, 29.11.2007, 19:41 +0100, Artur Siekielski wrote:
Oh, I'm sorry. This is only supported by the alpha of lxml 2.0. I simply overlooked that. So for the time being, serialisation and reparsing might be the best option, but I haven't tried that. Cheers, Frederik

On Thursday, 29.11.2007, 21:05 +0100, Artur Siekielski wrote:
How stable is the 2.0 alpha? I'm using lxml for parsing HTML and traversing the parsed tree with the etree API and XPath.
I haven't used it yet. But it is reported to be fairly stable. There's going to be a beta soon, and that should freeze the API so that you won't have to change your code later (although I think the API is already quite stable). Cheers, Frederik

participants (2)
- Artur Siekielski
- Frederik Elwert