Nov. 29, 2007
9:42 a.m.
Hi! On Thursday, 29.11.2007, at 00:19 +0100, Artur Siekielski wrote:
But when using lxml something strange happens:
>>> from lxml import etree
>>> t = etree.parse(open('test_doc.html'), etree.HTMLParser())
Now getting title element text:
>>> t.getroot()[0][0].text
u'\xc5\x81\xc4\x85ka'
Did you try it with the h1 element? Does it have the same problem?

I remember some discussions on the list about a similar problem. As far as I remember, libxml might have problems decoding the title properly, because the charset hint comes after the title has already been parsed.

But I don't currently know of any good workarounds. Maybe somebody else does, or you could have a look at the list archive.

Cheers,
Frederik
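
For what it's worth, the value quoted above looks like UTF-8 bytes that were decoded as Latin-1, which would fit the idea that the title was decoded before the charset hint was seen. Here is a minimal sketch of that reading, using only the standard library. It assumes the document is UTF-8 and that the title is the word "Łąka"; both are inferred from the byte values and are not confirmed anywhere in this thread. The repair() helper is hypothetical, not part of lxml.

```python
# Assumption: the title in test_doc.html is this word, stored as UTF-8.
expected = u'\u0141\u0105ka'  # "Łąka"

# Decoding the UTF-8 bytes as Latin-1 gives one character per byte,
# which is the shape of the value reported for the title element.
garbled = expected.encode('utf-8').decode('latin-1')


def repair(text):
    """Undo a Latin-1 misdecoding of UTF-8 bytes (hypothetical helper)."""
    return text.encode('latin-1').decode('utf-8')


print(repr(garbled))
print(repair(garbled) == expected)
```

If the garbled string compares equal to the one from the parser, the round trip in repair() recovers the intended text after the fact. That only helps with checking the diagnosis; it does not make the parser pick the right encoding in the first place.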