Frederik Elwert wrote:
Hi!
On Thursday, 29.11.2007, at 00:19 +0100, Artur Siekielski wrote:
But when using lxml something strange happens:
from lxml import etree
t = etree.parse(open('test_doc.html'), etree.HTMLParser())
Now getting title element text:
t.getroot()[0][0].text
u'\xc5\x81\xc4\x85ka'

Did you try it with the h1 element? Does it have the same problem?
Yes, the same error occurs with h1. But I noticed that when I moved the meta tag with the charset declaration before <title>, all parsing went OK, including the h1 tag. So is this a libxml2 bug or limitation? (I tried the latest libxml2 from trunk and it behaves the same.)

I'm parsing third-party HTML, so I must find a workaround. Would this be a good solution: parse the HTML, change the order of the elements in <head>, serialize the document, and parse it again?

Regards,
Artur
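
A possible alternative to the parse, reorder, serialize and reparse round trip is to find the declared charset in the raw bytes first and pass it to the parser explicitly, so the position of the meta tag inside <head> no longer matters. The following is only a sketch under assumptions: sniff_meta_charset and parse_html are hypothetical helper names, the regular expression is a rough approximation of a charset declaration rather than a full HTML scanner, and it assumes an lxml version whose HTMLParser accepts an encoding argument.

```python
import re

# Rough pattern for a charset declaration inside a <meta> tag.
# This is an approximation for illustration, not a complete HTML scanner.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def sniff_meta_charset(raw, default="utf-8"):
    """Return the charset named in a <meta> tag of raw (bytes), or default."""
    match = _META_CHARSET.search(raw)
    if match is None:
        return default
    return match.group(1).decode("ascii")


def parse_html(raw):
    """Parse raw HTML bytes with the sniffed encoding forced on the parser."""
    # Imported here so the sniffing helper can be used without lxml installed.
    from lxml import etree

    parser = etree.HTMLParser(encoding=sniff_meta_charset(raw))
    return etree.fromstring(raw, parser)
```

It could be used as root = parse_html(open('test_doc.html', 'rb').read()). The file is opened in binary mode so the sniffing step sees the undecoded bytes. Whether the default of utf-8 is appropriate for documents without any declaration depends on the pages being parsed.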