Re: [lxml-dev] Weird errors in tostring
Hi, Bruno Barberi Gnecco wrote:
Hi Stefan,
On the other machine everything works fine. FYI, the tree (the root variable) is built with root = lxml.html.fromstring(data). I'm parsing data in UTF-8 and ISO-8859-1, and this particular backtrace happened on an HTML document correctly labelled with a meta charset=iso-8859-1.
You can ask the document which encoding it was parsed with:
>>> print root.getroottree().docinfo.encoding
It should say "iso-8859-1" if the parser picked up the <meta> tag correctly.
It says 'None', actually.
Then that's a clear sign that libxml2 didn't pick up the encoding.
Shouldn't it give the error when *parsing* and creating the tree, instead of when converting the tree to something else?
HTML is parsed with the "recover" option, which lets libxml2 work around all sorts of broken page content *without* raising an error. You can still check the parser's error log to see what happened on the way through the page.
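A small sketch of checking that error log after a recovering parse (the broken markup is invented for illustration):

```python
from lxml import etree

# Parse deliberately broken HTML with the recovering HTML parser.
parser = etree.HTMLParser(recover=True)
broken = b'<html><body><p>one</spam></body></html>'
root = etree.fromstring(broken, parser)

# No exception was raised, but the problems are recorded here:
for entry in parser.error_log:
    print(entry.line, entry.level_name, entry.message)
```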
I thought lxml stored the parsed tree in unicode.
UTF-8, actually, which is much easier (and faster) to handle in C than any other unicode encoding.
Besides, I'm asking for a unicode string:
tostring(root, method='xml', encoding=unicode)
Which lets lxml serialise the tree to a Python unicode character sequence in XML style. I know, this looks simple, but there's actually work being done here.
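In later lxml releases the same request is spelled with the string 'unicode' (encoding=unicode is the Python 2 spelling); a minimal sketch:

```python
from lxml import etree, html

root = html.fromstring('<p>caf\xe9</p>')

# Ask for a Python text string instead of encoded bytes.
text = etree.tostring(root, method='xml', encoding='unicode')
print(text)  # an XML-style serialisation, with non-ASCII kept as characters
```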
Also, maybe the <meta> tag comes behind the <title> in the document? AFAIR, libxml2's HTML parser switches encodings when it sees a <meta> declaration, but it doesn't reparse the document (as most browsers do to work around this problem).
It happens with fragments of HTML as well (I'm actually reading HTML messages). But I was also having this problem with pages downloaded from the internet, where the encoding was incorrectly detected.
Which implies most of the time that it was incorrectly specified as well. That is a very common problem in real world HTML pages. Browsers do a great deal of work in their Quirks mode to figure out the page encoding. libxml2's HTML parser works pretty well, but fortune telling wasn't one of its design goals.
Since I had more information in that case (HTTP headers, with a chardet pass just to be sure) I ended up forcing the encoding with a 'html.decode(encoding)' step before building the tree. I think it's weird that it works (since some pages declare one encoding and use a different one), but it does.
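The decode-first approach can be sketched with the standard library alone; the header handling below is deliberately naive, and chardet (a separate install) would only come in as a fallback when no charset is declared:

```python
# Decode the raw bytes yourself before handing them to the parser,
# using the encoding taken from the HTTP Content-Type header.
raw = b'<html><body><p>caf\xe9</p></body></html>'  # ISO-8859-1 bytes
header = 'text/html; charset=iso-8859-1'            # HTTP Content-Type

# Naive charset extraction, for illustration only; a real client should
# parse the header properly (and could fall back to chardet otherwise).
encoding = header.split('charset=')[-1].strip()
text = raw.decode(encoding)
print(text)
```

Passing the already-decoded text to lxml.html.fromstring() sidesteps libxml2's own encoding detection entirely.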
You might want to strip <meta> Content-Type tags from the string with a regex; that should make sure it works in all cases. Read the function "htmlCheckEncoding()" in libxml2's HTMLparser.c to see what works and what doesn't. For example, there is some code that prevents changing the parser encoding a second time, so that you can override it with the "encoding" parser keyword in lxml.
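A rough sketch of that regex idea, standard library only; real-world <meta> tags vary in attribute order and quoting, so this pattern is purely illustrative:

```python
import re

# Remove <meta> tags that declare a charset, so libxml2 cannot
# switch encodings halfway through the parse.
META_CHARSET_RE = re.compile(r'<meta[^>]*charset[^>]*>', re.IGNORECASE)

page = ('<html><head>'
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        '</head><body>hello</body></html>')
cleaned = META_CHARSET_RE.sub('', page)
print(cleaned)
```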
If the parser gets the encoding wrong, you can try parsing with BeautifulSoup (separate install) by using the fromstring() function in lxml.html.ElementSoup instead. That's quite a bit slower, but it *might* give you better results in this case.
I wrote a little doc section on that topic: http://codespeak.net/lxml/elementsoup.html#using-soupparser-as-a-fallback
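The fallback could be wrapped up roughly like this; note that the slow-path module is lxml.html.ElementSoup in the lxml version discussed here, lxml.html.soupparser in later releases, and either way it needs BeautifulSoup installed:

```python
def parse_html(data):
    """Parse with lxml's fast HTML parser; fall back to the slower
    BeautifulSoup-based parser when that fails (sketch only)."""
    from lxml import html
    try:
        return html.fromstring(data)
    except Exception:
        # lxml.html.ElementSoup in older releases.
        from lxml.html import soupparser
        return soupparser.fromstring(data)

root = parse_html('<p>hello</p>')
print(root.tag)
```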
First, why does it work on one of the machines and not on the other, even with the same data? I installed Python 2.5, but with the same results. Maybe the cause is libxml2 (2.6.30 where it works, 2.6.26 where it doesn't)?
That's almost definitely the reason, yes.
Second, if the tree is created, how to know if the encoding is wrong? I only convert to string much later.
You can serialise immediately, just for testing; that will tell you. Or you can check the parser error log for encoding errors.

Stefan
participants (1)
- Stefan Behnel