Re: [lxml-dev] Weird errors in tostring
Hi, Bruno Barberi Gnecco wrote:
Hi Stefan,
On the other machine everything works fine. FYI, the tree (the root variable) is built with root = lxml.html.fromstring(data). I'm parsing data in UTF-8 and ISO-8859-1, and this particular backtrace happened on an HTML document correctly labelled with a meta charset=iso-8859-1.
You can ask the document which encoding it was parsed with:
>>> print root.getroottree().docinfo.encoding
It should say "iso-8859-1" if the parser picked up the <meta> tag correctly.
It says 'None', actually.
Then that's a clear sign that libxml2 didn't pick up the encoding.
Shouldn't it give the error when *parsing* and creating the tree, instead of when converting the tree to something else?
HTML is parsed with the "recover" option, which lets libxml2 work around all sorts of broken page content *without* raising an error. You can still check the parser's error log to see what happened on the way through the page.
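A small sketch of checking that error log after a recovering parse (the broken markup is invented for illustration):

```python
from lxml import etree

# Parse deliberately broken HTML with the recovering HTML parser.
parser = etree.HTMLParser(recover=True)
broken = b'<html><body><p>one</spam></body></html>'
root = etree.fromstring(broken, parser)

# No exception was raised, but the problems are recorded here:
for entry in parser.error_log:
    print(entry.line, entry.level_name, entry.message)
```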
I thought lxml stored the parsed tree in unicode.
UTF-8, actually, which is much easier (and faster) to handle in C than any other unicode encoding.
Besides, I'm asking for a unicode string:
tostring(root, method='xml', encoding=unicode)
Which lets lxml serialise the tree to a Python unicode character sequence in XML style. I know, this looks simple, but there's actually work being done here.
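In later lxml releases the same request is spelled with the string 'unicode' (encoding=unicode is the Python 2 spelling); a minimal sketch:

```python
from lxml import etree, html

root = html.fromstring('<p>caf\xe9</p>')

# Ask for a Python text string instead of encoded bytes.
text = etree.tostring(root, method='xml', encoding='unicode')
print(text)  # an XML-style serialisation, with non-ASCII kept as characters
```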
Also, maybe the <meta> tag comes behind the <title> in the document? AFAIR, libxml2's HTML parser switches encodings when it sees a <meta> declaration, but it doesn't reparse the document (as most browsers do to work around this problem).
It happens with fragments of HTML as well (I'm actually reading HTML messages). But I was also having this problem with pages downloaded from the internet, where the encoding was incorrectly detected.
Which implies most of the time that it was incorrectly specified as well. That is a very common problem in real world HTML pages. Browsers do a great deal of work in their Quirks mode to figure out the page encoding. libxml2's HTML parser works pretty well, but fortune telling wasn't one of its design goals.
Since I had more information in that case (HTTP headers, with a chardet pass just to be sure) I ended up forcing the encoding with a 'html.decode(encoding)' step before building the tree. I think it's weird that it works (since some pages declare one encoding and use a different one), but it does.
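The decode-first approach can be sketched with the standard library alone; the header handling below is deliberately naive, and chardet (a separate install) would only come in as a fallback when no charset is declared:

```python
# Decode the raw bytes yourself before handing them to the parser,
# using the encoding taken from the HTTP Content-Type header.
raw = b'<html><body><p>caf\xe9</p></body></html>'  # ISO-8859-1 bytes
header = 'text/html; charset=iso-8859-1'            # HTTP Content-Type

# Naive charset extraction, for illustration only; a real client should
# parse the header properly (and could fall back to chardet otherwise).
encoding = header.split('charset=')[-1].strip()
text = raw.decode(encoding)
print(text)
```

Passing the already-decoded text to lxml.html.fromstring() sidesteps libxml2's own encoding detection entirely.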
You might want to strip <meta> Content-Type tags from the string with a regex; that should make sure it works in all cases. Read the function "htmlCheckEncoding()" in libxml2's HTMLparser.c to see what works and what doesn't. For example, there is some code that prevents changing the parser encoding a second time, so that you can override it with the "encoding" parser keyword in lxml.
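A rough sketch of that regex idea, standard library only; real-world <meta> tags vary in attribute order and quoting, so this pattern is purely illustrative:

```python
import re

# Remove <meta> tags that declare a charset, so libxml2 cannot
# switch encodings halfway through the parse.
META_CHARSET_RE = re.compile(r'<meta[^>]*charset[^>]*>', re.IGNORECASE)

page = ('<html><head>'
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        '</head><body>hello</body></html>')
cleaned = META_CHARSET_RE.sub('', page)
print(cleaned)
```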
If the parser gets the encoding wrong, you can try parsing with BeautifulSoup (separate install) by using the fromstring() function in lxml.html.ElementSoup instead. That's quite a bit slower, but it *might* give you better results in this case.
I wrote a little doc section on that topic: http://codespeak.net/lxml/elementsoup.html#using-soupparser-as-a-fallback
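The fallback could be wrapped up roughly like this; note that the slow-path module is lxml.html.ElementSoup in the lxml version discussed here, lxml.html.soupparser in later releases, and either way it needs BeautifulSoup installed:

```python
def parse_html(data):
    """Parse with lxml's fast HTML parser; fall back to the slower
    BeautifulSoup-based parser when that fails (sketch only)."""
    from lxml import html
    try:
        return html.fromstring(data)
    except Exception:
        # lxml.html.ElementSoup in older releases.
        from lxml.html import soupparser
        return soupparser.fromstring(data)

root = parse_html('<p>hello</p>')
print(root.tag)
```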
First, why does it work on one of the machines and not on the other, even with the same data? I installed Python 2.5, but with the same results. Maybe the cause is libxml2 (2.6.30 where it works, 2.6.26 where it doesn't)?
That's almost definitely the reason, yes.
Second, if the tree is created, how to know if the encoding is wrong? I only convert to string much later.
You can serialise immediately, just for testing; that will tell you. Or you can check the parser error log for encoding errors.

Stefan
participants (1)
- Stefan Behnel