[lxml-dev] Weird errors in tostring
Hi,

I'm getting a weird error in lxml.html.tostring. It happens on one machine but not on another. Both are using lxml 2.0.2, but one has Python 2.5 (which works all the time) and the other Python 2.4 (which doesn't). Here's the relevant backtrace:

  File "/home/spyder/spyder/core/base.py", line 289, in treetostring
    return tostring(root, method='xml', encoding=unicode)
  File "/usr/lib/python2.4/site-packages/lxml-2.0.2-py2.4-linux-i686.egg/lxml/html/__init__.py", line 1313, in tostring
    encoding=encoding)
  File "lxml.etree.pyx", line 2455, in lxml.etree.tostring
  File "serializer.pxi", line 61, in lxml.etree._tostring
  File "serializer.pxi", line 126, in lxml.etree._tounicode
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 21-24: invalid data

On the other machine all goes well. FYI, the tree (root variable) is being built with root = lxml.html.fromstring(data). I'm parsing data in utf8 and iso-8859-1, and this particular backtrace happened in an HTML document correctly labelled with a meta charset=iso-8859-1.

Do you have any ideas on how to trace what is going wrong?
Hi,

Bruno wrote:
On the other machine all goes well. FYI, the tree (root variable) is being built with root = lxml.html.fromstring(data). I'm parsing data in utf8 and iso-8859-1, and this particular backtrace happened in an HTML document correctly labelled with a meta charset=iso-8859-1.
You can ask the document which encoding it was parsed with:

    >>> print root.getroottree().docinfo.encoding

It should say "iso-8859-1" if the parser picked up the <meta> tag correctly.

Also, maybe the <meta> tag comes after the <title> in the document? AFAIR, libxml2's HTML parser switches encodings when it sees a <meta> declaration, but it doesn't reparse the document (as most browsers do to work around this problem), so anything before the declaration has already been read under the earlier encoding guess.

If the parser gets the encoding wrong, you can try parsing with BeautifulSoup (separate install) by using the fromstring() function in lxml.html.ElementSoup instead. That's quite a bit slower, but it *might* give you better results in this case.

http://codespeak.net/lxml/elementsoup.html

(Note that the soupparser module was added in 2.0.3 to fix the parse() function. Just use the ElementSoup module in 2.0.2.)

Stefan
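
One way to sidestep the late-<meta> problem described above is to find the declared charset yourself and decode the bytes before they reach the parser. The following is only a rough sketch under several assumptions: it is written for current Python 3 rather than the 2.4/2.5 interpreters in this thread, the helper names sniff_meta_charset and decode_html are made up for illustration, and the regular expression is a simplification that does not reproduce what libxml2 or a browser actually does.

```python
import re

# Simplified pattern for charset=... inside a <meta> tag, e.g.
# <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
# (an approximation, not a full HTML tokenizer).
_META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)",
    re.IGNORECASE,
)


def sniff_meta_charset(data, limit=4096):
    """Return the charset named in a <meta> tag within the first
    `limit` bytes of `data`, lowercased, or None if none is found."""
    match = _META_CHARSET.search(data[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


def decode_html(data, fallback="utf-8"):
    """Decode raw HTML bytes with the declared charset, else `fallback`."""
    encoding = sniff_meta_charset(data) or fallback
    try:
        return data.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or incorrect label: latin-1 maps every byte, so this
        # last resort cannot raise (though the text may be wrong).
        return data.decode("latin-1")


# A document whose <meta> comes after a non-ASCII <title>.
sample = (
    "<html><head><title>caf\u00e9</title>"
    '<meta http-equiv="Content-Type" '
    'content="text/html; charset=iso-8859-1">'
    "</head><body>d\u00e9j\u00e0</body></html>"
).encode("latin-1")

text = decode_html(sample)
```

The resulting string could then be handed to lxml.html.fromstring() in place of the raw bytes. Whether the parser still reacts to the <meta> declaration when given already-decoded text is worth checking on the lxml version in use.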
participants (2)
- Bruno
- Stefan Behnel