[lxml-dev] Encoding problems with lxml
Hello, I'm having some encoding problems with lxml that I can't solve. My application is a small web mining spider. Pages downloaded can be in any encoding, but I'm expecting mostly UTF-8 and ISO-8859-1. I need to get the parsed data in ISO-8859-1. I'm having two problems:

a) when reading pages in ISO-8859-1, accented characters are converted to HTML entity sequences, such as &agrave; for à. I don't want this to happen; how do I avoid it?

b) I can't convert pages originally in UTF-8 to ISO-8859-1, even using etree.tostring(entry, 'iso-8859-1') or string.encode("iso-8859-1").

Have I missed something in the docs? I want homogeneous behaviour for all encodings, even if it means converting first to UTF-8 and later to ISO-8859-1.

Thanks a lot for any help,
Bruno
Bruno Barberi Gnecco wrote:
I'm having some encoding problems with lxml that I can't solve. My application is a small web mining spider. Pages downloaded can be in any encoding, but I'm expecting mostly utf8 and iso-8859-1. I need to get the parsed data in iso-8859-1.
Note that this may already fail in the decoding step of the parser. If the HTML is so *broken* that libxml2 can't even detect a <meta> encoding tag, it will not know what encoding to use.
I'm having two problems:
a) when reading pages in ISO-8859-1, accented characters are converted to HTML entity sequences, such as &agrave; for à. I don't want this to happen; how do I avoid it?
You can serialise through an XSLT. The lxml.html module in lxml 2.0 will do that for you, but you can easily implement that yourself. Look for "Serialization" in http://codespeak.net/svn/lxml/branch/html/src/lxml/html/__init__.py
b) I can't convert pages originally in UTF to ISO, even using etree.tostring(entry, 'iso-8859-1') or string.encode("iso-8859-1").
Both should work in general (the first being better anyway) - except when you have a <meta> tag in there that says "utf-8" encoding. Then you can't expect the browser to ignore that. lxml will not magically delete it either, you have to do that by hand. IIRC, the XSLT serialisation step should also add one for you.
Have I missed something in the docs? I want to have a homogeneous behavior for all encodings--even if it means to convert first to UTF and later to ISO.
You don't have to, at least, not for working on the tree. lxml will properly encode strings to Python (unicode) strings at the API level - *iff* the parser managed to detect the encoding of the HTML page. If not, you will get garbage. But then that's really the fault of the page. If you have any other way to detect the encoding of a broken page (e.g. all pages from a specific source are undeclared UTF-8 or something), you can also pre-treat the input *before* parsing it, i.e. recode it properly and remove the <meta> tag with a regular expression. Then the parser should no longer have any problems. Stefan
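Stefan's pre-treatment suggestion can be sketched as follows. This is a minimal illustration, not lxml API: the helper name `parse_with_known_encoding` and the regular expression are made up for the example, and it assumes you already know the page's real encoding out of band.

```python
import re
from lxml import etree

# Strip any <meta ... charset=...> declaration so a wrong in-document
# encoding cannot mislead the parser (hypothetical helper, not lxml API).
META_CHARSET = re.compile(r'<meta[^>]+charset\s*=[^>]*>', re.IGNORECASE)

def parse_with_known_encoding(raw_bytes, encoding):
    text = raw_bytes.decode(encoding)       # recode to unicode ourselves
    text = META_CHARSET.sub('', text)       # drop the (possibly lying) <meta> tag
    return etree.HTML(text)                 # unicode input: no re-detection

# Example: bytes that are really ISO-8859-1 but claim to be UTF-8
page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=utf-8"></head>'
        b'<body>caf\xe9</body></html>')
tree = parse_with_known_encoding(page, 'iso-8859-1')
print(tree.findtext('.//body'))
```

With the misleading tag removed, the parser takes the unicode string at face value and the tree contains the correctly decoded text.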
Stefan Behnel wrote:

Thanks a lot for the prompt answer, Stefan.
I'm having some encoding problems with lxml that I can't solve. My application is a small web mining spider. Pages downloaded can be in any encoding, but I'm expecting mostly utf8 and iso-8859-1. I need to get the parsed data in iso-8859-1.
Note that this may already fail in the decoding step of the parser. If the HTML is so *broken* that libxml2 can't even detect a <meta> encoding tag, it will not know what encoding to use.
I see, this is what is happening. I even found a page that declared itself to be ISO-8859-1 but was actually UTF-8. The parser I was using before (PHP's DOM) seemed to get over that somehow, so I hadn't even noticed it.
I'm having two problems:
a) when reading pages in ISO-8859-1, accented characters are converted to HTML entity sequences, such as &agrave; for à. I don't want this to happen; how do I avoid it?
You can serialise through an XSLT. The lxml.html module in lxml 2.0 will do that for you, but you can easily implement that yourself.
Look for "Serialization" in http://codespeak.net/svn/lxml/branch/html/src/lxml/html/__init__.py
....
I only noticed now that this was referring to parsing. Any reason you don't want entities resolved here?
lxml 2.0 will allow you to keep entities in the tree, although they are rarely of any help.
I don't want the characters converted to entities because I'm adding the extracted data to a searchable database (which already exists). If I use sequences such as &ccedil; or &#233;, searching will fail. You may suggest converting the search string to sequences as well, but then I lose useful features such as case insensitivity and unaccented words matching accented ones.

I'd prefer to have a standard encoding on the tree, because my XPath query might contain accented characters. I don't care how it is encoded in the tree, as long as I can query it. And, when I extract the data with etree.tostring(entry, 'iso-8859-1') (entry being the result of an XPath query or a find(tag)), I'd like to have ISO-8859-1 characters instead of entity sequences, whatever the original encoding. This I haven't been able to do yet, any tips?
b) I can't convert pages originally in UTF to ISO, even using etree.tostring(entry, 'iso-8859-1') or string.encode("iso-8859-1").
Both should work in general (the first being better anyway) - except when you have a <meta> tag in there that says "utf-8" encoding. Then you can't expect the browser to ignore that. lxml will not magically delete it either, you have to do that by hand.
But shouldn't tostring() convert to iso, even if it was in utf-8?
Have I missed something in the docs? I want to have a homogeneous behavior for all encodings--even if it means to convert first to UTF and later to ISO.
You don't have to, at least, not for working on the tree. lxml will properly encode strings to Python (unicode) strings at the API level - *iff* the parser managed to detect the encoding of the HTML page. If not, you will get garbage. But then that's really the fault of the page.
If you have any other way to detect the encoding of a broken page (e.g. all pages from a specific source are undeclared UTF-8 or something), you can also pre-treat the input *before* parsing it, i.e. recode it properly and remove the <meta> tag with a regular expression. Then the parser should no longer have any problems.
Hm, I see. Every day I'm reminded of that adage from Tanenbaum: "The good thing about standards is that there are so many to choose from."

I suppose the most robust solution then is to try to find the encoding of the page myself, make sure that the <meta> tag is correct, and possibly check et.docinfo.encoding to see if lxml got it right. Is that it?

Thanks again,
Bruno
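Cross-checking what the parser detected via docinfo.encoding can be done like this (a minimal sketch; the sample page bytes are made up for illustration):

```python
from io import BytesIO
from lxml import etree

# libxml2 records the encoding it used for the document;
# lxml exposes it on the ElementTree as docinfo.encoding.
page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=iso-8859-1"></head>'
        b'<body>caf\xe9</body></html>')

tree = etree.parse(BytesIO(page), etree.HTMLParser())
print(tree.docinfo.encoding)      # the encoding the parser detected
print(tree.findtext('.//body'))   # decoded text from the tree
```

If docinfo.encoding disagrees with what you determined yourself, that is the signal to re-decode and re-parse the page manually.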
Hi Bruno,

I had similar problems, except my HTML was even more broken, so I ended up using ElementTree's TidyHTMLTreeBuilder to first parse the page, then converted the result string with etree.XML() to an lxml tree. This didn't solve the encoding problem, just the broken-HTML problem.

For encoding detection, check out BeautifulSoup, which has very kindly functional-ized its encoding detection ( http://www.crummy.com/software/BeautifulSoup/documentation.html#Beautiful%20... ). And Leonard basically got this encoding detection from here: http://chardet.feedparser.org

Good luck!
-Roger
Bruno Barberi Gnecco wrote:
Stefan Behnel wrote:
Thanks a lot for the prompt answer, Stefan.
I'm having some encoding problems with lxml that I can't solve. My application is a small web mining spider. Pages downloaded can be in any encoding, but I'm expecting mostly utf8 and iso-8859-1. I need to get the parsed data in iso-8859-1.
Note that this may already fail in the decoding step of the parser. If the HTML is so *broken* that libxml2 can't even detect a <meta> encoding tag, it will not know what encoding to use.
I see, this is what is happening. I even found a page that declared itself to be ISO-8859-1 but was actually UTF-8. The parser I was using before (PHP's DOM) seemed to get over that somehow, so I hadn't even noticed it.
There's not much lxml (or libxml2) can do about this. While libxml2 is pretty good at parsing broken HTML, it still can't handle plain tag soup, and it believes a page that explicitly says "I use that encoding".
a) when reading pages in ISO-8859-1, accented characters are converted to HTML entity sequences, such as &agrave; for à. I don't want this to happen; how do I avoid it?
I only noticed now that this was referring to parsing. Any reason you don't want entities resolved here?
lxml 2.0 will allow you to keep entities in the tree, although they are rarely of any help.
I don't want the characters converted to entities because I'm adding the extracted data to a searchable database (which already exists). If I use sequences such as &ccedil; or &#233;, searching will fail.
The parser will not convert any characters to entities, it will give you Unicode strings (or plain strings if it's ASCII). Only the serialiser *may* create entities, depending on the encoding. Look at what you get in the text content inside the tree (not the serialised document), I'd be surprised if you found any entity names in there.
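A quick way to verify this (a minimal sketch): the text stored in the tree is a plain unicode string, and character references only appear when the serialiser cannot represent a character in the requested output encoding.

```python
from lxml import etree

root = etree.HTML("<html><body><p>caf\u00e9</p></body></html>")
p = root.find(".//p")

print(repr(p.text))                                 # plain unicode, no entities
print(etree.tostring(root, encoding="ascii"))       # ASCII can't hold é: &#233;
print(etree.tostring(root, encoding="iso-8859-1"))  # Latin-1 can: raw \xe9 byte
```

The same character thus serialises as a charref or as a raw byte depending purely on the target encoding, while the tree content never changes.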
I'd prefer to have a standard encoding on the tree, because my xpath query might contain accented characters. I don't care how it is encoded in the tree, as long as I can query it.
lxml supports unicode everywhere.
And, when I extract the data with etree.tostring(entry, 'iso-8859-1') (entry being the result of a xpath query or a find(tag)), I'd like to have iso characters instead of sequences, whatever the original encoding. This I haven't been able to do yet, any tips?
Hmmm, interesting. Are you sure the document was parsed correctly? Check the tree to see if you get the correct texts (and not some weird Unicode characters) in there. Here's what I get:
>>> import lxml.etree as et
>>> html = et.HTML("<html><body>üüaösäüöéèàádäüaöü</body></html>")
>>> et.tostring(html, encoding="iso-8859-1")
"<?xml version='1.0' encoding='iso-8859-1'?>\n<html><body>\xfc\xfca\xf6s\xe4\xfc\xf6\xe9\xe8\xe0\xe1d\xe4\xfca\xf6\xfc</body></html>"
No entities in there, just plain Latin-1 characters. I really think you hit a case where the parser detected the wrong encoding, so that the tree couldn't get serialised to your target encoding afterwards.
b) I can't convert pages originally in UTF to ISO, even using etree.tostring(entry, 'iso-8859-1') or string.encode("iso-8859-1").
Both should work in general (the first being better anyway) - except when you have a <meta> tag in there that says "utf-8" encoding. Then you can't expect the browser to ignore that. lxml will not magically delete it either, you have to do that by hand.
But shouldn't tostring() convert to iso, even if it was in utf-8?
That's why you get entities. It's not ISO 8859-1 that's in the tree - at least not from the point of view of the parser.
I suppose the most robust solution then is to try to find the encoding of the page myself, and make sure that <meta> is correct, and possibly check et.docinfo.encoding to see if lxml got it right. Is that it?
In your case, that's definitely the safest bet, especially if you only have two possible input encodings. I'd do this: try decoding it from UTF-8 first (Python's "...".decode()) and if that fails, fall back to decoding it as ISO 8859-1. UTF-8 is a well defined multi-byte encoding, so it's relatively easy to distinguish from single-byte encodings such as ISO-8859-x. Then remove any <meta> encoding tags you find (use a regexp) and pass it into the HTML() factory *as a Python unicode string*.
>>> import lxml.etree as et
>>> input = u"<html><body>üüaösäüöéèàádäüaöü</body></html>"  # Unicode!
>>> html = et.HTML(input)
>>> et.tostring(html, encoding="iso-8859-1")
"<?xml version='1.0' encoding='iso-8859-1'?>\n<html><body>\xfc\xfca\xf6s\xe4\xfc\xf6\xe9\xe8\xe0\xe1d\xe4\xfca\xf6\xfc</body></html>"
Does that solve your problem? Stefan
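The decode-then-parse recipe Stefan describes might be sketched like this (the helper name `decode_page` is made up for the example; note that short Latin-1 texts can occasionally also be valid UTF-8, so this heuristic is good but not infallible):

```python
from lxml import etree

def decode_page(raw_bytes):
    # UTF-8 is a strict multi-byte format, so invalid sequences make the
    # decode fail; ISO-8859-1 accepts any byte, so it is the safe fallback.
    try:
        return raw_bytes.decode('utf-8')
    except UnicodeDecodeError:
        return raw_bytes.decode('iso-8859-1')

utf8_page = '<html><body>caf\u00e9</body></html>'.encode('utf-8')
latin1_page = '<html><body>caf\u00e9</body></html>'.encode('iso-8859-1')

for page in (utf8_page, latin1_page):
    tree = etree.HTML(decode_page(page))   # unicode in: no encoding guessing
    print(etree.tostring(tree, encoding='iso-8859-1'))
```

Both inputs end up as the same unicode tree, so the ISO-8859-1 serialisation is identical regardless of the original encoding, which is exactly the homogeneous behaviour Bruno asked for.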
Bruno Barberi Gnecco wrote:
a) when reading pages in ISO-8859-1, accented characters are converted to HTML entity sequences, such as &agrave; for à. I don't want this to happen; how do I avoid it?
I only noticed now that this was referring to parsing. Any reason you don't want entities resolved here?

lxml 2.0 will allow you to keep entities in the tree, although they are rarely of any help.

Stefan
participants (3)
- Bruno Barberi Gnecco
- Roger Patterson
- Stefan Behnel