[lxml-dev] Target parser parsing error
Hi all,

Here is more information about the parsing error I get when I use a target parser to parse http://www.jiayuan.com/ . The fatal error reported is:

Input is not proper UTF-8, indicate encoding !

To find the place where this problem occurs, I tried to convert the HTML string encoding with iconv directly. This also reports an error, and the character index of the error in the string is exactly the same as in my lxml test. So it seems clear that this parsing error is caused by iconv's encoding conversion from utf-8 to utf-8 when there are illegal characters in the source.

When I do not define the data function in my target parser, it parses without reporting an error. Does that mean that when I leave out the data function, the UTF-8 to UTF-8 conversion is also skipped? Or is some correct conversion done before the call to the data function?

yours
qhlonline wrote:
Here is more information about the parsing error I get when I use a target parser to parse http://www.jiayuan.com/ . The fatal error reported is: Input is not proper UTF-8, indicate encoding ! To find the place where this problem occurs, I tried to convert the HTML string encoding with iconv directly. This also reports an error, and the character index of the error in the string is exactly the same as in my lxml test. So it seems clear that this parsing error is caused by iconv's encoding conversion from utf-8 to utf-8 when there are illegal characters in the source.
What do you mean by "from utf-8 to utf-8" conversion?
When I do not define the data function in my target parser, it parses without reporting an error. Does that mean that when I leave out the data function, the UTF-8 to UTF-8 conversion is also skipped? Or is some correct conversion done before the call to the data function?
It just means that the parser has ignored your character content.

There are two levels here. The libxml2 parser will parse the byte stream and try to convert it to UTF-8. If that fails but it is asked to "recover" from it, it will just continue without raising an error. I'm not sure what becomes of the data in this case, but apparently there is no guarantee that the invalid bytes that were parsed up to this point get stripped.

The second level is where lxml comes into play. When you define a "data()" method on your target parser, you ask lxml to pass you the character data from the document. lxml's SAX handler will then try to decode the UTF-8 data provided by the libxml2 parser in order to pass it into your method. If the data returned by the parser is not valid UTF-8, this will fail. I assume that this is where the exception that you see originates from, as this is done through the Python codec API.

Does this clear things up?

That said, I could imagine letting the character decoder work around broken data if the "recover" option is enabled, simply by replacing broken content with a replacement character. This would improve the recovery capabilities in your case, without breaking the data any further than it already is.

Stefan
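To make the second level concrete, here is a minimal sketch of lxml's target parser interface, assuming lxml is installed. The `Collector` class and the sample HTML are made up for illustration; the point is that `data()` only ever sees text that lxml has already decoded from libxml2's UTF-8 buffer into a Python string.

```python
from lxml import etree

class Collector:
    """Target object: lxml calls data() with decoded character data."""
    def __init__(self):
        self.chunks = []

    def data(self, text):
        # lxml's SAX handler decodes libxml2's UTF-8 buffer to a Python
        # string before this call; invalid UTF-8 would fail right here.
        self.chunks.append(text)

    def close(self):
        # With a target parser, fromstring() returns whatever close() returns.
        return "".join(self.chunks)

parser = etree.HTMLParser(target=Collector())
result = etree.fromstring(b"<html><body><p>hello</p></body></html>", parser)
print(result)
```

If `data()` is not defined on the target, lxml never attempts that decode step for character content, which matches the observation that the error disappears.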
Hi,

2009-06-03, "Stefan Behnel" <stefan_ml@behnel.de> wrote:
qhlonline wrote:
Here is more information about the parsing error I get when I use a target parser to parse http://www.jiayuan.com/ . The fatal error reported is: Input is not proper UTF-8, indicate encoding ! To find the place where this problem occurs, I tried to convert the HTML string encoding with iconv directly. This also reports an error, and the character index of the error in the string is exactly the same as in my lxml test. So it seems clear that this parsing error is caused by iconv's encoding conversion from utf-8 to utf-8 when there are illegal characters in the source.
What do you mean by "from utf-8 to utf-8" conversion?

I am not sure whether this conversion actually takes place. But when I convert the HTML content string this way, it reports an illegal character error at some position in the string, and it is exactly the position where my lxml target parser generates its error. Doesn't that confirm that there are illegal characters in the HTML content?
When I do not define the data function in my target parser, it parses without reporting an error. Does that mean that when I leave out the data function, the UTF-8 to UTF-8 conversion is also skipped? Or is some correct conversion done before the call to the data function?
It just means that the parser has ignored your character content. There are two levels here. The libxml2 parser will parse the byte stream and try to convert it to UTF-8. If that fails but it is asked to "recover" from it, it will just continue without raising an error. Not sure what becomes of the data in this case, but apparently there is no guarantee that the invalid bytes that were parsed up to this point get stripped.
I agree with you. I had wondered what libxml2 would do when an illegal character arrives; your answer makes this clear to me.
The second level is where lxml comes into play. When you define a "data()" method on your target parser, you ask lxml to pass you the character data from the document. lxml's SAX handler will then try to decode the UTF-8 data provided by the libxml2 parser in order to pass it into your method. If the data returned by the parser is not valid UTF-8, this will fail. I assume that this is where the exception that you see originates from, as this is done through the Python codec API.

Yes, that is the case. But the illegal character was introduced outside of lxml and outside of libxml2: the whole string was fetched from a URL using Python's urllib module. So I wonder whether there is some other way to get the HTML content of a URL without illegal characters.
Does this clear things up?

Thank you for your help; I think I have learned more about the lxml parsing process with your guidance. Thank you!
That said, I could imagine letting the character decoder work around broken data if the "recover" option is enabled, simply by replacing broken content with a replacement character. This would improve the recovery capabilities in your case, without breaking the data any further than it already is.
Stefan
Happy days! yours
Hi,

qhlonline wrote:
2009-06-03,"Stefan Behnel" wrote:
The libxml2 parser will parse the byte stream and try to convert it to UTF-8. If that fails but it is asked to "recover" from it, it will just continue without raising an error. Not sure what becomes of the data in this case, but apparently there is no guarantee that the invalid bytes that were parsed up to this point get stripped.
I agree with you. I had wondered what libxml2 would do when an illegal character arrives; your answer makes this clear to me.
Then it's clearer to you than to me. I'm actually not convinced yet that this is the case. I was rather guessing, based on my (limited) knowledge of the problem you observe, which I have never seen in the wild myself. The parser in libxml2 uses levelled buffers that copy the data during decoding. That may already be a sufficient barrier against such problems.

What about posting a self-contained Python module, stripped down to the minimum, that shows the unexpected behaviour? Nothing that accesses the internet or the like, just embed a sufficient part of a failing web page as a string (possibly base64 encoded). That way, others can try to reproduce the problem on their side and debug it.
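A skeleton for such a self-contained test might look like the following, assuming lxml is available. The byte string here is a hypothetical stand-in for the relevant fragment of the real page (`\xfe\xff` is not valid UTF-8); what actually happens with the real data is exactly what posting the test would show, so no particular outcome is claimed.

```python
import base64
from lxml import etree

# In a real report, this would hold the failing fragment of the actual page.
SAMPLE = base64.b64encode(b"<html><body><p>abc\xfe\xffdef</p></body></html>")

class Target:
    """Minimal target parser that collects character data."""
    def __init__(self):
        self.data_chunks = []

    def data(self, text):
        self.data_chunks.append(text)

    def close(self):
        return self.data_chunks

parser = etree.HTMLParser(target=Target(), recover=True)
try:
    chunks = etree.fromstring(base64.b64decode(SAMPLE), parser)
    print("parsed, character data:", chunks)
except (etree.LxmlError, UnicodeDecodeError) as exc:
    print("parser error:", exc)
```

Embedding the data base64-encoded keeps the invalid bytes intact regardless of how the module file itself is saved or mailed.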
The second level is where lxml comes into the play. When you define a "data()" method on your target parser, you ask lxml to pass you the character data from the document. lxml's SAX handler will then try to decode the UTF-8 data provided by the libxml2 parser to pass it into your method. If the data returned by the parser is not valid UTF-8, this will fail. I assume that this is where the exception that you see originates from, as this is done through the Python Codec API.
Yes, that is the case. But the illegal character was introduced outside of lxml and outside of libxml2: the whole string was fetched from a URL using Python's urllib module. So I wonder whether there is some other way to get the HTML content of a URL without illegal characters.
Well, as I said before: if the HTML is broken, there is no way to make sure the parser can read all data 'correctly' (whatever that means in this context). If the web page adheres to an encoding and just fails to declare it correctly, your best bet is to decode the page into a unicode string yourself, catch and handle any decoding errors in a suitable way, and pass that unicode string into the parser.

Stefan
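A minimal sketch of that suggestion, using only the standard library for the decoding step. The byte string stands in for the body returned by `urllib` for the page in question; `errors="replace"` is one of the standard codec error handlers and substitutes U+FFFD for each undecodable byte.

```python
# Stand-in for: raw = urllib.request.urlopen("http://www.jiayuan.com/").read()
raw = b"<p>ok \xfe broken</p>"

# Decode into a unicode string ourselves, handling bad bytes explicitly,
# instead of letting the parser hit them during its own UTF-8 decoding.
text = raw.decode("utf-8", errors="replace")
print(text)

# The cleaned unicode string can then be handed to the parser, e.g.:
#   tree = lxml.etree.fromstring(text, lxml.etree.HTMLParser())
```

Other error handlers ("ignore", or a custom handler registered via `codecs.register_error`) work the same way; the key point is that the error handling happens before the data reaches libxml2.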
participants (2)
-
qhlonline
-
Stefan Behnel