[lxml-dev] lxml about Target Parser

Hi all, when I use lxml with a self-defined target parser, there is a method that can be redefined: data, i.e. def data(self, data). When is it used? And what happens if its body is just a single line, "return"? Is there any encoding conversion?

qhlonline wrote:
When I use lxml with a self-defined target parser, there is a method that can be redefined: data, i.e. def data(self, data). When is it used?
When you want to receive character content from the document you parse.
And what happens if its body is just a single line, "return"?
Nothing? Actually, a "pass" will do in that case, as will not implementing the method (IIRC).
Is there any encoding conversion?
You will get either ASCII encoded byte strings or unicode strings, just like everywhere else.
BTW, it's sometimes faster to try these things out than to ask a mailing list.
Stefan
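As a way of trying this out, here is a minimal sketch of a target parser whose data method collects character content, next to one whose data method does nothing. The class names TextCollector and IgnoreText and the helper collect_text are made up for illustration. The sketch assumes Python 3, where the text arrives as str, and it falls back to the standard library's ElementTree (which offers the same target interface) if lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib ElementTree as a stand-in
    import xml.etree.ElementTree as etree


class TextCollector:
    """Hypothetical target that gathers all character content."""

    def __init__(self):
        self.chunks = []

    def start(self, tag, attrib):
        pass  # element opened; not needed for this sketch

    def end(self, tag):
        pass  # element closed; not needed for this sketch

    def data(self, data):
        # May be called several times for one text node,
        # so collect the pieces and join them at the end.
        self.chunks.append(data)

    def close(self):
        text = "".join(self.chunks)
        self.chunks = []
        return text


class IgnoreText(TextCollector):
    """Same target, but data() throws the text away."""

    def data(self, data):
        pass


def collect_text(xml_bytes, target):
    # With a target parser, fromstring() returns whatever close() returns.
    parser = etree.XMLParser(target=target)
    return etree.fromstring(xml_bytes, parser)


doc = b"<root>Hello <b>world</b></root>"
collected = collect_text(doc, TextCollector())
ignored = collect_text(doc, IgnoreText())
print(repr(collected))
print(repr(ignored))
```

The only difference between the two targets is the body of data, so comparing the two printed values shows what the method contributes.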

2009-06-03,"Stefan Behnel" <stefan_ml@behnel.de> :
qhlonline wrote:
When I use lxml with a self-defined target parser, there is a method that can be redefined: data, i.e. def data(self, data). When is it used?
When you want to receive character content from the document you parse.
And what happens if its body is just a single line, "return"?
Nothing? Actually, a "pass" will do in that case, as will not implementing the method (IIRC).
Is there any encoding conversion?
You will get either ASCII encoded byte strings or unicode strings, just like everywhere else.
BTW, it's sometimes faster to try these things out than to ask a mailing list.
Stefan
Hi Stefan,

My last mail mixed up two problems: the <meta charset> problem and the target parser data function problem. I have run some tests, and the results show that they are separate.

When I do not define the data function in my target parser, the decoding error I got while parsing http://www.jiayuan.com/ goes away. It does not solve the partial parsing caused by the <meta> encoding declaration, though: http://www.sina.com/ can be parsed, but the result is incomplete.

I have dealt with this second problem in two ways.

The first is to change the content being parsed. After reading the HTML string from http://www.sina.com/, I replace the content="charset **" attribute value of every <meta> tag with content="" so that libxml2 does not switch encodings. This is somewhat dangerous, because most of the time the <meta> declaration should be respected.

The second works for Chinese sites only. The largest Chinese character set at present is GB18030, so I changed the libxml2 source code to always use GB18030 as the decoder. This only fixes wrong <meta charset> declarations on Chinese sites (where the declared encoding differs from the actual encoding of the page), and I do not know whether sites in other languages have similarly irregular <meta> declarations.

Although the http://www.jiayuan.com/ decoding error is solved, I do not know why. I found the trick of leaving out the data function of my target parser only through a lot of testing, and I am still looking for the reason. Could you give me some suggestions?
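For reference, the two workarounds might look roughly like this at the Python level. This is only a sketch, not the code actually used: the helper names blank_meta_charset and decode_as_gb18030 and the regular expressions are assumptions, the first helper only handles the content="...charset=..." form of the declaration, and the second decodes in Python instead of patching libxml2.

```python
import re

# Assumed pattern: any <meta ...> tag whose attributes mention "charset".
META_CHARSET_RE = re.compile(rb"<meta\b[^>]*charset[^>]*>", re.IGNORECASE)
# Assumed pattern: a quoted content attribute inside such a tag.
CONTENT_ATTR_RE = re.compile(
    rb"""content\s*=\s*(?:"[^"]*"|'[^']*')""", re.IGNORECASE
)


def blank_meta_charset(html_bytes):
    """Workaround 1: empty the content attribute of charset <meta> tags."""

    def _blank(match):
        return CONTENT_ATTR_RE.sub(b'content=""', match.group(0))

    return META_CHARSET_RE.sub(_blank, html_bytes)


def decode_as_gb18030(html_bytes):
    """Workaround 2 (Python-level variant): always decode as GB18030.

    Only sensible for pages known to use a GB-family encoding;
    undecodable bytes are replaced rather than raising an error.
    """
    return html_bytes.decode("gb18030", errors="replace")


sample = (
    b'<html><head>'
    b'<meta http-equiv="Content-Type" content="text/html; charset=gb2312">'
    b'<title>x</title></head></html>'
)
blanked = blank_meta_charset(sample)
decoded = decode_as_gb18030("中文".encode("gb2312"))
print(blanked)
print(decoded)
```

As with the original first method, blanking the declaration discards information that is correct on most pages, so it is best limited to sites known to declare the wrong encoding.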