parser.feed w/ target not "immediate"
(Bah sorry, hit wrong key)

I'm using lxml to parse Junoscript, which is a protocol that's a single infinite XML document, a bit like XMPP. It goes:

    <junoscript>
      <pdu>...</pdu>
      <pdu>...</pdu>
      <pdu>...</pdu>

I've been using the construct:

    parser = XMLParser(target=SomeClass())
    while getdata():
        parser.feed(data)

...and handling the start/end event callbacks to dispatch PDUs.

I've run into problems in production where, all of a sudden, parser events aren't being dispatched when I expected. The difference seems to be that the data is chunked differently in production, for reasons of timing/load.

I found this thread:

http://thread.gmane.org/gmane.comp.python.lxml.devel/4871/focus=4881

...which suggests this is actually expected, and my understanding of the parser/target stuff is wrong - there's no guarantee that "end" will be called at any particular time. Is this correct?

Since it's an infinite document, I can't call ".close()". Does lxml have any API I can use to handle this, and that also has a push interface? Note I cannot let lxml read or block, since it's being used in an async fashion. I need to push the data in, and get notifications of tag start/end events as soon as the full open/close has happened.

If lxml doesn't have this, can anyone recommend another parser?

Cheers,
Phil
On Wednesday, June 11, 2014 18:34:26 Phil Mayers wrote:
My understanding of the thread you linked to is that this was actually a stream of XML documents, not a single large (endless?) doc. Which makes it essentially non-XML, as an XML doc can only have one root node.

Is this also the case in your situation, i.e. a series of

    <junoscript> ...</junoscript>
    <junoscript> ...</junoscript>
    ...

"documents"?

Holger
No, my document is not a sequence of top-level elements.

Let me put it another way: if I call:

    Parser.feed('<tag>')

...is it guaranteed that the target "start" method will be called before Parser.feed returns?

--
Sent from my phone with, please excuse brevity and typos
On 11/06/14 21:22, Phil Mayers wrote:
It seems this is lxml version-specific, and related to whether it's the first ever call to .feed() on a parser instance. The following test script:

    #!/usr/bin/env python
    from __future__ import print_function
    from lxml import etree

    class T:
        def start(self, tag, attrib, ns):
            print("start", tag, attrib, ns)
        def data(self, data):
            print("data", data)
        def end(self, tag):
            print("end", tag)
        def close(self):
            print("close")

    data = """<?xml version="1.0" encoding="us-ascii"?><tag>head<child>text</child>tail</tag>"""

    print("etree version", etree.__version__)

    print("feeding bulk")
    p = etree.XMLParser(target=T())
    p.feed(data)

    print("feeding bulk w/ initial ''")
    p = etree.XMLParser(target=T())
    p.feed('')
    p.feed(data)

...gives me:

    etree version 2.2.3
    feeding bulk
    feeding bulk w/ initial ''
    start tag {} {}
    data head
    start child {} {}
    data text
    end child
    data tail
    end tag

...note no events from the first parser, versus:

    etree version 3.2.4
    feeding bulk
    start tag {} {}
    data head
    start child {} {}
    data text
    end child
    data tail
    end tag
    feeding bulk w/ initial ''
    start tag {} {}
    data head
    start child {} {}
    data text
    end child
    data tail
    end tag

Seems 2.2 trains have some setup or something in the very first call to parser.feed which prevents events being emitted?
Phil Mayers, 12.06.2014 17:25:
It should, but there isn't really a guarantee. In some cases, parsing a specific construct will be helped by looking ahead a fixed number of bytes, and if they are not there, parsing may get interrupted. This is very unlikely to happen for an end tag, but it depends on the data, so no guarantee. (Definitely not from lxml side, as the actual parser is implemented in libxml2 anyway.)
If an older version shows wrong behaviour and a newer one doesn't, it suggests that it's a bug that was fixed. You might also be using different libxml2 versions in both cases, which might have an impact as well.

Stefan
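Since the libxml2 version may differ between two environments even when the lxml version looks comparable, it can help to print both before comparing parser behaviour. The following is a minimal sketch; the helper name parser_versions is made up for illustration, while LXML_VERSION, LIBXML_VERSION and LIBXML_COMPILED_VERSION are version tuples exposed by lxml.etree.

```python
def parser_versions():
    """Return lxml and libxml2 version tuples, or {} if lxml is not installed.

    parser_versions is a hypothetical helper name, not part of lxml.
    """
    try:
        from lxml import etree
    except ImportError:
        return {}
    return {
        "lxml": etree.LXML_VERSION,
        # libxml2 that is actually loaded at runtime
        "libxml2 (running)": etree.LIBXML_VERSION,
        # libxml2 that lxml was built against
        "libxml2 (compiled against)": etree.LIBXML_COMPILED_VERSION,
    }


for name, version in parser_versions().items():
    print(name, ".".join(str(part) for part in version))
```

Running this in both the test and the production environment should show whether the two differ in lxml, in libxml2, or in both.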
Phil Mayers, 11.06.2014 19:34:
http://lxml.de/parsing.html#incremental-event-parsing

Also see recent threads on this list that mention XMPP.

Stefan
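The linked section describes a pull-style interface: data is pushed in with feed() and completed events are collected afterwards with read_events(), so nothing blocks and close() is never needed. Below is a rough sketch of how that might look for the PDU stream in this thread. The class PduDispatcher and the on_pdu callback are hypothetical names, and the fallback to the standard library (which offers an XMLPullParser with the same two methods) is only there so the sketch runs without lxml. As noted earlier in the thread, there is still no guarantee about which feed() call a given event appears after; the sketch simply handles whatever is ready.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Assumption: fall back to the stdlib parser, which has the same pull API.
    import xml.etree.ElementTree as etree


class PduDispatcher:
    """Push chunks in, hand each completed child of the root to a callback."""

    def __init__(self, on_pdu):
        self._parser = etree.XMLPullParser(events=("start", "end"))
        self._on_pdu = on_pdu  # hypothetical callback, called once per PDU
        self._depth = 0

    def feed(self, chunk):
        self._parser.feed(chunk)
        # Drain whatever the parser has finished so far. This may be nothing
        # if the chunk stopped in the middle of a construct.
        for event, elem in self._parser.read_events():
            if event == "start":
                self._depth += 1
            else:
                self._depth -= 1
                if self._depth == 1:  # a direct child of the root has closed
                    self._on_pdu(elem)
                    # Drop the subtree's content, since the document never
                    # ends. The emptied element itself stays attached to the
                    # root in this sketch.
                    elem.clear()


received = []
dispatcher = PduDispatcher(lambda pdu: received.append(pdu.text))
dispatcher.feed(b"<junoscript><pdu>one</pdu>")
dispatcher.feed(b"<pdu>two</pdu><pdu>three</pdu>")
print(received)
```

The callback has to take what it needs from the element before returning, because the element is cleared immediately afterwards.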
participants (3): jholg@gmx.de, Phil Mayers, Stefan Behnel