Re: [lxml-dev] Full XPath support ?

David Levy wrote:
Thanks! I have tested it successfully too. However, iterparse is not supported :( So I cannot parse large files (>100 MB)? I am trying to mix cElementTree and lxml right now!
lxml right now is not optimal for parsing such large files. Support for iterparse would be nice though likely a lot of work, and full xpath support in that context might be extremely hard to accomplish.
libxml2 does seem to have something called 'streaming xpath' that might be useful in this area though (but this is not yet exposed to lxml either).
Regards,
Martijn
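For reference, the kind of incremental parsing David is asking about can be sketched with the iterparse function in Python's standard-library xml.etree.ElementTree. This is only a minimal illustration, not lxml code; the function name count_items and the sample document are made up for the example.

```python
import io
import xml.etree.ElementTree as ET


def count_items(source, tag):
    """Count elements named `tag` while parsing `source` incrementally."""
    count = 0
    # An "end" event is reported once an element has been read completely.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            # Free the element's children, text and attributes. The emptied
            # element itself stays attached to its parent.
            elem.clear()
    return count


# Hypothetical sample data; a real use case would pass a file name or an
# open file object for the large document instead.
sample = io.BytesIO(b"<root><item>a</item><item>b</item><other/></root>")
print(count_items(sample, "item"))
```

Clearing each element after it has been handled is what keeps memory use low on large inputs; without it the whole tree is still built in memory.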

Hi,
On Tue, 2005-09-13 at 11:03 +0200, Martijn Faassen wrote:
lxml right now is not optimal for parsing such large files. Support for iterparse would be nice though likely a lot of work, and full xpath support in that context might be extremely hard to accomplish.
libxml2 does seem to have something called 'streaming xpath' that might be useful in this area though (but this is not yet exposed to lxml either).
I dunno if this can help, but Libxml2 has the pattern.c module, which is, amongst others, used by the xmlschemas.c module for the XPaths of identity-constraints. It is streamable, but supports only a minimal subset of XPath: only element and attribute tests without predicates, and only the child and descendant-or-self axes. An example would be: ".//foo/boo/@bar" - and that's it already.
Regards,
Kasimier
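To make the subset described above concrete, here is a rough sketch of how such a pattern can be matched on a stream. It does not use or mirror the pattern.c API; it is plain Python on top of the standard-library xml.etree.ElementTree.iterparse, and the names parse_pattern and stream_match are invented for this example. It assumes the root element is the context node and handles only an optional leading ".//", child steps with plain element names, and an optional final "@attribute" step.

```python
import io
import xml.etree.ElementTree as ET


def parse_pattern(pattern):
    """Split './/foo/boo/@bar' into (anywhere, ['foo', 'boo'], 'bar').

    Needs at least one element step; no predicates, no other axes.
    """
    anywhere = pattern.startswith(".//")
    if anywhere:
        pattern = pattern[3:]
    steps = pattern.split("/")
    attr = None
    if steps[-1].startswith("@"):
        attr = steps.pop()[1:]
    return anywhere, steps, attr


def stream_match(source, pattern):
    """Yield matches for `pattern` while reading `source` incrementally.

    Yields the attribute value if the pattern ends in '@name', otherwise
    the tag of each matching element.
    """
    anywhere, steps, attr = parse_pattern(pattern)
    path = []   # tags of the currently open elements below the root
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            if depth == 1:
                continue  # the root element is the context node
            path.append(elem.tag)
            # With './/' only the tail of the path has to match the steps.
            tail = path[-len(steps):] if anywhere else path
            if tail == steps:
                if attr is None:
                    yield elem.tag
                elif attr in elem.attrib:
                    yield elem.attrib[attr]
        else:
            depth -= 1
            if depth >= 1:
                path.pop()
            elem.clear()  # drop content that is no longer needed


doc = (b'<root><foo><boo bar="1"/></foo>'
       b'<x><foo><boo bar="2"/><boo/></foo></x>'
       b'<boo bar="3"/></root>')
print(list(stream_match(io.BytesIO(doc), ".//foo/boo/@bar")))
```

Every decision is made at the start tag using only the names of the open ancestors, which is why this subset needs no look-ahead and no buffering.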

Kasimier Buchcik wrote:
I dunno if this can help, but Libxml2 has the pattern.c module, which is, amongst others, used by the xmlschemas.c module for the XPaths of identity-constraints. It is streamable, but supports only a minimal subset of XPath: only element and attribute tests without predicates, and only the child and descendant-or-self axis. An example would be: ".//foo/boo/@bar" - and that's it already.
Is this the 'streamable XPath' I heard mentions of or is this something else?
Note that I think David is interested in lxml precisely because it offers more complete XPath support than ElementTree; the subset you describe here is quite similar to ElementTree's. So, using pattern.c wouldn't be much of a big win for his particular use case.
Then again, I think streaming anything close to full xpath is a very difficult problem; I recall reading a research paper about it for instance, so David might want something that doesn't really exist yet...
Regards,
Martijn

Hi,
On Tue, 2005-09-13 at 12:46 +0200, Martijn Faassen wrote:
Is this the 'streamable XPath' I heard mentions of or is this something else?
Exactly, it's the pattern.c module. Actually this was started by Daniel to be used for pattern-like matches (as the module name suggests); later it was enhanced to work with XSD identity-constraints and to be used with the XPath module of Libxml2 to optimise processing of such simple subsets.
Note that I think David is interested in lxml precisely because it offers more complete XPath support than ElementTree; the subset you describe here is quite similar to ElementTree's. So, using pattern.c wouldn't be much of a big win for his particular use case.
True.
Then again, I think streaming anything close to full xpath is a very difficult problem; I recall reading a research paper about it for
Yeaaaah, I looked at the streaming path issue some time ago and found some interesting research and solutions, varying from what Libxml2 does now to complex (and boring) mathematical stuff that tries to precompute XPath expressions so that even backward axes can be handled.
instance, so David might want something that doesn't really exist yet...
I can only say that I don't recall anybody ever mentioning that XPath, as a whole, could be processed on a stream.
Regards,
Kasimier
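As a small illustration of why anything beyond the forward-only subset is harder to stream: a backward step such as "//b/parent::a" selects the same nodes as the forward expression "//a[b]", but even the rewritten form cannot be decided at the start tag of "a". The sketch below is an assumption-laden example using the standard-library xml.etree.ElementTree.iterparse, not anything from Libxml2; the function name elements_with_child and the "id" attributes in the sample data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def elements_with_child(source, parent_tag, child_tag):
    """Stream the equivalent of //parent_tag[child_tag].

    Yields the 'id' attribute of every matching parent element.
    """
    stack = []  # one [tag, has_matching_child] entry per open element
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            # A direct child with the wanted name marks its open parent.
            if stack and stack[-1][0] == parent_tag and elem.tag == child_tag:
                stack[-1][1] = True
            stack.append([elem.tag, False])
        else:
            tag, has_child = stack.pop()
            # The answer for a parent is only known once it is closed.
            if tag == parent_tag and has_child:
                yield elem.get("id")
            elem.clear()


doc = (b'<r><a id="1"><b/></a><a id="2"><c><b/></c></a>'
       b'<a id="3"/><a id="4"><c/><b/></a></r>')
print(list(elements_with_child(io.BytesIO(doc), "a", "b")))
```

The matcher has to keep state for every open candidate until its end tag arrives, and the amount of state needed grows quickly once predicates and further axes are combined.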
participants (2)
- Kasimier Buchcik
- Martijn Faassen