xpathEval fails for large files

Kanch kanchana.senevirathna at gmail.com
Wed Jul 23 01:33:10 EDT 2008


On Jul 23, 2:03 am, Stefan Behnel <stefan... at behnel.de> wrote:
> Fredrik Lundh wrote:
> > Kanchana wrote:
>
> >> I tried to extract some data with xpathEval. The path matches more
> >> than 100,000 elements.
>
> >> doc = libxml2.parseFile("test.xml")
> >> ctxt = doc.xpathNewContext()
> >> result = ctxt.xpathEval('//src_ref/@editions')
> >> doc.freeDoc()
> >> ctxt.xpathFreeContext()
>
> >> this gets stuck on the following line and results in high CPU
> >> usage:
> >> result = ctxt.xpathEval('//src_ref/@editions')
>
> >> Any suggestions to resolve this.
>
> > what happens if you just search for "//src_ref"?  what happens if you
> > use libxml's command line tools to do the same search?
>
> >> Is there any better alternative to handle large documents?
>
> > the raw libxml2 API is pretty hopeless; there's a much nicer binding
> > called lxml:
>
> >    http://codespeak.net/lxml/
>
> > but that won't help if the problem is with libxml2 itself
>
> It may still help a bit as lxml's setup of libxml2 is pretty memory friendly
> and hand-tuned in a lot of places. But it's definitely worth trying with both
> cElementTree and lxml to see what works better for you. Depending on your
> data, this may be fastest in lxml 2.1:
>
>     doc = lxml.etree.parse("test.xml")
>     for el in doc.iter("src_ref"):
>         attrval = el.get("editions")
>         if attrval is not None:
>             # do something
>
> Stefan
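
For reference, here is a self-contained version of Stefan's sketch, with
the import added and the placeholder comment replaced by collecting the
values into a list (the element and attribute names are taken from the
original post):

    import lxml.etree

    editions = []
    doc = lxml.etree.parse("test.xml")
    for el in doc.iter("src_ref"):    # walk only the <src_ref> elements
        attrval = el.get("editions")  # None when the attribute is absent
        if attrval is not None:
            editions.append(attrval)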

The original file was 18MB and contained 288328 matching attributes for
that path. I wonder whether the for loop will cause a problem when
iterating 288328 times.
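
A plain Python for loop over 288328 elements should not itself be a
problem; the expensive part is building and holding the complete XPath
result set in memory at once. If memory turns out to be the bottleneck,
a streaming sketch along the lines Stefan suggests, using cElementTree's
iterparse (and assuming the same document layout), avoids holding the
fully built tree in memory:

    import xml.etree.cElementTree as ET

    editions = []
    # iterparse() streams the file, firing an "end" event per element,
    # instead of building the full 18MB tree before the search starts
    for _event, el in ET.iterparse("test.xml"):
        if el.tag == "src_ref":
            attrval = el.get("editions")
            if attrval is not None:
                editions.append(attrval)
        el.clear()  # drop the element's text and children once handled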


