Hello everyone,

I'm working on extracting text from Wikipedia, and I've run into a problem parsing a big file (500 GB) with iterparse(). I illustrate the problem with an example in the attached files: "tmp.xml" is the input corpus, "t.xml" is the result, and "extraction.py" is the Python program, shown below:

#! /usr/bin/python
# -*- coding:utf-8 -*-
from lxml import etree

ns = "http://www.mediawiki.org/xml/export-0.8/"
node_find = "{%s}%s" % (ns, "ns")
f1 = "tmp.xml"

tree = etree.iterparse(f1, events=("end",), tag=node_find)
for event, elem in tree:
    if elem.text == "0":
        print etree.tostring(elem.getparent(), encoding="utf-8",
                             pretty_print=True, xml_declaration=True)
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]
    break

With this script, I look for each <ns> tag, and if the text of <ns> is "0", I print back the whole node from its parent. But I don't get all the nodes: look at "t.xml", I only have 15 <revision> elements, which is very strange. I don't know what to do; can somebody help?

Thank you in advance!
Kun JIN
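For reference, here is a minimal, self-contained sketch of the incremental-parsing pattern being attempted. It uses the stdlib xml.etree.ElementTree so it runs without lxml (lxml.etree.iterparse works the same way and additionally accepts a tag= filter), and it listens for the end event of each complete <page> rather than of <ns>, so that every <revision> child has already been parsed before the element is serialized. The tiny inline corpus and its contents are made-up stand-ins for the real dump:

```python
import io
import xml.etree.ElementTree as ET

NS = "http://www.mediawiki.org/xml/export-0.8/"
PAGE_TAG = "{%s}page" % NS
NS_TAG = "{%s}ns" % NS

# Tiny stand-in corpus (the real dump is hundreds of GB on disk).
xml_data = (
    '<mediawiki xmlns="%s">'
    '<page><title>A</title><ns>0</ns><revision/><revision/></page>'
    '<page><title>B</title><ns>1</ns><revision/></page>'
    '</mediawiki>' % NS
)

kept = []
# Iterate on the *end* of each element; only act when a whole <page>
# has been parsed, so all of its <revision> children are present.
for event, elem in ET.iterparse(io.StringIO(xml_data), events=("end",)):
    if elem.tag != PAGE_TAG:
        continue
    if elem.findtext(NS_TAG) == "0":
        kept.append(ET.tostring(elem, encoding="unicode"))
    elem.clear()  # free the subtree we have just handled

print(len(kept))  # number of pages whose <ns> text is "0"
```

Serializing the parent at the end of <ns> instead, as in the script above, captures only whatever siblings happen to sit in the parser's current read buffer, which is why the output contains an arbitrary number of <revision> elements.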