Hello everyone,

I'm working on extracting text from Wikipedia, and I've run into a problem parsing a big file (500 GB) with iterparse(). I illustrate the problem with an example in the attached files: "tmp.xml" is the input corpus, "t.xml" is the result, and "extraction.py" is the Python program, shown below:

#! /usr/bin/python
# -*- coding:utf-8 -*-
from lxml import etree

ns = "http://www.mediawiki.org/xml/export-0.8/"
node_find = "{%s}%s" % (ns, "ns")
f1 = "tmp.xml"

tree = etree.iterparse(f1, events=("end",), tag=node_find)
for event, elem in tree:
    if elem.text == "0":
        print etree.tostring(elem.getparent(), encoding="utf-8",
                             pretty_print=True, xml_declaration=True)
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]
    break

With this script, I look for each <ns> tag, and if the text of <ns> is "0", I print back the whole node from its parent. But I don't get all the nodes: look at "t.xml", I only have 15 <revision> elements, which is very strange. I don't know what to do; can somebody help?

Thank you in advance!
Kun JIN
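For reference, here is a minimal, self-contained sketch of the incremental-parsing pattern being attempted. It uses the stdlib xml.etree.ElementTree so it runs without lxml (lxml.etree.iterparse works the same way and additionally accepts a tag= filter), and it listens for the end event of each complete <page> rather than of <ns>, so that every <revision> child has already been parsed before the element is serialized. The tiny inline corpus and its contents are made-up stand-ins for the real dump:

```python
import io
import xml.etree.ElementTree as ET

NS = "http://www.mediawiki.org/xml/export-0.8/"
PAGE_TAG = "{%s}page" % NS
NS_TAG = "{%s}ns" % NS

# Tiny stand-in corpus (the real dump is hundreds of GB on disk).
xml_data = (
    '<mediawiki xmlns="%s">'
    '<page><title>A</title><ns>0</ns><revision/><revision/></page>'
    '<page><title>B</title><ns>1</ns><revision/></page>'
    '</mediawiki>' % NS
)

kept = []
# Iterate on the *end* of each element; only act when a whole <page>
# has been parsed, so all of its <revision> children are present.
for event, elem in ET.iterparse(io.StringIO(xml_data), events=("end",)):
    if elem.tag != PAGE_TAG:
        continue
    if elem.findtext(NS_TAG) == "0":
        kept.append(ET.tostring(elem, encoding="unicode"))
    elem.clear()  # free the subtree we have just handled

print(len(kept))  # number of pages whose <ns> text is "0"
```

Serializing the parent at the end of <ns> instead, as in the script above, captures only whatever siblings happen to sit in the parser's current read buffer, which is why the output contains an arbitrary number of <revision> elements.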