[Tutor] xml parsing from xml

Danny Yoo dyoo at hashcollision.org
Wed May 7 22:39:51 CEST 2014


On Wed, May 7, 2014 at 1:26 PM, jitendra gupta <jitu.icfai at gmail.com> wrote:

> I cant use etree/SAX because there we cant get complete line , of course we
> can get it by tag name but we are not sure about tag also.  Only we know
> what ever child of <country> we need to put in new file with country name.


Why can't you use such an approach here?  You're dealing with
structured data: there's no concept of "line" in XML, so I don't know
what you mean.  You can keep an intermediate state of the events
you've seen.  At some point, after you encounter the end element of a
particular country, you'll have seen the information you need to
determine which file the country should go to.

The pseudocode would be something like:

################################################################
read events up to beginning of data
buffer = []
while there are still events:
    collect events up to country end into the "buffer"
    decide what file it goes to, and replay the "buffer" into the
    appropriate file
    clear "buffer"
################################################################
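
As a rough sketch of that pseudocode in Python with xml.sax: I'm
assuming here that each record looks like <country name="...">, which
I can't know from your description.  The CountrySplitter class and the
in-memory "outputs" dict (standing in for real files) are names I made
up for illustration.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class CountrySplitter(xml.sax.ContentHandler):
    """Buffer SAX events per <country>, then replay them into a sink."""

    def __init__(self):
        super().__init__()
        self.buffer = []    # events seen since the current <country> began
        self.depth = 0      # > 0 while we are inside a <country> element
        self.outputs = {}   # country name -> XML text (stand-in for files)

    def startElement(self, name, attrs):
        if name == "country" or self.depth:
            self.depth += 1
            self.buffer.append(("start", name, dict(attrs.items())))

    def characters(self, content):
        if self.depth:
            self.buffer.append(("text", content, None))

    def endElement(self, name):
        if not self.depth:
            return          # end tag outside any <country>: ignore it
        self.buffer.append(("end", name, None))
        self.depth -= 1
        if self.depth == 0:
            self.replay()
            self.buffer = []   # clear the buffer for the next country

    def replay(self):
        # Assumption: the country name is in a "name" attribute.
        country = self.buffer[0][2].get("name", "unknown")
        sink = io.StringIO()   # a real program would open a file here
        writer = XMLGenerator(sink, encoding="utf-8")
        for kind, value, attrs in self.buffer:
            if kind == "start":
                writer.startElement(value, attrs)
            elif kind == "text":
                writer.characters(value)
            else:
                writer.endElement(value)
        self.outputs[country] = sink.getvalue()


SAMPLE = (b'<data><country name="A"><rank>1</rank></country>'
          b'<country name="B"><rank>2</rank></country></data>')
handler = CountrySplitter()
xml.sax.parseString(SAMPLE, handler)
```

For a large input you would call xml.sax.parse() on the file instead
of parseString(), so that only one country's events are buffered at a
time.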

If you don't want to deal with the event-driven approach that SAX
emphasizes, you may still be able to do this problem with an XML-Pull
parser.  You mention that your input is hundreds of megabytes long, in
which case you probably really do need to be careful about memory
consumption.  See:

    https://wiki.python.org/moin/PullDom

for an example that filters subtrees.  You should be able to quickly
adapt that example to redirect elements based on whatever criteria you
decide.
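
A minimal sketch of the pull-parser version, using xml.dom.pulldom
from the standard library.  Again I'm assuming a <country name="...">
layout, and split_by_country is just a name I picked; it is not taken
from the wiki page.

```python
import io
from xml.dom import pulldom


def split_by_country(source):
    """Yield (country_name, xml_text) pairs, one subtree at a time.

    `source` is a filename or a file-like object opened in binary mode.
    """
    events = pulldom.parse(source)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "country":
            # Pull only this country's subtree into a small DOM node.
            events.expandNode(node)
            # Assumption: the country name is in a "name" attribute.
            yield node.getAttribute("name"), node.toxml()


SAMPLE = (b'<data><country name="A"><rank>1</rank></country>'
          b'<country name="B"><rank>2</rank></country></data>')
for name, xml_text in split_by_country(io.BytesIO(SAMPLE)):
    # A real program would write xml_text to a file chosen from `name`.
    print(name, xml_text)
```

Each expanded subtree is a separate small node, so you only need to
hold one country in memory at a time while you decide where it goes.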

