Trying to parse a HUGE (1 GB) XML file

John Nagle nagle at animats.com
Wed Dec 22 17:28:31 EST 2010


On 12/20/2010 12:33 PM, Adam Tauno Williams wrote:
> On Mon, 2010-12-20 at 12:29 -0800, spaceman-spiff wrote:
>> I need to detect them & then, for each one, copy all the
>> content between the element's start & end tags & create a
>> smaller XML file.
>
> Yep, do that a lot; via iterparse.
>
>> 1. Can you point me to some examples/samples of using SAX,
>> especially ones dealing with really large XML files?

    I've just subclassed HTMLParser for this.  It's slow, but
100% Python.  Using the SAX parser is essentially equivalent.
I'm processing multi-gigabyte XML files and updating a MySQL
database, so I do need to look at all the entries, but don't
need a parse tree of the XML.
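The streaming approach described above can be sketched with the standard
library's xml.sax module: a ContentHandler subclass sees each element as it
goes by, so nothing accumulates in memory.  The <entries>/<entry>/<name>
element names here are hypothetical, just to make the sketch runnable.

```python
import xml.sax

class NameCollector(xml.sax.ContentHandler):
    """Collect the text of every <name> element as it streams past."""
    def __init__(self):
        super().__init__()
        self.in_name = False
        self.names = []

    def startElement(self, tag, attrs):
        if tag == "name":
            self.in_name = True
            self.names.append("")

    def characters(self, data):
        # characters() may fire several times per element, so append
        if self.in_name:
            self.names[-1] += data

    def endElement(self, tag):
        if tag == "name":
            self.in_name = False

xml_data = (b"<entries><entry><name>alpha</name></entry>"
            b"<entry><name>beta</name></entry></entries>")
handler = NameCollector()
xml.sax.parseString(xml_data, handler)
print(handler.names)  # ['alpha', 'beta']
```

In a real multi-gigabyte job you would pass a filename to xml.sax.parse()
instead of a byte string, and do the database update inside endElement().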

> SAX is equivalent to iterparse (iterparse is, essentially, a way to do
> SAX-like processing).

    Iterparse does try to build a tree, although you can discard the
parts you don't want.  If you can't decide whether a part of the XML
is of interest until you're deep into it, an "iterparse" approach
may result in a big junk tree.  You have to keep clearing the "root"
element to discard that.
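The clear-the-root pattern described above can be sketched like this: grab a
reference to the root from the first "start" event, process each record on
its "end" event, then clear the root so already-handled children don't
accumulate into that big junk tree.  The <records>/<record> names are
hypothetical.

```python
import io
import xml.etree.ElementTree as ET

xml_data = b"<records><record id='1'/><record id='2'/><record id='3'/></records>"

seen = []
context = ET.iterparse(io.BytesIO(xml_data), events=("start", "end"))
event, root = next(context)          # first event is the start of the root
for event, elem in context:
    if event == "end" and elem.tag == "record":
        seen.append(elem.get("id"))
        root.clear()                 # discard children already processed
print(seen)  # ['1', '2', '3']
```

Without the root.clear() call, every processed <record> would stay attached
to the root element, and memory use would grow with the file.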

> I provided an iterparse example already. See the Read_Rows method in
> <http://coils.hg.sourceforge.net/hgweb/coils/coils/file/62335a211fda/src/coils/foundation/standard_xml.py>

    I don't quite see the point of creating a class with only static 
methods.  That's basically a verbose way to create a module.
>
>> 2. This brings me to another question, which I forgot to ask in my OP (original post).
>> Is simply opening the file & using a regex to look for the element I need a *good* approach?
>
> No.

    If the XML file has a very predictable structure, that may not be
a bad idea.  It's not very general, but if you have some XML file
that's basically fixed format records using XML to delimit the
fields, pounding on the thing with a regular expression is simple
and fast.
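The regex shortcut described above might look like the sketch below.  It is
only sensible when the file really is fixed-format, one record per line,
with no nesting, CDATA, entities, or attribute surprises; the <row>/<id>/
<name> element names are hypothetical.

```python
import re

# One "record" as it might appear on a single line of a fixed-format file.
line = "<row><id>42</id><name>widget</name></row>"

# Pound on it with a regular expression instead of parsing.
match = re.search(r"<id>(\d+)</id><name>([^<]*)</name>", line)
if match:
    record_id, name = match.group(1), match.group(2)
print(record_id, name)  # 42 widget
```

In practice you would loop over the file line by line, applying the same
pattern to each record, which is why this can be much faster than a real
parser when the structure allows it.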
			
					John Nagle





More information about the Python-list mailing list