Trying to parse a HUGE(1gb) xml file
John Nagle
nagle at animats.com
Wed Dec 22 17:28:31 EST 2010
On 12/20/2010 12:33 PM, Adam Tauno Williams wrote:
> On Mon, 2010-12-20 at 12:29 -0800, spaceman-spiff wrote:
>> I need to detect them & then, for each one, copy all the
>> content between the element's start & end tags & create a
>> smaller XML file.
>
> Yep, do that a lot; via iterparse.
>
>> 1. Can you point me to some examples/samples of using SAX,
>> especially , ones dealing with really large XML files.
I've just subclassed HTMLParser for this. It's slow, but
100% Python. Using the SAX parser is essentially equivalent.
I'm processing multi-gigabyte XML files and updating a MySQL
database, so I do need to look at all the entries, but don't
need a parse tree of the XML.
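The streaming idea above can be sketched with the standard library's
SAX interface: a handler gets callbacks as tags open and close, so you
can process each record and discard it without ever holding a tree.
The element name "record" and the sample document here are made up
for illustration; a real handler would write each finished record to
the database instead of just counting.

```python
# Minimal SAX sketch: count <record> elements and gather their text
# as the stream goes by, keeping no tree in memory.
import xml.sax

class RecordHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.count = 0
        self.in_record = False
        self.text = []

    def startElement(self, name, attrs):
        if name == "record":
            self.in_record = True
            self.text = []

    def characters(self, content):
        if self.in_record:
            self.text.append(content)

    def endElement(self, name):
        if name == "record":
            self.count += 1
            self.in_record = False
            # A real handler would update MySQL here, then forget
            # the record -- nothing accumulates between records.

sample = b"<db><record>a</record><record>b</record></db>"
handler = RecordHandler()
xml.sax.parseString(sample, handler)
print(handler.count)  # 2
```

For a multi-gigabyte file you would pass the filename (or an open
file object) to xml.sax.parse() instead of parseString().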
> SAX is equivalent to iterparse (iterparse is a way to, essentially, do
> SAX-like processing).
Iterparse does try to build a tree, although you can discard the
parts you don't want. If you can't decide whether a part of the XML
is of interest until you're deep into it, an "iterparse" approach
may result in a big junk tree. You have to keep clearing the "root"
element to discard that.
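The root-clearing trick reads roughly like this sketch. The element
name "record" and the inline sample are illustrative; with a huge file
you would pass a filename to iterparse() and the tree would stay tiny
because finished children are cleared out of the root as you go.

```python
# iterparse with root-clearing: process each completed element,
# then clear the root so the partial tree never grows.
import xml.etree.ElementTree as ET
from io import BytesIO

sample = b"<db><record>a</record><record>b</record></db>"

count = 0
context = ET.iterparse(BytesIO(sample), events=("start", "end"))
event, root = next(context)           # first event yields the root element
for event, elem in context:
    if event == "end" and elem.tag == "record":
        count += 1                    # process the finished element here
        root.clear()                  # drop completed children from the tree
print(count)  # 2
```

Skipping the root.clear() call is exactly what produces the "big junk
tree": every parsed element stays attached to the root until the end.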
> I provided an iterparse example already. See the Read_Rows method in
> <http://coils.hg.sourceforge.net/hgweb/coils/coils/file/62335a211fda/src/coils/foundation/standard_xml.py>
I don't quite see the point of creating a class with only static
methods. That's basically a verbose way to create a module.
>
>> 2. This brings me to another q. which I forgot to ask in my OP (original post).
>> Is simply opening the file & using a regex to look for the element I need a *good* approach?
>
> No.
If the XML file has a very predictable structure, that may not be
a bad idea. It's not very general, but if you have some XML file
that's basically fixed format records using XML to delimit the
fields, pounding on the thing with a regular expression is simple
and fast.
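For a truly fixed-format file, the regex shortcut is about this much
code. The tag name "item" and the sample string are made up; note that
this breaks on nested elements, CDATA sections, or tags with varying
attributes -- it only holds up when the structure really is rigid.

```python
# Regex shortcut for rigidly structured XML: pull out every
# <item>...</item> span with a non-greedy match.
import re

data = "<list><item>foo</item><item>bar</item></list>"
items = re.findall(r"<item>(.*?)</item>", data, re.DOTALL)
print(items)  # ['foo', 'bar']
```

On a 1 GB file you would apply the pattern chunk by chunk rather than
reading the whole file into memory.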
John Nagle