Trying to parse a HUGE(1gb) xml file
Alan Meyer
ameyer2 at yahoo.com
Mon Dec 27 15:40:32 EST 2010
On 12/21/2010 3:16 AM, Stefan Behnel wrote:
> Adam Tauno Williams, 20.12.2010 20:49:
...
>> You need to process the document as a stream of elements; aka SAX.
>
> IMHO, this is the worst advice you can give.
Why do you say that? I would have thought that using SAX in this
application is an excellent idea.
I agree that for applications for which performance is not a problem,
and for which we need to examine more than one or a few element types, a
tree implementation is more functional, less programmer-intensive, and
provides an easier-to-understand approach to the data. But with huge
amounts of data where performance is a problem, SAX will be far more
practical. In the special case where only a few elements are of
interest in a complex tree, SAX can sometimes also be more natural and
easy to use.
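To illustrate the "only a few elements are of interest" case, here is a minimal sketch using the standard library's xml.sax module. The element name "title" and the sample document are made up for the example; the O.P.'s real element names are unknown. The handler ignores everything except the text inside the one element type it cares about, so memory use does not grow with the depth or size of the surrounding tree.

```python
import io
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Collect the text of every <title> element and ignore everything else.

    "title" is a hypothetical element name chosen for this example.
    """

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buf = None  # None while we are outside a <title> element

    def startElement(self, name, attrs):
        if name == "title":
            self._buf = []

    def characters(self, content):
        # The parser may deliver one text node in several pieces.
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == "title" and self._buf is not None:
            self.titles.append("".join(self._buf))
            self._buf = None


def collect_titles(source):
    """Parse a filename or binary file object and return the collected titles."""
    handler = TitleCollector()
    xml.sax.parse(source, handler)
    return handler.titles


# Made-up sample data standing in for the large file.
sample = (
    b"<records>"
    b"<record><title>First</title><body>x</body></record>"
    b"<record><title>Second</title></record>"
    b"</records>"
)
titles = collect_titles(io.BytesIO(sample))
```

For a file of the size under discussion, the handler would more likely process each value in endElement as it arrives instead of accumulating all of them in a list.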
SAX might also be more natural for this application. The O.P. could
tell us for sure, but I wonder if perhaps his 1 GB XML file is NOT a
true single record. You can store an entire text encyclopedia in less
than one GB. What he may have is a large number of logically distinct
individual records of some kind, each stored as a node in an
all-encompassing element wrapper. Building a tree for each record could
make sense but, if I'm right about the nature of the data, building a
tree for the wrapper gives very little return for the high cost.
If that's so, then I'd recommend one of two approaches:
1. Use SAX, or
2. Parse out individual logical records using string manipulation on an
input stream, then build a tree for one individual record in memory
using one of the DOM or ElementTree implementations. After each record
is processed, discard its tree and start on the next record.
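The second approach could look roughly like the sketch below. It rests on assumptions about data nobody here has seen: the record element is called "record" (a made-up name), records are not nested inside one another, the literal closing tag never appears inside a comment or CDATA section, the opening tag is written as "<record>" or "<record " (no self-closing form, no newline right after the name), and each record is well-formed by itself (no namespace prefixes or entities declared only on the wrapper). The function name iter_records is also invented for the example.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(stream, tag="record", chunk_size=64 * 1024):
    """Yield one parsed Element per <tag>...</tag> block in a text stream.

    Only one record's text and tree are held at a time; the wrapper
    element is never parsed. See the assumptions listed above.
    """
    open_plain = "<%s>" % tag   # opening tag without attributes
    open_attrs = "<%s " % tag   # opening tag followed by attributes
    close = "</%s>" % tag
    buf = ""
    while True:
        chunk = stream.read(chunk_size)
        buf += chunk
        while True:
            starts = [i for i in (buf.find(open_plain), buf.find(open_attrs))
                      if i != -1]
            if not starts:
                # Keep a short tail in case an opening tag is split
                # across two chunks.
                buf = buf[-len(open_plain):]
                break
            start = min(starts)
            end = buf.find(close, start)
            if end == -1:
                # Record is incomplete; keep it and read more input.
                buf = buf[start:]
                break
            end += len(close)
            yield ET.fromstring(buf[start:end])
            buf = buf[end:]  # discard the text of the finished record
        if not chunk:  # end of input
            break


# Made-up sample data; a tiny chunk size forces tags to straddle chunks.
sample = (
    "<records>"
    "<record id='1'><name>a</name></record>"
    "<record><name>b</name></record>"
    "</records>"
)
names = [rec.findtext("name")
         for rec in iter_records(io.StringIO(sample), chunk_size=7)]
```

With a real file the stream would be something like open(path, encoding="utf-8"); reading in text mode keeps multi-byte characters from being split across chunks. Each yielded tree becomes garbage as soon as the caller drops its reference to it.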
Alan