Trying to parse a HUGE(1gb) xml file

Alan Meyer ameyer2 at yahoo.com
Mon Dec 27 19:29:50 EST 2010


On 12/27/2010 4:55 PM, Stefan Behnel wrote:
...
>  From my experience, SAX is only practical for very simple cases where
> little state is involved when extracting information from the parse
> events. A typical example is gathering statistics based on single tags -
> not a very common use case. Anything that involves knowing where in the
> XML tree you are to figure out what to do with the event is already too
> complicated. The main drawback of SAX is that the callbacks run into
> separate method calls, so you have to do all the state keeping manually
> through fields of the SAX handler instance.
>
> My serious advice is: don't waste your time learning SAX. It's simply
> too frustrating to debug SAX extraction code into existence. Given how
> simple and fast it is to extract data with ElementTree's iterparse() in
> a memory efficient way, there is really no reason to write complicated
> SAX code instead.
>
> Stefan
>

I confess that I hadn't been thinking about iterparse().  I presume that 
clear() is required with iterparse() if we're going to process files of 
arbitrary length.

I should think that this approach provides an intermediate solution. 
It's more work than building the full tree in memory because the 
programmer has to do some additional housekeeping to call clear() at the 
right time and place.  But it's less housekeeping than SAX.
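
For illustration, here is a minimal sketch of the iterparse() plus clear()
housekeeping described above. It uses the standard library's
xml.etree.ElementTree rather than lxml, and the helper name iter_records
and the tag name "rec" are made up for the example. It also assumes that
the records of interest are not nested inside one another.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag):
    """Yield each element named `tag`, then release its memory.

    Hypothetical helper for illustration; `source` is a filename or a
    binary file object, as accepted by ET.iterparse().
    """
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the start of the root element; keep a handle on it.
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            # Drop the finished record's text, attributes and children.
            elem.clear()
            # Also drop the root's references to already-processed children,
            # otherwise the emptied elements accumulate under the root.
            root.clear()


if __name__ == "__main__":
    sample = b"<root><rec id='1'>a</rec><rec id='2'>b</rec></root>"
    for rec in iter_records(io.BytesIO(sample), "rec"):
        print(rec.get("id"), rec.text)
```

The caller has to finish with each element before asking for the next one,
since the element is emptied as soon as the loop resumes.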

I guess I've done enough SAX, in enough different languages, that I 
don't find it that onerous to use.  When I need an element stack to keep 
track of things I can usually re-use code I've written for other 
applications.  But for a programmer who doesn't do a lot of this stuff, 
I agree, the learning curve with lxml will be shorter and the 
programming and debugging can be faster.
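
By way of comparison, the kind of reusable element stack I mean looks
roughly like the sketch below. It uses the standard library's xml.sax; the
class name PathHandler and the sample element names are invented for the
example, and it assumes the target elements contain only text.

```python
import xml.sax


class PathHandler(xml.sax.ContentHandler):
    """Collect the text of elements found at one exact path in the tree.

    Hypothetical handler for illustration; `target_path` is a list of
    element names from the root down, e.g. ["catalog", "book", "title"].
    """

    def __init__(self, target_path):
        super().__init__()
        self.target_path = list(target_path)
        self.stack = []      # names of the currently open elements
        self.buffer = []     # text pieces of the current target element
        self.results = []    # collected text, one entry per target element

    def startElement(self, name, attrs):
        self.stack.append(name)
        if self.stack == self.target_path:
            self.buffer = []

    def characters(self, content):
        # The parser may deliver text in several chunks, so accumulate them.
        if self.stack == self.target_path:
            self.buffer.append(content)

    def endElement(self, name):
        if self.stack == self.target_path:
            self.results.append("".join(self.buffer))
        self.stack.pop()


if __name__ == "__main__":
    sample = b"<catalog><book><title>A</title></book></catalog>"
    handler = PathHandler(["catalog", "book", "title"])
    xml.sax.parseString(sample, handler)
    print(handler.results)
```

All of the state lives in fields of the handler instance, which is the
manual bookkeeping Stefan is talking about.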

     Alan


