Trying to parse a HUGE(1gb) xml file

Stefan Behnel stefan_ml at behnel.de
Mon Dec 27 16:55:38 EST 2010


Alan Meyer, 27.12.2010 21:40:
> On 12/21/2010 3:16 AM, Stefan Behnel wrote:
>> Adam Tauno Williams, 20.12.2010 20:49:
> ...
>>> You need to process the document as a stream of elements; aka SAX.
>>
>> IMHO, this is the worst advice you can give.
>
> Why do you say that? I would have thought that using SAX in this
> application is an excellent idea.

In my experience, SAX is only practical for very simple cases where
little state is involved when extracting information from the parse events.
A typical example is gathering statistics based on single tags - not a very
common use case. Anything that requires knowing where you are in the XML
tree in order to decide what to do with an event is already too complicated.
The main drawback of SAX is that the callbacks are separate method calls,
so you have to do all the state keeping manually through fields of the SAX
handler instance.
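
To make the state-keeping problem concrete, here is a minimal sketch of a
SAX handler that only collects the text of <title> elements inside <record>
elements. The tag names, the TitleHandler class and the sample document are
made up for illustration; they are not from the original poster's file.

```python
import io
import xml.sax

class TitleHandler(xml.sax.ContentHandler):
    """Collect the text of every <title> found inside a <record>."""

    def __init__(self):
        super().__init__()
        # All parse state lives in instance fields, because each event
        # arrives in a separate method call.
        self.in_record = False
        self.in_title = False
        self.buffer = []
        self.titles = []

    def startElement(self, name, attrs):
        if name == "record":
            self.in_record = True
        elif name == "title" and self.in_record:
            self.in_title = True
            self.buffer = []

    def characters(self, content):
        # Text may be delivered in several chunks, so accumulate it.
        if self.in_title:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "title" and self.in_title:
            self.titles.append("".join(self.buffer))
            self.in_title = False
        elif name == "record":
            self.in_record = False

# Hypothetical sample input standing in for the real file.
sample = io.BytesIO(
    b"<records>"
    b"<record><title>first</title></record>"
    b"<record><title>second</title></record>"
    b"</records>"
)
handler = TitleHandler()
xml.sax.parse(sample, handler)
```

Two flags and a buffer are already needed for this trivial case; every
additional level of nesting you care about adds more fields to keep in sync.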

My serious advice is: don't waste your time learning SAX. It's simply too
frustrating to debug SAX extraction code into existence. Given how simple
and fast it is to extract data with ElementTree's iterparse() in a
memory-efficient way, there is really no reason to write complicated SAX
code instead.
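
For comparison, a rough sketch of the iterparse() approach. It assumes the
interesting data sits in <record> elements that are direct children of the
root element; the tag names, the iter_titles() helper and the sample
document are hypothetical and only serve as illustration.

```python
import io
import xml.etree.ElementTree as ET

def iter_titles(source, tag="record"):
    """Yield the <title> text of each matching element, one at a time."""
    # Ask for "start" events as well, so that the very first event
    # hands us the root element.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            # The element is complete here, so normal tree access works.
            yield elem.findtext("title")
            # Drop the already processed children from the root so the
            # tree does not grow with the size of the file.
            root.clear()

# Hypothetical sample input standing in for the real file.
sample = io.BytesIO(
    b"<records>"
    b"<record><title>first</title></record>"
    b"<record><title>second</title></record>"
    b"</records>"
)
titles = list(iter_titles(sample))
```

Each record is a complete little subtree by the time its "end" event
arrives, so there is no state to track by hand, and clearing the root after
each record keeps memory use roughly constant.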

Stefan



