Efficient incremental parsing using etree.iterparse
Hello, I need to process very large XML files as quickly as possible. The XML processing does not require processing of every single tag, so I was looking at the iterparse method. Unfortunately, the iterparse method only allows one tag name to be specified for triggering events, while I need to do processing on two or three different tags. This would still be much more efficient than using the target parser method, because the XML data contains many more tags that do not require immediate processing. So, it looks like I need something in between processing *all* tags and processing a single tag. Is there any way to do that? Thanks for any hints!
On .11.2014 at 11:47, D.H.J. Takken <d.h.j.takken@xs4all.nl> wrote:
I need to process very large XML files as quickly as possible. The XML processing does not require processing of every single tag, so I was looking at the iterparse method.
Unfortunately, the iterparse method only allows one tag name to be specified for triggering events, while I need to do processing on two or three different tags. This would still be much more efficient than using the target parser method, because the XML data contains many more tags that do not require immediate processing.
As noted elsewhere, you can pass in a list of tags to do this. However, when running benchmarks in openpyxl we discovered that for pure parsing xml.etree.cElementTree can be *significantly* faster than lxml: 2 to 3 times in our experience. I discussed this with Stefan and he said it's largely down to the different C libraries – you pay a penalty for the richer interface of libxml2. Source is at https://bitbucket.org/openpyxl/openpyxl/src/c03b4cdc48b077f60abd329b0766148a... if you're interested. We do pass in a list of tags but the compatibility layer ignores them. Charlie
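A minimal sketch of the multi-tag approach mentioned above, assuming lxml may or may not be installed. With lxml, the `tag` argument of `iterparse` is given several names; without it, the standard library parser is used and the filtering happens in Python. The names `iter_selected` and `SAMPLE` are made up for illustration, and the fallback is not openpyxl's actual compatibility layer.

```python
import io
import xml.etree.ElementTree as ET

try:
    from lxml import etree as LET  # optional third-party dependency
except ImportError:
    LET = None


def iter_selected(source, tags):
    """Yield each completed element whose tag is in `tags` (hypothetical helper)."""
    tags = frozenset(tags)
    if LET is not None:
        # lxml filters by tag inside iterparse itself.
        events = LET.iterparse(source, events=("end",), tag=tuple(tags))
    else:
        # The standard library has no tag filter, so filter in Python.
        events = (
            (event, elem)
            for event, elem in ET.iterparse(source, events=("end",))
            if elem.tag in tags
        )
    for _event, elem in events:
        yield elem
        # Drop text and children once the caller has finished with the element.
        elem.clear()


SAMPLE = b"<root><a>1</a><skip>x</skip><b>2</b><skip>y</skip><a>3</a></root>"
found = [(elem.tag, elem.text) for elem in iter_selected(io.BytesIO(SAMPLE), ["a", "b"])]
print(found)
```

The caller has to read what it needs from each element before asking for the next one, because the element is cleared as soon as the generator resumes.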
Charlie Clark wrote on 21.11.2014 at 20:31:
running benchmarks in openpyxl we discovered that for pure parsing xml.etree.cElementTree can be *significantly* faster than lxml: 2 to 3 times in our experience. I discussed this with Stefan and he said it's largely down to the different C libraries – you pay a penalty for the richer interface of libxml2.
That's most of the truth, but not all of it. The current iterparse implementation in lxml also isn't optimal. It builds the Python tree events at the same time as the C tree (using the SAX interface in libxml2), so it acquires and releases the GIL too often and ends up doing a lot of different things in the main parse loop. This could be split so that libxml2 could first grow its C tree in memory from the data chunks it parses, and only when that's done, and several C tree nodes were constructed, the GIL would be acquired and the Python tree events created by traversing the newly added parts of the C tree in one go. It's still unlikely that rewriting this will get you all the performance of cET (which only builds a single (Python) tree and its events in one go, with the GIL held all the time), but it would at least reduce the penalty. Stefan
On 11/21/2014 08:31 PM, Charlie Clark wrote:
As noted elsewhere, you can pass in a list of tags to do this. However, when running benchmarks in openpyxl we discovered that for pure parsing xml.etree.cElementTree can be *significantly* faster than lxml: 2 to 3 times in our experience. I discussed this with Stefan and he said it's largely down to the different C libraries – you pay a penalty for the richer interface of libxml2.
Is cET still faster when every single tag is yielded by the iterator? The cET iterparse implementation does not appear to feature tag filtering, so I need to use Python if statements to do the filtering myself. I can imagine that this pretty much defeats the performance advantage...
On .11.2014 at 09:51, D.H.J. Takken <d.h.j.takken@xs4all.nl> wrote:
Is cET still faster when every single tag is yielded by the iterator? The cET iterparse implementation does not appear to feature tag filtering, so I need to use Python if statements to do the filtering myself. I can imagine that this pretty much defeats the performance advantage...
Don't imagine, benchmark. Based on my own tests, cET is still significantly faster even with the filtering done in Python. This is backed up by what Stefan says about the implementation. Filtering through dictionary dispatch is very fast. Charlie
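The dictionary dispatch mentioned above could look roughly like this. It is only a sketch using the standard library's `xml.etree.ElementTree` (which uses the C accelerator where available on current Python versions); the tag names, the handler functions and the `process` helper are invented for the example.

```python
import io
import xml.etree.ElementTree as ET


def handle_row(elem, totals):
    # Hypothetical handler: count rows.
    totals["rows"] += 1


def handle_cell(elem, totals):
    # Hypothetical handler: sum the integer content of cells.
    totals["cells"] += int(elem.text)


# One dictionary lookup per element replaces a chain of if/elif tests.
HANDLERS = {"row": handle_row, "c": handle_cell}


def process(source):
    totals = {"rows": 0, "cells": 0}
    for _event, elem in ET.iterparse(source, events=("end",)):
        handler = HANDLERS.get(elem.tag)
        if handler is not None:
            handler(elem, totals)
            elem.clear()  # free the content of elements that were handled
    return totals


SAMPLE = b"<sheet><row><c>1</c><c>2</c><meta>x</meta></row><row><c>3</c></row></sheet>"
totals = process(io.BytesIO(SAMPLE))
print(totals)
```

Elements without a handler are skipped after a single failed lookup. In this sketch they are only released when an enclosing handled element is cleared, so a very large file with many unhandled top-level elements would need extra clearing.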
D.H.J. Takken wrote on 24.11.2014 at 09:51:
On 11/21/2014 08:31 PM, Charlie Clark wrote:
As noted elsewhere, you can pass in a list of tags to do this. However, when running benchmarks in openpyxl we discovered that for pure parsing xml.etree.cElementTree can be *significantly* faster than lxml: 2 to 3 times in our experience. I discussed this with Stefan and he said it's largely down to the different C libraries – you pay a penalty for the richer interface of libxml2.
Is cET still faster when every single tag is yielded by the iterator? The cET iterparse implementation does not appear to feature tag filtering, so I need to use Python if statements to do the filtering myself. I can imagine that this pretty much defeats the performance advantage...
You should try it. Tag filtering in lxml can make a difference as it reduces the amount of interpreted code being executed, but behind the scenes, lxml would still instantiate the Element objects (if only for tree memory management), so don't expect wonders. It really depends a lot on your code and the task at hand, including the work you do during each iteration step for actual tree processing. Selective iterparsing is one of the specific cases where optimising lxml's iterparse implementation could show a lot of improvement, though, as it could then reduce the number of instantiated Element objects to what's really requested on the user side. Stefan
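Since both replies come down to "measure it on your own data", here is a rough timing sketch. It assumes a synthetic in-memory document instead of a real file, and it only times lxml if lxml is installed. The document layout, the tag names and all function names are made up, and the timings will differ from those of real data and real per-element work.

```python
import io
import timeit
import xml.etree.ElementTree as ET

WANTED = frozenset({"id", "value"})


def make_document(n_records):
    # Synthetic data: two wanted tags and two unwanted tags per record.
    parts = ["<root>"]
    for i in range(n_records):
        parts.append(
            f"<record><id>{i}</id><noise>x</noise><noise>y</noise>"
            f"<value>{i * 2}</value></record>"
        )
    parts.append("</root>")
    return "".join(parts).encode("utf-8")


def count_stdlib(data):
    # Every element is yielded; the filtering happens in Python.
    count = 0
    for _event, elem in ET.iterparse(io.BytesIO(data), events=("end",)):
        if elem.tag in WANTED:
            count += 1
        elem.clear()
    return count


def count_lxml(data):
    # Only elements with a wanted tag are yielded.
    from lxml import etree
    count = 0
    for _event, elem in etree.iterparse(
        io.BytesIO(data), events=("end",), tag=tuple(WANTED)
    ):
        count += 1
        elem.clear()
    return count


data = make_document(2000)
candidates = {"stdlib, filter in Python": count_stdlib}
try:
    import lxml  # noqa: F401
    candidates["lxml, tag filter"] = count_lxml
except ImportError:
    pass

results = {}
for name, func in candidates.items():
    assert func(data) == 4000  # both variants must see the same elements
    results[name] = min(timeit.repeat(lambda: func(data), number=3, repeat=3))
    print(f"{name}: {results[name]:.4f} s for 3 runs")
```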
participants (3): Charlie Clark, D.H.J. Takken, Stefan Behnel