[C++-sig] [pygccxml] caching fails due to pickling recursion limit

Kirill Lapshin kir at lapshin.net
Fri Jun 29 11:19:27 CEST 2007


Roman Yakovenko wrote:
> May I propose to use another format for cache - gccxml generated files.
> I am serious. I am not kidding. Last released version(0.9) has many
> performance improvements, one of them is parsing XML files. pygccxml
> now uses cElementTree iterparse functionality  as XML parser when it
> available. You can find here the benchmarks
> http://effbot.org/zone/celementtree.htm .

Thanks for a prompt response.

I gave it a try. So far it looks promising, with a few caveats:

1. On Windows, cElementTree is installed to the root of site-packages, 
so I had to modify your import line, i.e. replace:

    import xml.etree.cElementTree as ElementTree

with

    try:
        import xml.etree.cElementTree as ElementTree
    except ImportError:
        import cElementTree as ElementTree

2. cElementTree helps, but not by a wide margin: parsing used to take 
about 8.1 sec and with cElementTree it takes 5.7 sec, which is nowhere 
near the speeds claimed on the cElementTree page. I guess most of the 
time is spent not in xml parsing, but in reader/scanner/whatever.

3. Overall time with the xml cache is comparable to the pickle cache 
(still about 10% slower though), however cold startup (when no cache 
is there) is about 15% faster.

4. xml as a cache is not as robust as the old style cache, meaning that 
there is no logic to refresh the xml file when it gets outdated. Granted, 
in an ideal world pyplusplus shouldn't do it in the first place, as it is 
more of a build system responsibility, but many of us are stuck with less 
than ideal build systems that can't automatically scan .hpp file 
dependency trees.

I would migrate to the xml cache in a heartbeat, given our problems with 
pickle, but build system deficiencies prohibit this move at the moment.

Are there any plans to invalidate the xml cache file automatically 
whenever the source file, or any file it includes, is modified? I can 
try to hack something myself, but if you are planning to work on it 
anyway, I may wait for a proper solution.
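
To make the request concrete, here is a minimal sketch of the kind of 
staleness check I have in mind. It is not pygccxml code: the function name 
cache_is_stale and its arguments are made up for illustration, and how the 
list of included headers is obtained is left open. It only compares file 
modification times.

```python
import os


def cache_is_stale(cache_path, dependency_paths):
    """Return True if the cache file is missing or older than any dependency.

    Hypothetical helper, not part of pygccxml. dependency_paths is assumed
    to hold the source file plus every header it includes.
    """
    if not os.path.exists(cache_path):
        return True
    cache_mtime = os.path.getmtime(cache_path)
    for path in dependency_paths:
        # A dependency that has disappeared also forces regeneration.
        if not os.path.exists(path):
            return True
        if os.path.getmtime(path) > cache_mtime:
            return True
    return False
```

A caller would regenerate the xml file whenever this returns True and reuse 
it otherwise. Modification times are a cheap approximation; a hash of the 
file contents would be more robust but slower.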

Performance stats for various scenarios (listing biggest offenders only):

Note: I'm using py++/pygccxml 0.9 amended a bit to add more performance 
logging and to use cElementTree as described above.
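
For reference, the extra performance logging amounts to wrapping each stage 
in a timer. The sketch below shows the general idea only; the timed helper 
is a hypothetical name, not the actual change made to pygccxml, and it is 
written for a current Python rather than the one used for these 
measurements.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print how long the wrapped block took; optionally record it.

    Hypothetical helper for illustration, not pygccxml's logger.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print("%s - done (%.1f sec)" % (label, elapsed))


if __name__ == "__main__":
    timings = {}
    with timed("parsing xml", timings):
        sum(range(100000))  # stand-in for the real work
```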

1. xml using cache
    parsing xml 5.6 sec
    relinking declared types...
     parsing files - done (7.5 sec)
     setting declarations defaults - done (3.1 sec)
    preparing data structures for query optimizer (4.7 sec)
    --- total 26 sec
    --- total (logging off) 23 sec

2. xml not using cache (cache file has been deleted)
    creating xml 11.2 sec
    parsing xml 5.6 sec
    relinking declared types...
     parsing files - done (18.6 sec)
     setting declarations defaults - done (3.1 sec)
    preparing data structures for query optimizer (4.7 sec)
    --- total 39 sec
    --- total (logging off) 34 sec

    note: the times do not add up! The "parsing files" time looks 
    suspicious; it probably includes the xml parsing time.

3. pickle using cache
    parsing source file 3.1 sec
    relinking declared types...
     parsing files - done (5 sec)
     setting declarations defaults - done (2.7 sec)
    preparing data structures for query optimizer (4.5 sec)
    --- total 25 sec
    --- total (logging off) 21 sec

4. pickle not using cache (cache file has been deleted)
    parsing source file 22.3 sec
    relinking declared types...
     parsing files - done (24.24 sec)
     setting declarations defaults - done (3 sec)
    preparing data structures for query optimizer (4.7 sec)
    --- total 44 sec
    --- total (logging off) 40 sec
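
The "about 10% slower" and "about 15% faster" figures in caveat 3 can be 
reproduced from the "logging off" totals above. The snippet below is just 
that arithmetic; the dictionary keys are labels chosen here for 
readability.

```python
# "total (logging off)" values in seconds, copied from the four scenarios.
totals = {
    "xml_warm": 23,
    "xml_cold": 34,
    "pickle_warm": 21,
    "pickle_cold": 40,
}


def percent_change(new, old):
    """Relative change of new versus old, in percent."""
    return 100.0 * (new - old) / old


warm = percent_change(totals["xml_warm"], totals["pickle_warm"])
cold = percent_change(totals["xml_cold"], totals["pickle_cold"])

print("with cache:    xml is %.1f%% slower than pickle" % warm)
print("without cache: xml is %.1f%% faster than pickle" % -cold)
```

With logging left on, the same comparison gives a smaller gap in both 
cases (26 vs 25 sec and 39 vs 44 sec).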

Kirill


