Tremendous slowdown due to garbage collection

s0suk3 at gmail.com
Wed Apr 30 10:12:47 EDT 2008


On Apr 12, 11:11 am, andreas.eis... at gmail.com wrote:
> I should have been more specific about possible fixes.
>
> > > python2.5 -m timeit 'gc.disable();l=[(i,) for i in range(2000000)]'
>
> > 10 loops, best of 3: 662 msec per loop
>
> > > python2.5 -m timeit 'gc.enable();l=[(i,) for i in range(2000000)]'
>
> > 10 loops, best of 3: 15.2 sec per loop
>
> > In the latter case, the garbage collector apparently consumes about
> > 97% of the overall time.
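
For what it's worth, if disabling the collector is used as a workaround in a
real program, it is safer to do it only around the build phase and to re-enable
it in a finally block; a minimal sketch of that pattern, with the list
comprehension standing in for the real build step:

    import gc

    def build(n):
        # Stand-in for building the real large container structure.
        return [(i,) for i in range(n)]

    gc.disable()
    try:
        data = build(2000000)
    finally:
        gc.enable()
        gc.collect()  # one full collection after the build, instead of many during it
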
>
> In a related thread on http://bugs.python.org/issue2607
> Amaury Forgeot d'Arc suggested a setting of the GC
> thresholds that actually solves the problem for me:
>
> > Disabling the gc may not be a good idea in a real application; I suggest
> > you play with the gc.set_threshold function and set larger values, at
> > least while building the dictionary. (700, 1000, 10) seems to yield good
> > results.
> > python2.5 -m timeit 'gc.set_threshold(700,1000,10);l=[(i,) for i in range(2000000)]'
>
> 10 loops, best of 3: 658 msec per loop
>
> which made me suggest using these as defaults, but then
> Martin v. Löwis wrote that
>
> > No, the defaults are correct for typical applications.
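
For reference, the defaults Martin refers to can be inspected and adjusted at
runtime; the (700, 10, 10) value in the comment below is CPython's usual
default, quoted from memory rather than from the thread:

    import gc

    print(gc.get_threshold())        # typically (700, 10, 10)
    gc.set_threshold(700, 1000, 10)  # the larger setting suggested by Amaury
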
>
> At that point I felt lost, and as the general wish in that thread was
> to move discussion to comp.lang.python, I brought it up here, in a
> modified and simplified form.
>
> > I would suggest configuring the default behaviour of the garbage
> > collector in such a way that this squared complexity is avoided
> > without requiring specific knowledge and intervention by the user. Not
> > being an expert in these details I would like to ask the gurus how
> > this could be done.
> > I hope this is at least technically possible; whether it is
> > really desirable or important for a default installation of Python
> > could then be discussed once the disadvantages of such a setting
> > become apparent.
>
> I still don't see what is so good about defaults that lead to O(N*N)
> computation for an O(N) problem, and I like Amaury's suggestion a lot,
> so I would like to see comments on its disadvantages. Please don't
> tell me that O(N*N) is good enough. For N>1E7 it isn't.
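
If it helps the discussion: Amaury's setting can also be applied only while the
structure is being built and then restored, so the rest of the program keeps the
default behaviour. A rough sketch of that pattern (mine, not from the thread;
only the (700, 1000, 10) values come from Amaury's message):

    import gc

    old_thresholds = gc.get_threshold()
    gc.set_threshold(700, 1000, 10)   # collect the older generations far less often
    try:
        l = [(i,) for i in range(2000000)]
    finally:
        gc.set_threshold(*old_thresholds)
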
>
> About some other remarks made here:
>
> > I think the linguists of the world should write better automated
> > translation systems. Not being an expert in these details I would like
> > to ask the gurus how it could be done.
>
> I fully agree, and that is why I (as someone involved with MT) would
> prefer to focus on MT and not on memory allocation issues, and by the
> way, this is why I normally prefer to work in Python, as it normally
> lets me focus on the problem instead of the tool.
>
> > There are going to be pathological cases in any memory allocation
> > scheme. The fact that you have apparently located one calls for you to
> change your approach, not for the finely-tuned well-conceived garbage
> > collection scheme to be abandoned for your convenience.
>
> I do not agree at all. Having to deal with N>1E7 objects is IMHO not
> pathological; it is just daily practice in data-oriented work (we're
> talking about deriving models from Gigawords, not about toy systems).
>
> > If you had made a definite proposal that could have been evaluated, your
> > request would perhaps have seemed a little more approachable.
>
> I feel it is ok to describe a generic problem without having found
> the answer yet.
> (If you disagree: Where are your definite proposals wrt. MT? ;-)
> But I admit, I should have brought up Amaury's definite proposal
> right away.
>
> > A question often asked ... of people who are creating
> > very large data structures in Python is "Why are you doing that?"
> > That is, you should consider whether some kind of database solution
> > would be better.  You mention lots of dicts--it sounds like some
> > balanced B-trees with disk loading on demand could be a good choice.
>
> I do it because I have to count frequencies of pairs of things that
> appear in real data. Extremely simple stuff, only interesting because
> of the size of the data set. After Amaury's hint to switch GC off
> temporarily, I can count 100M pair tokens (18M pair types) in about
> 15 minutes, which is very cool, and I can do it in only a few lines
> of code, which is even cooler.
> Any approach that would touch the disk would be too slow for me, and
> bringing in complicated data structures by myself would distract me
> too much from what I need to do. That's exactly why I use Python;
> it has lots of highly fine-tuned complicated stuff behind the scenes
> that is just doing the right thing, so I should not have to (and
> could not) re-invent this myself.
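
Incidentally, that counting loop fits in a handful of lines once the collector
is tuned; a simplified sketch of the pattern, where pair_iter is a hypothetical
iterable of (word1, word2) tuples standing in for the real corpus reader:

    import gc
    from collections import defaultdict

    def count_pairs(pair_iter):
        counts = defaultdict(int)   # pair -> frequency
        gc.disable()                # or: gc.set_threshold(700, 1000, 10)
        try:
            for pair in pair_iter:
                counts[pair] += 1
        finally:
            gc.enable()
        return counts
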
>
> Thanks for the helpful comments,
> Andreas

In the issue at bugs.python.org, why do you say "Do I have to switch
to Perl or C to get this done???", as if Perl and C were similar
languages? If we were to look at Python, Perl, and C together, Python
and Perl would be similar languages, and C would be quite something
different, *especially* regarding the discussed issue: speed. So what
do you mean by saying "Perl or C"?


