[Python-Dev] Windows Low Fragmentation Heap yields speedup of ~15%

Wed Feb 23 12:46:56 CET 2005

> A well-known trick is applicable in that case, if Martin thinks it's
> worth the bother:
> grow the list to its final size once, at the start (overestimating if
> you don't know for sure).  Then instead of appending, keep an index to
> the next free slot, same as you'd do in C.  Then the list guts never
> move, so if that doesn't yield the same kind of speedup without using
> LFH, list copying wasn't actually the culprit to begin with.

I actually did that in Py2.1, and removed it when we switched to Zope 2.7/Py 2.3,
because it was no longer worth it, presumably due to obmalloc becoming
enabled. Unfortunately, I have lost the speedup gained by my Fast-Append-List
in Py 2.1, but I recall it saved about 5% in a similar test than the
one I have today. 

> Yes.  For example, a 300-character string could do it (that's not
> small to obmalloc, but is to LFH).  Strings produced by pickling are
> very often that large, and especially in Zope (which uses pickles
> extensively under the covers -- reading and writing persistent objects
> in Zope all involve pickle strings).

With my amateur (in C-stuff) knowledge, this sounds as a very likely point. 
>From Evan's 17 Feb mail, it sounds that cPickle would not use obmalloc - 
if this is the case, wouldn't that be an obvious (and relatively easy) speedup?

> If someone were motivated enough, it would probably be easiest to 
> run Martin's app on a Linux box, and use the free Linux tools to analyze it.

As it is often the case, real-life figures do not come from a benchmark,
but from an application test with real data, so putting the whole thing
up on Linux would need quite some time - which I don't have :-(

Best regards, Martin

-----Original Message-----
From: Tim Peters [mailto:tim.peters at gmail.com] 
Sent: Friday, 18 Feb 2005 23:52
To: Evan Jones
Cc: Gfeller Martin; Martin v. Löwis; Python Dev
Subject: Re: [Python-Dev] Windows Low Fragementation Heap yields speedup of ~15%

[Tim Peters]
...
>> Then you allocate a small object, marked 's':
>>
>> bbbbbbbbbbbbbbbsfffffffffffffffffffffffffffffff

[Evan Jones]
> Isn't the whole point of obmalloc

No, because it doesn't matter what follows that introduction: 
obmalloc has several points, including exploiting the GIL, heuristics
aiming at reusing memory while it's still high in the memory
heirarchy, almost never touching a piece of memory until it's actually
needed, and so on.

> is that we don't want to allocate "s" on the heap, since it is small?

That's one of obmalloc's goals, yes.  But "small" is a relative
adjective, not absolute.  Because we're primarily talking about LFH
here, the natural meaning for "small" in _this_ thread is < 16KB,
which is much larger than "small" means to obmalloc.  The memory-map
example applies just well to LFH as to obmalloc, by changing which
meaning for "small" you have in mind.

> I guess "s" could be an object that might potentially grow.

For example, list guts in Python are never handled by obmalloc,
although the small fixed-size list _header_ object is always handled
by obmalloc.

>> One thing to take from that is that LFH can't be helping list-growing
>> in a direct way either, if LFH (as seems likely) also needs to copy
>> objects that grow in order to keep its internal memory segregated by
>> size.  The indirect benefit is still available, though:  LFH may be
>> helping simply by keeping smaller objects out of the general heap's
>> hair.

> So then wouldn't this mean that there would have to be some sort of
> small object being allocated via the system malloc that is causing the
> poor behaviour?

Yes.  For example, a 300-character string could do it (that's not
small to obmalloc, but is to LFH).  Strings produced by pickling are
very often that large, and especially in Zope (which uses pickles
extensively under the covers -- reading and writing persistent objects
in Zope all involve pickle strings).

> As you mention, I wouldn't think it would be list objects, since resizing
> lists using LFH should be *worse*.

Until they get to LFH's boundary for "small", and we have only the
vaguest idea what Martin's app does here -- we know it grows lists
containing 50K elements in the end, and ... well, that's all I really
know about it <wink>.

A well-known trick is applicable in that case, if Martin thinks it's
worth the bother:
grow the list to its final size once, at the start (overestimating if
you don't know for sure).  Then instead of appending, keep an index to
the next free slot, same as you'd do in C.  Then the list guts never
move, so if that doesn't yield the same kind of speedup without using
LFH, list copying wasn't actually the culprit to begin with.

> That would actually be something that is worth verifying, however.

Not worth the time to me -- Windows is closed-source, and I'm too old
to enjoy staring at binary disassemblies any more.  Besides, list guts
can't stay in LFH after the list exceeds 4K elements.  If list-copying
costs are significant here, they're far more likely to be due to
copying lists over 4K elements than under -- copying a list takes
O(len(list)) time.  So the realloc() strategy used by LFH _probably_
isn't of _primary)_ interest here.

> It could be that the Windows LFH is extra clever?

Sure -- that I doubt it moves Heaven & Earth to cater to reallocs is
just educated guessing.  I wrote my first production heap manager at
Cray Research, around 1979 <wink>.

> ...
> Well, it would also be useful to find out what code is calling the
> system malloc. This would make it easy to examine the code and see if
> it should be calling obmalloc or the system malloc. Any good ideas for
> easily obtaining this information? I imagine that some profilers must
> be able to produce a complete call graph?

Windows supports extensive facilities for analyzing heap usage, even
from an external process that attaches to the process you want to
analyze.  Ditto for profiling.  But it's not easy, and I don't know of
any free tools that are of real help.  If someone were motivated
enough, it would probably be easiest to run Martin's app on a Linux
box, and use the free Linux tools to analyze it.