RE: [Python-Dev] Windows Low Fragmentation Heap yields speedup of ~15%
[Tim Peters]
A well-known trick is applicable in that case, if Martin thinks it's worth the bother: grow the list to its final size once, at the start (overestimating if you don't know for sure). Then instead of appending, keep an index to the next free slot, same as you'd do in C. Then the list guts never move, so if that doesn't yield the same kind of speedup without using LFH, list copying wasn't actually the culprit to begin with.
I actually did that in Py 2.1, and removed it when we switched to Zope 2.7/Py 2.3, because it was no longer worth it, presumably due to obmalloc becoming enabled. Unfortunately, I no longer have the measurements of the speedup gained by my Fast-Append-List in Py 2.1, but I recall it saved about 5% in a test similar to the one I have today.
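For concreteness, a minimal sketch of the trick in Python (the names and the final trim are illustrative, not from either message):

    def build_preallocated(items, estimated_size):
        # Grow the list to its final size once, at the start
        # (overestimate if you don't know the size for sure).
        result = [None] * estimated_size
        next_free = 0                  # index of the next free slot, as in C
        for item in items:
            result[next_free] = item   # assign in place; the guts never move
            next_free += 1
        del result[next_free:]         # trim any overestimate in one step
        return result

Compared with a plain append loop, the guts are allocated exactly once, so any speedup that survives without LFH can't be coming from list copying.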
[Tim Peters]
Yes. For example, a 300-character string could do it (that's not small to obmalloc, but is to LFH). Strings produced by pickling are very often that large, and especially in Zope (which uses pickles extensively under the covers -- reading and writing persistent objects in Zope all involve pickle strings).
With my amateur knowledge of C internals, this sounds like a very likely explanation.
From Evan's 17 Feb mail, it sounds as though cPickle does not use obmalloc -- if that is the case, wouldn't making it use obmalloc be an obvious (and relatively easy) speedup?
[Tim Peters]
If someone were motivated enough, it would probably be easiest to run Martin's app on a Linux box, and use the free Linux tools to analyze it.
As is often the case, the real-life figures come not from a benchmark but from an application test with real data, so putting the whole thing up on Linux would need quite some time -- which I don't have :-(

Best regards,
Martin

-----Original Message-----
From: Tim Peters [mailto:tim.peters@gmail.com]
Sent: Friday, 18 Feb 2005 23:52
To: Evan Jones
Cc: Gfeller Martin; Martin v. Löwis; Python Dev
Subject: Re: [Python-Dev] Windows Low Fragmentation Heap yields speedup of ~15%

[Tim Peters]
...
Then you allocate a small object, marked 's':
bbbbbbbbbbbbbbbsfffffffffffffffffffffffffffffff
[Evan Jones]
Isn't the whole point of obmalloc
No, because it doesn't matter what follows that introduction: obmalloc has several points, including exploiting the GIL, heuristics aiming at reusing memory while it's still high in the memory hierarchy, almost never touching a piece of memory until it's actually needed, and so on.
is that we don't want to allocate "s" on the heap, since it is small?
That's one of obmalloc's goals, yes. But "small" is a relative adjective, not absolute. Because we're primarily talking about LFH here, the natural meaning for "small" in _this_ thread is < 16KB, which is much larger than "small" means to obmalloc. The memory-map example applies just as well to LFH as to obmalloc, by changing which meaning of "small" you have in mind.
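To make the two meanings of "small" concrete, a hedged sketch (256 bytes was obmalloc's SMALL_REQUEST_THRESHOLD in the Python 2.3/2.4 era; Windows documents the LFH as handling blocks smaller than 16KB):

    OBMALLOC_SMALL = 256     # bytes; requests at or below this go to obmalloc's pools
    LFH_SMALL = 16 * 1024    # bytes; requests below this qualify for the LFH

    def small_to(nbytes):
        # Classify a request size against each allocator's notion of "small".
        return {
            "obmalloc": nbytes <= OBMALLOC_SMALL,
            "LFH": nbytes < LFH_SMALL,
        }

    print(small_to(300))     # {'obmalloc': False, 'LFH': True}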
I guess "s" could be an object that might potentially grow.
For example, list guts in Python are never handled by obmalloc, although the small fixed-size list _header_ object is always handled by obmalloc.
One thing to take from that is that LFH can't be helping list-growing in a direct way either, if LFH (as seems likely) also needs to copy objects that grow in order to keep its internal memory segregated by size. The indirect benefit is still available, though: LFH may be helping simply by keeping smaller objects out of the general heap's hair.
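One way to watch list guts move while growing, if you're curious -- a sketch leaning on CPython implementation details (the PyListObject layout below is an assumption about non-debug CPython builds, not a supported API):

    import ctypes

    class PyListObject(ctypes.Structure):
        # Assumed layout, per CPython's Include/listobject.h:
        # refcount, type, length, guts pointer, allocated capacity.
        _fields_ = [
            ("ob_refcnt", ctypes.c_ssize_t),
            ("ob_type", ctypes.c_void_p),
            ("ob_size", ctypes.c_ssize_t),
            ("ob_item", ctypes.c_void_p),     # pointer to the guts array
            ("allocated", ctypes.c_ssize_t),
        ]

    lst = []
    header = PyListObject.from_address(id(lst))  # id() is the address in CPython
    addresses = set()
    for i in range(100000):
        lst.append(i)
        addresses.add(header.ob_item)  # record where the guts live right now

    # Address reuse can make this undercount, but it shows the guts relocating.
    print("guts observed at %d distinct addresses" % len(addresses))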
So then wouldn't this mean that there would have to be some sort of small object being allocated via the system malloc that is causing the poor behaviour?
Yes. For example, a 300-character string could do it (that's not small to obmalloc, but is to LFH). Strings produced by pickling are very often that large, and especially in Zope (which uses pickles extensively under the covers -- reading and writing persistent objects in Zope all involve pickle strings).
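A quick, hedged way to get a feel for typical pickle sizes (the record below is made up; exact byte counts vary with pickle protocol and Python version):

    import pickle

    # A modest made-up record of the sort an application might persist.
    record = {"name": "some persistent object", "values": list(range(30))}
    data = pickle.dumps(record)

    # On the order of a couple hundred bytes here: past obmalloc's
    # small-object threshold, but comfortably "small" to the LFH.
    print(len(data))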
As you mention, I wouldn't think it would be list objects, since resizing lists using LFH should be *worse*.
Until they get to LFH's boundary for "small", and we have only the vaguest idea what Martin's app does here -- we know it grows lists containing 50K elements in the end, and ... well, that's all I really know about it <wink>. A well-known trick is applicable in that case, if Martin thinks it's worth the bother: grow the list to its final size once, at the start (overestimating if you don't know for sure). Then instead of appending, keep an index to the next free slot, same as you'd do in C. Then the list guts never move, so if that doesn't yield the same kind of speedup without using LFH, list copying wasn't actually the culprit to begin with.
That would actually be something that is worth verifying, however.
Not worth the time to me -- Windows is closed-source, and I'm too old to enjoy staring at binary disassemblies any more. Besides, list guts can't stay in LFH after the list exceeds 4K elements (at 4 bytes per element pointer on 32-bit Windows, 4K elements fill LFH's 16KB limit). If list-copying costs are significant here, they're far more likely to be due to copying lists over 4K elements than under -- copying a list takes O(len(list)) time. So the realloc() strategy used by LFH _probably_ isn't of _primary_ interest here.
It could be that the Windows LFH is extra clever?
Sure -- my doubt that it moves Heaven & Earth to cater to reallocs is just educated guessing. I wrote my first production heap manager at Cray Research, around 1979 <wink>.
... Well, it would also be useful to find out what code is calling the system malloc. This would make it easy to examine the code and see if it should be calling obmalloc or the system malloc. Any good ideas for easily obtaining this information? I imagine that some profilers must be able to produce a complete call graph?
Windows supports extensive facilities for analyzing heap usage, even from an external process that attaches to the process you want to analyze. Ditto for profiling. But it's not easy, and I don't know of any free tools that are of real help. If someone were motivated enough, it would probably be easiest to run Martin's app on a Linux box, and use the free Linux tools to analyze it.