I note that in r60567 Christian checked in support for compacting the int and float freelists. I have no problems with the implementation, but would like to note that I have been experimenting with an alternate approach which doesn't require the addition of a knob to initiate the compacting.

Probably in response to the same stimulus as Christian, it occurred to me that the freelist approach had been adopted long before PyMalloc was enabled as standard (in 2.3), and that much of the performance gains between 2.2 and 2.3 were in fact due to PyMalloc. So I've been testing with the freelists ripped out and ints and floats reverted to fairly standard PyObject_New allocation (retaining the small int interning) and thus relying on PyMalloc to do any compaction.

The results have been somewhat surprising... The short version is that:

- for ints, the freelist is ahead of PyMalloc by a very small margin (<4%)
- for floats, the freelist is a long way behind PyMalloc (>35% slower)

This is without invoking freelist compaction, by the way (though PyMalloc is releasing empty arenas).

I don't know what's pessimising the float freelist, but the results are similar on 2 boxes:

- AMD Athlon XP-1600+, 512MB RAM, FreeBSD 6.1, gcc 3.4.4 (~27k pystones)
- AMD Athlon XP-2500+, 512MB RAM, OS/2 v4 with EMX, gcc 2.8.1 (~38k pystones)

If there's interest in following this up, I can put my patches to intobject.c and floatobject.c into the tracker.
Test scripts:

a) int

<code>
# test routine - integer
import time

def b(time_now=time.clock):
    limit_val = 2000000
    start_time = time_now()
    for j in xrange(25):
        d = {}
        for i in xrange(limit_val):
            d[i] = i + limit_val
    return time_now() - start_time

if __name__ == '__main__':
    print 'elapsed: %s s' % b()
</code>

b) float

<code>
# test routine - float
import time

def b(time_now=time.clock):
    limit_val = 1000000
    vals = range(limit_val)
    offset = limit_val * 1.0
    start_time = time_now()
    for j in xrange(25):
        d = {}
        for i in [float(x) for x in vals]:
            d[i] = i + offset
    return time_now() - start_time

if __name__ == '__main__':
    print 'elapsed: %s s' % b()
</code>

The times I've obtained were:

    case                                    XP1600/Fbsd   XP2500/OS2 (*)
    1) int freelist                         38.1s         24.6s
    2) int pymalloc                         39.0s         25.3s
    3) float freelist (with int freelist)   75s           46.1s
    4) float pymalloc
       - with int freelist                  55s           29.0s
       - with int pymalloc                  55.5s         29.5s

(*) OS/2 tests based on patched 2.5.1 sources rather than svn trunk.
    FreeBSD tests based on svn trunk checked out at 1050UTC 7Feb08.

--
-------------------------------------------------------------------------
Andrew I MacIntyre                     "These thoughts are mine alone..."
E-mail: andymac@bullseye.apana.org.au  (pref) | Snail: PO Box 370
        andymac@pcug.org.au            (alt)  |        Belconnen ACT 2616
Web:    http://www.andymac.org/               |        Australia
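For reference, the same kind of measurement can also be driven through the timeit module, which handles repetition and best-of selection. This is only a sketch of an equivalent harness written for modern Python (print function, range instead of xrange), not the script used above; the function name and sizes are illustrative:

```python
# Sketch of an equivalent float-allocation benchmark using timeit.
# fill_dict_with_floats and its default size are illustrative, not
# taken from the original post.
import timeit

def fill_dict_with_floats(limit=100000):
    """Insert `limit` unique float keys, exercising float allocation."""
    offset = float(limit)
    d = {}
    for x in range(limit):
        f = float(x)
        d[f] = f + offset
    return d

if __name__ == '__main__':
    t = timeit.Timer('fill_dict_with_floats()',
                     'from __main__ import fill_dict_with_floats')
    print('best of 3: %.3f s' % min(t.repeat(repeat=3, number=1)))
```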
Andrew MacIntyre wrote:
So I've been testing with the freelists ripped out and ints and floats reverted to fairly standard PyObject_New allocation (retaining the small int interning) and thus relying on PyMalloc to do any compaction.
Nice! What do you mean by int interning? Are you talking about the small int cache?
The results have been somewhat surprising...
The short version is that: - for ints, the freelist is ahead of PyMalloc by a very small margin (<4%) - for floats, the freelist is a long way behind PyMalloc (>35% slower)
This without invoking freelist compaction by the way (though PyMalloc is releasing empty arenas).
I don't know what's pessimising the float freelist, but the results are similar on 2 boxes: - AMD Athlon XP-1600+, 512MB RAM, FreeBSD 6.1, gcc 3.4.4 (~27k pystones) - AMD Athlon XP-2500+, 512MB RAM, OS/2 v4 with EMX, gcc 2.8.1 (~38k pystones)
That's interesting. I wouldn't have thought that floats are much faster with pymalloc. I'd bet that pymalloc is a tiny bit slower until the new free list is filled.
If there's interest in following this up, I can put my patches to intobject.c and floatobject.c into the tracker.
I've uploaded my patch at http://bugs.python.org/issue2039. It keeps a standard free list with 8192 elements for floats and ints. Does your patch also use a free list, or does it alloc/free directly? Can you profile my patch and compare it to your results? It may be a bit faster if your patch doesn't use a free list.

Christian
On 2008-02-07 14:09, Andrew MacIntyre wrote:
Probably in response to the same stimulus as Christian it occurred to me that the freelist approach had been adopted long before PyMalloc was enabled as standard (in 2.3), and that much of the performance gains between 2.2 and 2.3 were in fact due to PyMalloc.
One of the hopes of having a custom allocator for Python was to be able to get rid of all free lists. For some reason that never happened. Not sure why. People were probably too busy with adding new features to the language at the time ;-)

Something you could try to make PyMalloc perform better for the builtin types is to check the actual size of the allocated PyObjects and then make sure that PyMalloc uses arenas large enough to hold a good quantity of them, e.g. it's possible that the float types fall into the same arena as some other type and thus don't have enough "room" to use as a free list.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Feb 08 2008)
Python/Zope Consulting and Support ...        http://www.egenix.com/
mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/

:::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! ::::

eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
One of the hopes of having a custom allocator for Python was to be able to get rid of all free lists. For some reason that never happened. Not sure why. People were probably too busy with adding new features to the language at the time ;-)
Probably not. It's more that the free lists still outperformed pymalloc.
Something you could try to make PyMalloc perform better for the builtin types is to check the actual size of the allocated PyObjects and then make sure that PyMalloc uses arenas large enough to hold a good quantity of them, e.g. it's possible that the float types fall into the same arena as some other type and thus don't have enough "room" to use as free list.
I don't think any improvements can be gained here. PyMalloc carves out pools of 4096 bytes from an arena when it runs out of blocks for a certain size class, and then keeps a linked list of pools of the same size class. So when many float objects get allocated, you'll have a lot of pools of the float type's size class. IOW, PyMalloc always has enough room.

Regards,
Martin
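Martin's description can be sketched numerically. The 4096-byte pool size is from his message and the 8-byte size-class spacing is documented in obmalloc.c; the pool header size used here is an assumption for illustration only:

```python
POOL_SIZE = 4096       # pymalloc carves pools of this size out of arenas
ALIGNMENT = 8          # size classes are spaced 8 bytes apart
POOL_OVERHEAD = 32     # assumed pool header size, for illustration only

def block_size(nbytes):
    """Size class a request of nbytes falls into (rounded up to 8)."""
    return ((nbytes + ALIGNMENT - 1) // ALIGNMENT) * ALIGNMENT

def blocks_per_pool(nbytes):
    """How many same-sized blocks one pool can hand out."""
    return (POOL_SIZE - POOL_OVERHEAD) // block_size(nbytes)

# A 16-byte object (e.g. a float on a 32-bit build) shares pools only
# with other requests of the same size class, so a flood of floats
# simply grows the list of 16-byte pools.
print(block_size(16), blocks_per_pool(16))   # 16 254
```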
On 2008-02-08 08:21, Martin v. Löwis wrote:
One of the hopes of having a custom allocator for Python was to be able to get rid of all free lists. For some reason that never happened. Not sure why. People were probably too busy with adding new features to the language at the time ;-)
Probably not. It's more that the free lists still outperformed pymalloc.
Something you could try to make PyMalloc perform better for the builtin types is to check the actual size of the allocated PyObjects and then make sure that PyMalloc uses arenas large enough to hold a good quantity of them, e.g. it's possible that the float types fall into the same arena as some other type and thus don't have enough "room" to use as free list.
I don't think any improvements can be gained here. PyMalloc carves out pools of 4096 bytes from an arena when it runs out of blocks for a certain size class, and then keeps a linked list of pools of the same size class. So when many float objects get allocated, you'll have a lot of pools of the float type's size class. IOW, PyMalloc always has enough room.
Well, yes, it doesn't run out of memory, but if pymalloc needs to allocate lots of objects of the same size, then performance degrades due to the management overhead involved for checking the free pools as well as creating new arenas as needed.

To reduce this overhead, it may be a good idea to preallocate pools for common sizes and make sure they don't drop under a certain threshold.

Here's a list of a few object sizes in bytes for Python 2.5 on an AMD64 machine:
>>> import mx.Tools
>>> mx.Tools.sizeof(int(0))
24
>>> mx.Tools.sizeof(float(0))
24
8-bit strings are var objects:
>>> mx.Tools.sizeof(str(''))
40
>>> mx.Tools.sizeof(str('a'))
41
Unicode objects use an external buffer:
>>> mx.Tools.sizeof(unicode(''))
48
>>> mx.Tools.sizeof(unicode('a'))
48
Lists do as well:
>>> mx.Tools.sizeof(list())
40
>>> mx.Tools.sizeof(list([1,2,3]))
40
Tuples are var objects:
>>> mx.Tools.sizeof(tuple())
24
>>> mx.Tools.sizeof(tuple([1,2,3]))
48
Old style classes:
>>> class C: pass
...
>>> mx.Tools.sizeof(C)
64
New style classes are a lot heavier:
>>> class D(object): pass
...
>>> mx.Tools.sizeof(D)
848
>>> mx.Tools.sizeof(type(2))
848
As you can see, integers and floats fall into the same pymalloc size class.

What's strange in Andrew's results is that both integers and floats use the same free list technique and fall into the same pymalloc size class, yet the results are different.

The only difference that's apparent is that small integers are shared, so depending on the data set used for the test, fewer calls to pymalloc or the free list are made.
On 2008-02-08 12:23, M.-A. Lemburg wrote:
As you can see, integers and floats fall into the same pymalloc size class. What's strange in Andrew's results is that both integers and floats use the same free list technique and fall into the same pymalloc size class, yet the results are different.
The only difference that's apparent is that small integers are shared, so depending on the data set used for the test, fewer calls to pymalloc or the free list are made.
Here's the same on a 32-bit machine. Notice the differences:
>>> import mx.Tools
>>> mx.Tools.sizeof(int(0))
12
>>> mx.Tools.sizeof(float(0))
16
>>> mx.Tools.sizeof(str(''))
24
>>> mx.Tools.sizeof(str('a'))
25
>>> mx.Tools.sizeof(unicode(''))
24
>>> mx.Tools.sizeof(unicode('a'))
24
>>> mx.Tools.sizeof(list())
20
>>> mx.Tools.sizeof(list([1,2,3]))
20
>>> mx.Tools.sizeof(tuple())
12
>>> mx.Tools.sizeof(tuple([1,2,3]))
24
>>> class C: pass
...
>>> mx.Tools.sizeof(C)
32
>>> class D(object): pass
...
>>> mx.Tools.sizeof(D)
424
>>> mx.Tools.sizeof(type(2))
424
Floats use 4 bytes more than integers. However, they still fall into the same pymalloc size class (16 bytes this time around).
M.-A. Lemburg wrote:
As you can see, Integers and floats fall into the same pymalloc size class. What's strange in Andrew's result is that both integers and floats use the same free list technique and fall into the same pymalloc size class, yet the results are different.
My take is not that PyMalloc is so much better in the float case, but that the float freelist implementation is suboptimal (though exactly why is a subject for conjecture, given it's almost identical to the int implementation).
The only difference that's apparent is that small integers are shared, so depending on the data set used for the test, fewer calls to pymalloc or the free list are made.
My testing was done with millions of unique integers, so the small int handling should be deep in the measurement noise.
Well, yes, it doesn't run out of memory, but if pymalloc needs to allocate lots of objects of the same size, then performance degrades due to the management overhead involved for checking the free pools as well as creating new arenas as needed.
To reduce this overhead, it may be a good idea to preallocate pools for common sizes and make sure they don't drop under a certain threshold.
I still can't see why this could speed up anything. The total time will be the same - whether you allocate them in advance or on demand should make no difference. In any case, if you think that you can improve things, please submit a patch.

Regards,
Martin
M.-A. Lemburg wrote:
On 2008-02-07 14:09, Andrew MacIntyre wrote:
Probably in response to the same stimulus as Christian it occurred to me that the freelist approach had been adopted long before PyMalloc was enabled as standard (in 2.3), and that much of the performance gains between 2.2 and 2.3 were in fact due to PyMalloc.
One of the hopes of having a custom allocator for Python was to be able to get rid of all free lists. For some reason that never happened. Not sure why. People were probably too busy with adding new features to the language at the time ;-)
Very probably ;-)
Something you could try to make PyMalloc perform better for the builtin types is to check the actual size of the allocated PyObjects and then make sure that PyMalloc uses arenas large enough to hold a good quantity of them, e.g. it's possible that the float types fall into the same arena as some other type and thus don't have enough "room" to use as free list.
Like MvL, I doubt it. Uncle Timmy did a pretty thorough nose-clean on PyMalloc.

However, my tests do show that something is funny with the current freelist implementation for floats on at least 2 platforms, and that doing without that sort of optimisation for float objects would likely not be a hardship with PyMalloc.
Andrew MacIntyre wrote:
However, my tests do show that something is funny with the current freelist implementation for floats on at least 2 platforms, and that doing without that sort of optimisation for float objects would likely not be a hardship with PyMalloc.
float objects are slightly larger than int objects. On a 32-bit OS both have an ob_type pointer of size 4 and a ref counter of size 4, but an int object holds its value in a 4-byte long while a float stores its value in an 8-byte double. I assume that the difference in size leads to a different allocation timing.

Christian
On Feb 8, 2008 10:54 AM, Christian Heimes <lists@cheimes.de> wrote:
Andrew MacIntyre wrote:
However, my tests do show that something is funny with the current freelist implementation for floats on at least 2 platforms, and that doing without that sort of optimisation for float objects would likely not be a hardship with PyMalloc.
float objects are slightly larger than int objects. On a 32bit OS both have ob_type pointer of size 4, a ref counter of size 4. But an int object has a value of type long with 4 bytes and a float stores its value in a double with size 8.
I assume that the difference in size leads to a different allocation timing.
It's not just size. Architectures may require data aligned on 4-, 8-, or 16-byte boundaries for optimal performance, depending on data type. IIRC, malloc aligns by 8 (not sure if that was a particular arch or very common). I don't know if pymalloc handles alignment.

n
[Neal Norwitz]
It's not just size. Architectures may require data aligned on 4, 8, or 16 addresses for optimal performance depending on data type. IIRC, malloc aligns by 8 (not sure if that was a particular arch or very common).
Just very common. Because malloc has no idea what the pointer it returns will be used for, it needs to satisfy the strictest alignment requirement for any type exposed by the C language.

As an extreme example, when I worked at Kendall Square Research, our hardware supported atomic locking on a "subpage" basis, and HW subpage addresses were all those divisible by 128. Subpages were exposed via C extensions, so the KSR malloc had to return 128-byte aligned pointers.
I don't know if pymalloc handles alignment.
pymalloc ensures 8-byte alignment. This is one plausible reason to keep the current int free list: an int object struct holds 3 4-byte members on most boxes (type pointer, refcount, and the int's value), and the int freelist code uses exactly 12 bytes for each on most boxes. To keep 8-byte alignment, pymalloc would have to hand out a 16-byte chunk per int object, wasting a fourth of the space (pymalloc always rounds up a requested size to a multiple of 8, and ensures the address returned is 8-byte aligned).
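Tim's space accounting can be checked with a couple of lines. The 12-byte int struct and 8-byte alignment are from his message; the rounding helper is just the standard align-up idiom:

```python
def round_up(n, align=8):
    """Round n up to the next multiple of align (align must be a power of 2)."""
    return (n + align - 1) & ~(align - 1)

int_struct = 4 + 4 + 4            # type pointer + refcount + value (32-bit box)
chunk = round_up(int_struct)      # what an 8-byte-aligned allocator hands out
waste = chunk - int_struct
print(chunk, waste)               # 16 4 -- a fourth of each chunk is wasted
```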
Tim Peters wrote:
pymalloc ensures 8-byte alignment. This is one plausible reason to keep the current int free list: an int object struct holds 3 4-byte members on most boxes (type pointer, refcount, and the int's value), and the int freelist code uses exactly 12 bytes for each on most boxes. To keep 8-byte alignment, pymalloc would have to hand out a 16-byte chunk per int object, wasting a fourth of the space (pymalloc always rounds up a requested size to a multiple of 8, and ensures the address returned is 8-byte aligned).
Given the background information, Python's long implementation could probably be optimized. In 3.0 a long has 3 4-byte members (ref count, ob_size and *ob_type) plus one or more unsigned shorts (2 bytes each) to hold the value. If pymalloc aligns the objects at 8-byte address boundaries, wouldn't it be better and slightly faster to use unsigned ints instead of unsigned shorts?

Christian
Given the background information, Python's long implementation could probably be optimized. In 3.0 a long has 3 4-byte members (ref count, ob_size and *ob_type) plus one or more unsigned shorts (2 bytes each) to hold the value. If pymalloc aligns the objects at 8-byte address boundaries, wouldn't it be better and slightly faster to use unsigned ints instead of unsigned shorts?
You mean, as the digits? You would have to rewrite the entire long datatype, which is a tedious exercise. Plus, I believe you would have to make some operations 64-bit on all systems: when you multiply digits today, the product will be 32-bit. If you extend the digits, you might get 64-bit results, which will be slow on 32-bit systems (plus some 32-bit systems don't support a 64-bit integer type at all in their compilers). Regards, Martin
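The width problem Martin describes can be made concrete. With the 2-byte digits of that era (15 value bits per unsigned short digit), a digit-by-digit product plus carry fits in a 32-bit temporary; widening to 30-bit digits would need 64-bit temporaries. A small sketch of the arithmetic:

```python
def product_bits(shift):
    """Bits needed for max_digit * max_digit + carry with shift-bit digits."""
    max_digit = (1 << shift) - 1
    return (max_digit * max_digit + max_digit).bit_length()

print(product_bits(15))   # 30 -- fits in a 32-bit temporary
print(product_bits(30))   # 60 -- would force 64-bit arithmetic
```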
Martin v. Löwis wrote:
You mean, as the digits? You would have to rewrite the entire long datatype, which is a tedious exercise. Plus, I believe you would have to make some operations 64-bit on all systems: when you multiply digits today, the product will be 32-bit. If you extend the digits, you might get 64-bit results, which will be slow on 32-bit systems (plus some 32-bit systems don't support a 64-bit integer type at all in their compilers).
I pass! :) It may be worth a try for Python 4.0, when most OSes are 64-bit.

Christian
Tim Peters wrote:
pymalloc ensures 8-byte alignment. This is one plausible reason to keep the current int free list: an int object struct holds 3 4-byte members on most boxes (type pointer, refcount, and the int's value), and the int freelist code uses exactly 12 bytes for each on most boxes. To keep 8-byte alignment, pymalloc would have to hand out a 16-byte chunk per int object, wasting a fourth of the space (pymalloc always rounds up a requested size to a multiple of 8, and ensures the address returned is 8-byte aligned).
Hmmm... the funny thing is that the freelist approach is showing higher memory consumption than the PyMalloc approach for the int case...
Neal Norwitz wrote:
It's not just size. Architectures may require data aligned on 4-, 8-, or 16-byte boundaries for optimal performance, depending on data type. IIRC, malloc aligns by 8 (not sure if that was a particular arch or very common). I don't know if pymalloc handles alignment.
Yes, pymalloc takes care of alignment. From Objects/obmalloc.c:

    Small requests are grouped in size classes spaced 8 bytes apart, due
    to the required valid alignment of the returned address.
On Feb 7, 2008 5:09 AM, Andrew MacIntyre <andymac@bullseye.apana.org.au> wrote:
So I've been testing with the freelists ripped out and ints and floats reverted to fairly standard PyObject_New allocation (retaining the small int interning) and thus relying on PyMalloc to do any compaction.
The results have been somewhat surprising...
The short version is that: - for ints, the freelist is ahead of PyMalloc by a very small margin (<4%) - for floats, the freelist is a long way behind PyMalloc (>35% slower)
Martin did some profiling of ints in py3k 1.5 years ago. The results of his profiling are here:

    http://svn.python.org/projects/python/branches/py3k/INTBENCH

I think Martin wrote a mail describing his work in more detail, but the only mails I could find are not what I remember:

    http://mail.python.org/pipermail/python-3000/2006-August/003185.html
    http://mail.python.org/pipermail/python-3000/2006-August/003064.html

I don't know if he did any work on trunk or if he remembers anything more that might be relevant here.

n
I've done my own profiling with a different script and was able to verify Andrew's results. The pymalloc approach makes int creation slightly slower and has almost no effect on floats. The creation of 1000 lists of 1000 ints (range(1000)) is about 20 msec slower. It almost doubles the time for a trivial test "for i in range(1000): range(1000)", but that's an artificial test.

Current allocation schema
-------------------------
for i in range(1000): [float(x) for x in range(1000)]
    10 loops, best of 3: 760 msec per loop
for i in range(1000): range(1000)
    10 loops, best of 3: 27 msec per loop
for i in range(1000): [x for x in range(1000)]
    10 loops, best of 3: 218 msec per loop

Without a free list
-------------------
for i in range(1000): [float(x) for x in range(1000)]
    10 loops, best of 3: 792 msec per loop
for i in range(1000): range(1000)
    10 loops, best of 3: 51.5 msec per loop
for i in range(1000): [x for x in range(1000)]
    10 loops, best of 3: 241 msec per loop

With a fixed free list of 2,500 objects each
--------------------------------------------
for i in range(1000): [float(x) for x in range(1000)]
    10 loops, best of 3: 736 msec per loop
for i in range(1000): range(1000)
    10 loops, best of 3: 25 msec per loop
for i in range(1000): [x for x in range(1000)]
    10 loops, best of 3: 198 msec per loop

As you can clearly see, an approach with a small free list in a fixed size array like

    static PyFloatObject *free_list[PyFloat_MAXFREELIST];

is even faster than block allocation. A small free list with 80 objects each would speed up the creation of floats and ints a bit *and* result in a quicker return of memory to the OS.

Christian
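The fixed-array free list Christian describes lives in C, but its caching policy can be modelled in a few lines of Python. This toy model is only an illustration of the bounded-cache behaviour (cache up to maxsize freed objects, drop the rest), not the actual implementation:

```python
class FixedFreeList(object):
    """Toy model of a bounded free list: freed objects are cached in a
    fixed-size array; once it is full, further frees fall through, so
    the memory goes back to the underlying allocator."""

    def __init__(self, maxsize=80):
        self._slots = []
        self._maxsize = maxsize
        self.hits = 0       # allocations served from the cache
        self.misses = 0     # allocations that hit the real allocator

    def alloc(self):
        if self._slots:
            self.hits += 1
            return self._slots.pop()
        self.misses += 1
        return object()     # stand-in for PyObject_New

    def free(self, obj):
        if len(self._slots) < self._maxsize:
            self._slots.append(obj)   # cached for reuse
        # else: dropped -- the allocator reclaims the memory

fl = FixedFreeList(maxsize=80)
objs = [fl.alloc() for _ in range(100)]    # 100 cold allocations
for o in objs:
    fl.free(o)                             # only 80 fit in the cache
objs = [fl.alloc() for _ in range(100)]    # 80 reused, 20 fresh
print(fl.hits, fl.misses)                  # 80 120
```

A larger maxsize raises the hit rate at the cost of holding more memory back from the allocator, which is exactly the trade-off the thread is debating.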
participants (6):

- "Martin v. Löwis"
- Andrew MacIntyre
- Christian Heimes
- M.-A. Lemburg
- Neal Norwitz
- Tim Peters