List problem

Bengt Richter bokr at oz.net
Mon Nov 1 21:36:06 CET 2004


On Mon, 01 Nov 2004 00:13:18 -0700, Josiah Carlson <jcarlson at uci.edu> wrote:

>
>bokr at oz.net (Bengt Richter) wrote:
>
>>  delsome_br
>>  3.39048024526
>>  3.63548729364
>>  3.6326486655
>>  delsome_am
>>  6.40547292869
>>  6.32403355062
>>  6.21752171923
>
>FYI: list comprehensions in 2.4 are significantly faster than list
>comprehensions in 2.3, which is likely where a large portion of your
>difference is coming from.
>
Sounds reasonable.

>I wasn't sure that I believed your statements in regards to Python's
>cache performance, so I thought I'd give it a look...
I'm not sure which statements you are referring to, but if it's about
bad jobsharing between multiple processors, I don't think that's what's
causing the result you get below. I get a similar slowdown after shuffling.

I would guess only one of your processors is really doing the work of the
interpreter, and the other mostly idling and taking care of i/o and interrupts.

So why would a single processor slow down? The low level in for i in a: pass must
be to bind i to successive integers, however laid out. Meaning incrementing and
decrementing ref counts on the integer objects, and doing that either in linear
memory order or shuffled order, with the integer objects themselves not moving
at all, even with the shuffle of references to them in a.

So how big an L2 cache? 512k bytes? How big is an int object? Even if type info is
shared for an arena, it must be a value,refcount pair each at least, or 8 bytes.
I guess I'd bet on no tricks for type info, so that's a pointer for that -> 12 bytes.
so 512k/12 is about 43k integer objects in L2 cache. But we have a million, so if they
are laid out linearly, we should have 1e6*12/512k or about 22.89 int objects modulo sharing
each cache space that can span an integer. Ok, but L2 cache communicates with CPU cache
which is much smaller, maybe 16 or 32k. 8k 32-bit values in 2k 128-bit chunks. Maybe ;-)
or about 2700 12-byte int objects.

But cache lines probably go back and forth to L2 memory 128 bits wide or 16 bytes.
Which doesn't match 12 bytes, so what happens when you walk orderly through type,refount,value
triplets presumably reading type and reading and writing refount twice, and ignoring the value
since were doing pass?

Anyway, I'm not sure how smart L2 cache is with communicating to slower DRAM, which might
be interleaved in some cases, and have narrower data path. So many factors.

But you can see reading 3*16 will get you 4*12 where randomly you might have to read 4*16
or even more to get your 4*12, depending on the overlaps. If that is what's doing it,
going to a different int representation, where only refcount and value are stored, and the type
info is separate, then the 8 byte lineup with cache chunks ought to make noticeable speedup
for int ops. Of course, I don't know what percentage of all ops that is. I guess someone
instruments the interpreter for that kind of statistics? Timbot?

>
>Python 2.3.2 (#49, Oct  2 2003, 20:02:00) [MSC v.1200 32 bit (Intel)] on wi=
>n32
>Type "help", "copyright", "credits" or "license" for more information.
>>>> a =3D range(1000000)
>>>> import time
>>>> def f(a):
>=2E..     t =3D time.time()
>=2E..     for i in a:
>=2E..             pass
>=2E..     return time.time()-t
>=2E..
>>>> for i in xrange(10):
>=2E..     print f(a),
>=2E..
>0.421999931335 0.421000003815 0.421999931335 0.422000169754 0.421999931335 =
>0.437
>99996376 0.421000003815 0.421999931335 0.43799996376 0.406000137329
>>>> import random
>>>> random.shuffle(a)
>>>> for i in xrange(10):
>=2E..     print f(a),
>=2E..
>0.594000101089 0.59299993515 0.610000133514 0.608999967575 0.608999967575 0.=
>5940
>00101089 0.593999862671 0.608999967575 0.594000101089 0.593999862671
>>>> a.sort()
>>>> for i in xrange(10):
>=2E..     print f(a),
>=2E..
>0.421999931335 0.421999931335 0.421999931335 0.422000169754 0.43700003624 0.=
>4219
>99931335 0.421999931335 0.421999931335 0.422000169754 0.43700003624
>
>
>Looks to be an almost 50% slowdown on my dual celeron 400mhz machine
>(128kb cache).  Very interesting, I would have never have expected as
>much (due to the amount of overhead of Python data structures, I
>believed it was generally a wash).
>

Regards,
Bengt Richter



More information about the Python-list mailing list