[Python-bugs-list] [ python-Bugs-604716 ] faster [None]*n or []*n

noreply@sourceforge.net
Thu, 12 Sep 2002 14:13:55 -0700


Bugs item #604716, was opened at 2002-09-04 16:20
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=604716&group_id=5470

Category: Python Interpreter Core
Group: Not a Bug
Status: Open
Resolution: None
Priority: 5
Submitted By: Gregory Smith (gregsmith)
Assigned to: Nobody/Anonymous (nobody)
Summary: faster [None]*n or []*n

Initial Comment:
This is a suggestion to speed up multiplying a list by an
integer in the case where the length of the list is 0 or 1.
Currently there is a lot of loop overhead in these cases.
The current loop (in list_repeat, in listobject.c):

	p = np->ob_item;
	for (i = 0; i < n; i++) {
		for (j = 0; j < a->ob_size; j++) {
			*p = a->ob_item[j];
			Py_INCREF(*p);
			p++;
		}
	}
	return (PyObject *) np;

Suggested replacement (where Py_INCREF_N(o,n) bumps the
refcount of 'o' by n):

	p = np->ob_item;
	if (a->ob_size <= 1) {
		if (a->ob_size) {	/* a->ob_size is 1 */
			PyObject *a0 = a->ob_item[0];
			for (i = 0; i < n; i++) {
				*p++ = a0;
			}
			Py_INCREF_N(a0, n);
		}
	}
	else {
		for (i = 0; i < n; i++) {
			for (j = 0; j < a->ob_size; j++) {
				*p = a->ob_item[j];
				Py_INCREF(*p);
				p++;
			}
		}
	}
	return (PyObject *) np;


You could also add special cases for len=2, etc.
(with *p++ = a0; *p++ = a1; inside the loop), but len=1
is the common case, given idioms like [None]*1000
used to create lists.

The best approach in general is to check whether the list
being replicated is sufficiently smaller than the
replication count; if so, use a different pair of loops
nested in the other order, so that the outer loop runs
fewer times.

With the loops inverted, you make 'a->ob_size' calls to
Py_INCREF_N instead of 'a->ob_size*n' calls to Py_INCREF.
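
For a short source list, the inverted form might look like
this (a sketch, using the proposed Py_INCREF_N rather than
anything that exists in the current sources):

	p = np->ob_item;
	for (j = 0; j < a->ob_size; j++) {
		PyObject *a_j = a->ob_item[j];
		PyObject **q = p + j;
		/* element j of copy i lives at p[i*a->ob_size + j] */
		for (i = 0; i < n; i++) {
			*q = a_j;
			q += a->ob_size;
		}
		/* one bulk increment instead of n separate ones */
		Py_INCREF_N(a_j, n);
	}
	return (PyObject *) np;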


The exact same comments apply to tuplerepeat().

If there is any interest, I will code this and test it.



----------------------------------------------------------------------

>Comment By: Gregory Smith (gregsmith)
Date: 2002-09-12 17:13

Message:
Logged In: YES 
user_id=292741

OK, I'll try that when I have time. The problem is, you
can't do [None]*1000000 over and over again for 10 seconds
unless you also destroy the lists; there isn't enough RAM.
And I was trying not to count the deallocation time, which
of course also depends on the list length. I was calling
clock() immediately before and after the operation and
accumulating the differences, thus excluding the rest of
the loop, including the list delete. Any uncertainty in
clock() should in principle average out over a lot of
reps. For what it's worth, the results were reasonably
repeatable from run to run (within maybe 5%) and very
linear with respect to changes in the '*1000000' factor.

I know I'm effectively timing one call to clock() as part
of the timed code, but that won't change with the list
size and can thus be factored out.


----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2002-09-11 12:43

Message:
Logged In: YES 
user_id=21627

Please don't invoke clock() twice per iteration. The
standard timing procedure is

iterations = [None] * niter
start = clock()
for i in iterations:
  actions_to_time
total = clock() - start

That way, you don't measure the time it takes to create
the sequence being iterated over.
Also, don't trust any measurements that have a total time
of less than 10s (running for a minute is even better).
With this algorithm, please re-report the two totals and
the number of iterations. Run it several times in a row,
to see how the results vary.

----------------------------------------------------------------------

Comment By: Gregory Smith (gregsmith)
Date: 2002-09-06 19:10

Message:
Logged In: YES 
user_id=292741

Using this test code, and varying the 1000000 factor, I
have found that each additional copy of the None adds
about 40 nsec. This doesn't change noticeably when I add
the 'improvement'.

If I change it to []*whatever, then the old implementation
needs about 5 ns extra for each non-copy (and the new one
isn't sensitive to the factor). This tells me that most of
the time in [None]*lots is spent in the PyList_New() call
that creates the list.

So I went and made a variant of PyList_New which doesn't
clear the list, and found that [None]*1000000 takes 22
msec instead of the 40 it used to take. That doesn't make
much sense looking at the code; maybe cache effects.
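
For reference, roughly what such a non-clearing variant
might look like (the name is hypothetical, and the
overflow checks and GC tracking that the real PyList_New
performs are elided here):

	/* Hypothetical: like PyList_New, but skips NULL-filling
	   ob_item.  The caller must store into every slot before
	   the list is inspected or deallocated. */
	static PyObject *
	list_new_uninitialized(int size)
	{
		PyListObject *op = PyObject_GC_New(PyListObject, &PyList_Type);
		if (op == NULL)
			return NULL;
		op->ob_item = (PyObject **) PyMem_MALLOC(size * sizeof(PyObject *));
		if (op->ob_item == NULL && size > 0) {
			PyObject_GC_Del(op);
			return PyErr_NoMemory();
		}
		op->ob_size = size;
		return (PyObject *) op;
	}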

Time to go home now...
-------------------------
def timeit():
    from time import clock
    t0 = clock()
    tin = 0
    niter = 50
    for i in range(niter):
        # time only the replication expression itself
        ta, p, tb = clock(), [None]*1000000, clock()
        tin += tb - ta
        p = 0   # drop the reference so the list is deallocated
    t0 = clock() - t0
    print "avg. time = ", tin/niter
    print "avg. overhead = ", t0/niter

timeit()
---------------------


----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2002-09-06 15:16

Message:
Logged In: YES 
user_id=80475

Yes, I looked at the all-at-once memcpy approach but
found that MSVC++'s memcpy() was too smart and would not
use the buffer overlap as intended.  That is a dead end.

Using memcpy() in a loop as I suggested will work fine.
Neal implemented a number of micro-optimizations using
this and/or memchr().

You've already hit on the common case and the best
speed-up by special-casing length one and by Py_INCREF_N.
Go for it.

----------------------------------------------------------------------

Comment By: Gregory Smith (gregsmith)
Date: 2002-09-06 14:41

Message:
Logged In: YES 
user_id=292741

[ sound effects: can opener, worms... ] ;-)

Use of memcpy/memset is a whole other issue - see below.

What you've got there doesn't help for the case I mention.
If you do []*100000, you get 100000 calls to memcpy with
size = 0, and if you do [None]*100000, 100000 calls to
memcpy with size = 4; this could easily wind up being even
slower. The point is to reduce the number of times the
inner loop quits and the outer loop runs; the code in the
inner loop will run the same number of times regardless
(although *p = regvar is faster than *p = ptr->a[j]); it's
the loop overhead (instruction queue flushes, etc.) that
makes a difference here. Bear in mind too that
'j < a->ob_size' will reload a->ob_size on every pass if
the compiler can't be sure you didn't write it via *p.
[Yes, I've spent too much time looking at C compiler
output; real-time DSP applications, mostly.]
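
As an illustration of that point (a sketch, not part of
any patch): hoisting the bound into a local lets the
compiler keep it in a register, since stores through *p
cannot alias a local.

	int size = a->ob_size;	/* loaded once, kept in a register */
	p = np->ob_item;
	for (i = 0; i < n; i++) {
		for (j = 0; j < size; j++) {
			*p = a->ob_item[j];
			Py_INCREF(*p);
			p++;
		}
	}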

Many compilers implement memcpy as an intrinsic so that
there's no call, but since the length of the copy is not
fixed, code still has to be generated to test the length,
and you wind up no better than the original for short
source lists.

Regarding the use of memcpy/memset in general: these could
be applied in a lot of cases - making a list slice,
filling a new list with NULLs, etc. - and on some machines
they make a big difference to the speed of the operation.
Compared to the time spent later working on that list, it
may not show up much in the end. Having said that, I've
done some Python apps which process, say, 20 Mbytes of
binary floating-point data in about 12 seconds, by
avoiding any Python-level looping (using Numeric and image
file read/writes). In that kind of thing, a faster slice
copy can make a difference.

I'm assuming the programmer would rather forgo the
marginal speedup of these routines if it poses even a
small possibility of creating portability issues on some
weird machine, and that is a pretty wise policy on a
project of this nature.

I suspect any machine where memcpy doesn't work in this
context would also fail to work with Python's polymorphism
and general memory management, but whatever.

A really good way to make a lot of copies of a small list
is to make the first copy, then do a single

	memcpy(p + size, p, (n-1) * size * sizeof(*p));

which 'rolls' the same memory over and over - no loops,
except within memcpy, and the assumption is that that one
is as good as it gets. But this may be even less portable
than the other uses of memcpy, because it makes
assumptions about how memcpy handles overlapped transfers.
E.g., machines with a lot of registers could do a bunch of
reads and then a bunch of writes to help out the cache,
thus breaking this for small input lists.
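
For what it's worth, the same trick can be made
overlap-free (again a sketch, not from any patch): fill by
doubling, so each memcpy's source and destination are
disjoint and the behaviour is well-defined.

	/* After the first copy of 'size' items is in place at p,
	   repeatedly double the filled region; source and
	   destination never overlap. */
	long filled = size;
	long total = (long)n * size;
	while (filled * 2 <= total) {
		memcpy(p + filled, p, filled * sizeof(*p));
		filled *= 2;
	}
	if (filled < total)
		memcpy(p + filled, p, (total - filled) * sizeof(*p));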

And yes, I'll report some speedup results when I have some.

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2002-09-06 14:38

Message:
Logged In: YES 
user_id=80475

There was a typo in my code fragment.  It should have
read:

 memcpy(p + i*a->ob_size, a->ob_item,
        a->ob_size * sizeof(PyObject *));

The memcpy speeds up cases where there are many items in
the list.  It is actually slower for a list of length one,
which should still be special-cased as the OP suggested.
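
In context, the corrected fragment amounts to something
like this (a sketch; Py_INCREF_N is the OP's proposed
helper, not an existing macro):

	p = np->ob_item;
	for (i = 0; i < n; i++)
		memcpy(p + i * a->ob_size, a->ob_item,
		       a->ob_size * sizeof(PyObject *));
	for (j = 0; j < a->ob_size; j++)
		Py_INCREF_N(a->ob_item[j], n);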

Also, his idea of adding Py_INCREF_N is great.  It can be
used throughout Python.  Since Py_INCREF has lots of
macro magic for updating debug statistics, the OP's idea
may have to be implemented as a function.
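
One possible shape for it (a sketch only; neither the
macro nor the function exists in CPython, and a
Py_REF_DEBUG build is exactly where a plain macro breaks
down):

	#ifndef Py_REF_DEBUG
	/* Bump op's refcount by n in one step. */
	#define Py_INCREF_N(op, n)	((op)->ob_refcnt += (n))
	#else
	/* Debug builds also maintain _Py_RefTotal, so keep the
	   bookkeeping exact by falling back to repeated Py_INCREF. */
	static void
	_Py_IncRefN(PyObject *op, int n)
	{
		while (n-- > 0)
			Py_INCREF(op);
	}
	#define Py_INCREF_N(op, n)	_Py_IncRefN((PyObject *)(op), (n))
	#endif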

I think there is interest in the OP's ideas and that he
should go ahead and develop a tested patch along with
timing statistics.


----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2002-09-06 05:53

Message:
Logged In: YES 
user_id=21627

Can you report what speed-up you get, for what example?

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2002-09-04 21:45

Message:
Logged In: YES 
user_id=80475

Here's a faster approach that doesn't involve a special
case:

	for (i = 0; i < n; i++)
		memcpy(p + i*n, p, size);
	for (j = 0; j < a->ob_size; j++){
		Py_INCREF_N(*p, n);
		p++;
	}

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=604716&group_id=5470