performance problem

Hi Travis, thanks for your incredibly quick fixes.
numpy is about 5 times slower than numarray (on my numarray-friendly bi-pentium):
>>> from timeit import Timer
>>> import numarray; print numarray.__version__
1.5.0
>>> import numpy; print numpy.__version__
0.9.5.2021
>>> t1 = Timer('a <<= 8', 'import numarray as NX; a = NX.ones(10**6, NX.UInt32)')
>>> t2 = Timer('a <<= 8', 'import numpy as NX; a = NX.ones(10**6, NX.UInt32)')
>>> t1.timeit(100)
0.21813011169433594
>>> t2.timeit(100)
1.1523458957672119
Numeric-23.1 is about as fast as numarray for inplace left shifts.
Gerard

Gerard Vermeulen wrote:
Hi Travis, thanks for your incredibly quick fixes.
numpy is about 5 times slower than numarray (on my numarray-friendly bi-pentium):
This is a coercion issue.
NumPy is exercising the most complicated section of the ufunc code, which requires buffered casting, because you are asking it to combine a uint32 array with a signed scalar (the only commensurable type is a signed scalar of type int64, and the result is then coerced back into the unsigned array).
Look at the data-type of b = a << 8:
>>> (a << 8).dtype.name
'int64'
Try
>>> a <<= val
where val = uint32(8) is in the header. You should see more commensurate times.
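As a sketch of that suggestion (spelled with the modern `import numpy as np` convention rather than the 2006-era `from numpy import *`): a scalar of the same kind as the array keeps the whole operation in uint32, so no buffered casting is needed.

```python
import numpy as np

a = np.ones(10, dtype=np.uint32)
val = np.uint32(8)     # scalar of the same kind as the array

b = a << val           # no cross-kind coercion: result stays uint32
print(b.dtype)         # uint32
print(int(b[0]))       # 256

a <<= val              # in-place shift, also uint32 throughout
print(a.dtype)         # uint32
```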
We can of course discuss the appropriateness of the coercion rules. Basically, scalars don't cause coercion unless they are of a different kind, but as of now signed and unsigned integers are considered to be of different kinds. I think there is a valid point to be made that all scalar integers should be treated the same, since Python only has one way to enter an integer.
Right now, this is not what's done, but it could be changed rather easily.
-Travis

Gerard Vermeulen wrote:
coercion issue snipped
In current SVN, I think I improved on the current state by only calling a scalar a signed integer if it is actually negative (previously only its data-type was checked, and all Python integers get converted to the PyArray_LONG data-type, which is a signed integer).
This fixes the immediate problem, I think.
What are opinions on this scalar-coercion rule?
-Travis

On Mon, 30 Jan 2006 01:52:53 -0700 Travis Oliphant oliphant.travis@ieee.org wrote:
Gerard Vermeulen wrote:
coercion issue snipped
In current SVN, I think I improved on the current state by only calling a scalar a signed integer if it is actually negative (previously only its data-type was checked, and all Python integers get converted to the PyArray_LONG data-type, which is a signed integer).
This fixes the immediate problem, I think.
What are opinions on this scalar-coercion rule?
Hmm, this is a consequence of your rule:
>>> from numpy import *; core.__version__
'0.9.5.2024'
>>> a = arange(3, dtype=uint32)
>>> a - 3
array([4294967293, 4294967294, 4294967295], dtype=uint32)
>>> a + -3
array([-3, -2, -1], dtype=int64)
>>> (a - 3) == (a + -3)
array([False, False, False], dtype=bool)
Do you think that the speed-up justifies this? I don't think so.
All my performance issues are discovered while writing demo examples that transfer data between a Python-wrapped C++ library and Numeric, numarray, or numpy. In that state of mind, it surprises me when a uint32 ndarray gets coerced to an int64 ndarray.
I rather prefer numarray's approach, which raises an overflow error for the a + -3 above.
Agreed that sometimes a silent coercion is a good thing, but somebody who has started with a ubyte ndarray is likely to want a ubyte array in the end.
I don't want to start a flame war; I'm happy to write a - uint32(3) for numpy-specific code.
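For what it's worth, that spelling looks like this (sketched with the modern `import numpy as np` style): subtracting a same-kind scalar keeps the dtype and wraps modulo 2**32 instead of upcasting to a signed type.

```python
import numpy as np

a = np.arange(3, dtype=np.uint32)
b = a - np.uint32(3)   # same-kind scalar: no upcast to int64

print(b.dtype)         # uint32
print(b.tolist())      # values wrap modulo 2**32
```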
Gerard

Gerard Vermeulen wrote:
In current SVN, I think I improved on the current state by only calling a scalar a signed integer if it is actually negative (previously only its data-type was checked, and all Python integers get converted to the PyArray_LONG data-type, which is a signed integer).
This fixes the immediate problem, I think.
What are opinions on this scalar-coercion rule?
Hmm, this is a consequence of your rule:
>>> from numpy import *; core.__version__
'0.9.5.2024'
>>> a = arange(3, dtype=uint32)
>>> a - 3
array([4294967293, 4294967294, 4294967295], dtype=uint32)
>>> a + -3
array([-3, -2, -1], dtype=int64)
>>> (a - 3) == (a + -3)
array([False, False, False], dtype=bool)
Do you think that the speed-up justifies this? I don't think so.
It's still hard to say whether it justifies it or not. One way of writing a - 3 causes automatic upcasting while the other way doesn't. This might be a good thing, depending on your point of view. I could see people needing both situations. These things are never as clear as we'd like them to be.
But, I could also accept a rule that treated *all* integers as the same kind in which case a-3 and a+(-3) would always return the same thing.
I'm fine with it either way. So what are other opinions?
-Travis

Gerard Vermeulen wrote:
0.9.5.2021
>>> t1 = Timer('a <<= 8', 'import numarray as NX; a = NX.ones(10**6, NX.UInt32)')
>>> t2 = Timer('a <<= 8', 'import numpy as NX; a = NX.ones(10**6, NX.UInt32)')
>>> t1.timeit(100)
0.21813011169433594
>>> t2.timeit(100)
1.1523458957672119
While ultimately this slow-down was related to a coercion issue, I did still wonder about the extra dereference in the 1-d loop when one of the inputs is a scalar. So I added a patch that checks for that case and defines a different loop.
It seemed to give a small performance boost on my system. I'm wondering if such special-case coding is wise in general. Are there other ways to get C-compilers to produce faster code on modern machines?
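The real loops live in NumPy's C ufunc machinery, but the idea can be sketched in Python (the names and loop structure here are illustrative, not NumPy's actual code): the generic strided loop dereferences every operand on each iteration, with a scalar passed as a stride-0 operand, while the special-cased loop reads the scalar once and keeps it in a local.

```python
MASK32 = 0xFFFFFFFF  # emulate uint32 wraparound

def lshift_loop_generic(n, in1, s1, in2, s2, out):
    """Generic 1-d loop: both inputs advanced by their strides.
    A scalar operand has stride 0 and is re-read every iteration."""
    i1 = i2 = 0
    for k in range(n):
        out[k] = (in1[i1] << in2[i2]) & MASK32
        i1 += s1
        i2 += s2

def lshift_loop_scalar(n, in1, s1, scalar, out):
    """Special case: the scalar is dereferenced once, outside the loop."""
    shift = scalar  # hoisted into a local
    i1 = 0
    for k in range(n):
        out[k] = (in1[i1] << shift) & MASK32
        i1 += s1

data = list(range(5))
out1, out2 = [0] * 5, [0] * 5
lshift_loop_generic(5, data, 1, [8], 0, out1)  # scalar as stride-0 operand
lshift_loop_scalar(5, data, 1, 8, out2)        # scalar special-cased
print(out1 == out2)  # True: both loops compute the same result
```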
-Travis

On 1/30/06, Travis Oliphant oliphant@ee.byu.edu wrote:
Are there other ways to get C-compilers to produce faster code on modern machines?
I would recommend taking a look at http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf if you have not seen it before. Although written by AMD, many of the recommendations apply to most modern CPUs. I've found Chapter 3 particularly informative. In fact, I've changed my coding habits after reading some of their recommendations (for example, "Use Array-Style Instead of Pointer-Style Code").
-- sasha

Sasha wrote:
I would recommend to take a look at http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf
Nice reference, thanks.
From that:
""" Copy Frequently Dereferenced Pointer Arguments to Local Variables:
Avoid frequently dereferencing pointer arguments inside a function. Since the compiler has no knowledge of whether aliasing exists between the pointers, such dereferencing cannot be optimized away by the compiler. This prevents data from being kept in registers and significantly increases memory traffic.
Note that many compilers have an “assume no aliasing” optimization switch. This allows the compiler to assume that two different pointers always have disjoint contents and does not require copying of pointer arguments to local variables. Otherwise, copy the data pointed to by the pointer arguments to local variables at the start of the function and if necessary copy them back at the end of the function. """
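As a loose Python analogue of that advice (illustrative only; the AMD note is about C pointer arguments, and the names below are made up for this sketch), copying a repeatedly looked-up value into a local before a hot loop cuts the per-iteration indirection in the same spirit:

```python
import math

def hypots_naive(pairs):
    """Re-resolve math.hypot (two lookups) on every iteration."""
    out = []
    for x, y in pairs:
        out.append(math.hypot(x, y))
    return out

def hypots_hoisted(pairs):
    """Copy the frequently dereferenced names to locals up front."""
    hypot = math.hypot     # one lookup, kept in a local
    out = []
    append = out.append    # likewise for the bound method
    for x, y in pairs:
        append(hypot(x, y))
    return out

pairs = [(3, 4), (5, 12)]
print(hypots_naive(pairs) == hypots_hoisted(pairs))  # True
```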
Which perhaps helps answer Travis' original question.
Did it make much difference in this case, Travis?
-Chris

Christopher Barker wrote:
Which perhaps helps answer Travis' original question.
Did it make much difference in this case, Travis?
Some difference in that case. For 10**6 elements, the relevant loops went from about 34 msec/loop (using the timeit module) to about 31 msec/loop, for a savings of about 3 msec/loop on my AMD platform.
-Travis
participants (5)
- Christopher Barker
- Gerard Vermeulen
- Sasha
- Travis Oliphant
- Travis Oliphant