performance problem

Hi Travis, thanks for your incredibly quick fixes.
numpy is about 5 times slower than numarray (on my numarray-friendly bi-pentium):
>>> from timeit import Timer
>>> import numarray; print numarray.__version__
1.5.0
>>> import numpy; print numpy.__version__
0.9.5.2021
>>> t1 = Timer('a <<= 8', 'import numarray as NX; a = NX.ones(10**6, NX.UInt32)')
>>> t2 = Timer('a <<= 8', 'import numpy as NX; a = NX.ones(10**6, NX.UInt32)')
>>> t1.timeit(100)
0.21813011169433594
>>> t2.timeit(100)
1.1523458957672119
Numeric-23.1 is about as fast as numarray for inplace left shifts.
Gerard

Gerard Vermeulen wrote:
Hi Travis, thanks for your incredibly quick fixes.
numpy is about 5 times slower than numarray (on my numarray-friendly bi-pentium):
This is a coercion issue.
NumPy is exercising the most complicated section of the ufunc code, which requires buffered casting, because you are asking it to combine a uint32 array with a signed scalar (the only commensurable type is a signed scalar of type int64, and the result is then coerced back into the unsigned array).
Look at the data-type of b = a << 8:
>>> (a << 8).dtype.name
'int64'
Try
>>> a <<= val
where val = uint32(8) is in the header. You should see more commensurate times.
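As a sketch of that suggestion (spelled with the modern `import numpy as np` convention rather than the 2006-era `from numpy import *`): a scalar of the same kind as the array keeps the whole operation in uint32, so no buffered casting is needed.

```python
import numpy as np

a = np.ones(10, dtype=np.uint32)
val = np.uint32(8)     # scalar of the same kind as the array

b = a << val           # no cross-kind coercion: result stays uint32
print(b.dtype)         # uint32
print(int(b[0]))       # 256

a <<= val              # in-place shift, also uint32 throughout
print(a.dtype)         # uint32
```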
We can of course discuss the appropriateness of the coercion rules. Basically, scalars don't cause coercion unless they are of a different kind, but as of now signed and unsigned integers are considered to be of different kinds. I think there is a valid point to be made that all scalar integers should be treated the same, since Python only has one way to enter an integer.
Right now, this is not what's done, but it could be changed rather easily.
-Travis

Gerard Vermeulen wrote:
coercion issue snipped
In current SVN, I think I improved on the current state by only calling a scalar a signed integer if it is actually negative (previously only its data-type was checked, and all Python integers get converted to the PyArray_LONG data-type, which is a signed integer).
This fixes the immediate problem, I think.
What are opinions on this scalar-coercion rule?
-Travis

On Mon, 30 Jan 2006 01:52:53 -0700 Travis Oliphant oliphant.travis@ieee.org wrote:
Gerard Vermeulen wrote:
coercion issue snipped
In current SVN, I think I improved on the current state by only calling a scalar a signed integer if it is actually negative (previously only its data-type was checked, and all Python integers get converted to the PyArray_LONG data-type, which is a signed integer).
This fixes the immediate problem, I think.
What are opinions on this scalar-coercion rule?
Hmm, this is a consequence of your rule:
>>> from numpy import *; core.__version__
'0.9.5.2024'
>>> a = arange(3, dtype=uint32)
>>> a - 3
array([4294967293, 4294967294, 4294967295], dtype=uint32)
>>> a + -3
array([-3, -2, -1], dtype=int64)
>>> (a - 3) == (a + -3)
array([False, False, False], dtype=bool)
Do you think that the speed-up justifies this? I don't think so.
All my performance issues are discovered while writing demo examples that transfer data between a Python-wrapped C++ library and Numeric, numarray, or numpy. In that state of mind, it surprises me when a uint32 ndarray gets coerced to an int64 ndarray.
I rather prefer numarray's approach, which raises an overflow error for the a + -3 above.
Agreed that sometimes a silent coercion is a good thing, but somebody who has started with a ubyte ndarray is likely to want a ubyte array in the end.
I don't want to start a flame war; I'm happy to write a - uint32(3) for numpy-specific code.
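For what it's worth, that spelling looks like this (sketched with the modern `import numpy as np` style): subtracting a same-kind scalar keeps the dtype and wraps modulo 2**32 instead of upcasting to a signed type.

```python
import numpy as np

a = np.arange(3, dtype=np.uint32)
b = a - np.uint32(3)   # same-kind scalar: no upcast to int64

print(b.dtype)         # uint32
print(b.tolist())      # values wrap modulo 2**32
```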
Gerard

Gerard Vermeulen wrote:
In current SVN, I think I improved on the current state by only calling a scalar a signed integer if it is actually negative (previously only its data-type was checked, and all Python integers get converted to the PyArray_LONG data-type, which is a signed integer).
This fixes the immediate problem, I think.
What are opinions on this scalar-coercion rule?
Hmm, this is a consequence of your rule:
>>> from numpy import *; core.__version__
'0.9.5.2024'
>>> a = arange(3, dtype=uint32)
>>> a - 3
array([4294967293, 4294967294, 4294967295], dtype=uint32)
>>> a + -3
array([-3, -2, -1], dtype=int64)
>>> (a - 3) == (a + -3)
array([False, False, False], dtype=bool)
Do you think that the speed-up justifies this? I don't think so.
It's still hard to say whether it justifies it or not. One way of writing a - 3 causes automatic upcasting while the other way doesn't. This might be a good thing, depending on your point of view. I could see people needing both situations. These things are never as clear as we'd like them to be.
But, I could also accept a rule that treated *all* integers as the same kind in which case a-3 and a+(-3) would always return the same thing.
I'm fine with it either way. So what are other opinions?
-Travis

Gerard Vermeulen wrote:
0.9.5.2021
>>> t1 = Timer('a <<= 8', 'import numarray as NX; a = NX.ones(10**6, NX.UInt32)')
>>> t2 = Timer('a <<= 8', 'import numpy as NX; a = NX.ones(10**6, NX.UInt32)')
>>> t1.timeit(100)
0.21813011169433594
>>> t2.timeit(100)
1.1523458957672119
While ultimately this slow-down was related to a coercion issue, I did still wonder about the extra dereference in the 1-d loop when one of the inputs is a scalar. So I added a patch that checks for that case and defines a different loop.
It seemed to give a small performance boost on my system. I'm wondering if such special-case coding is wise in general. Are there other ways to get C-compilers to produce faster code on modern machines?
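The real loops live in NumPy's C ufunc machinery, but the idea can be sketched in Python (the names and loop structure here are illustrative, not NumPy's actual code): the generic strided loop dereferences every operand on each iteration, with a scalar passed as a stride-0 operand, while the special-cased loop reads the scalar once and keeps it in a local.

```python
MASK32 = 0xFFFFFFFF  # emulate uint32 wraparound

def lshift_loop_generic(n, in1, s1, in2, s2, out):
    """Generic 1-d loop: both inputs advanced by their strides.
    A scalar operand has stride 0 and is re-read every iteration."""
    i1 = i2 = 0
    for k in range(n):
        out[k] = (in1[i1] << in2[i2]) & MASK32
        i1 += s1
        i2 += s2

def lshift_loop_scalar(n, in1, s1, scalar, out):
    """Special case: the scalar is dereferenced once, outside the loop."""
    shift = scalar  # hoisted into a local
    i1 = 0
    for k in range(n):
        out[k] = (in1[i1] << shift) & MASK32
        i1 += s1

data = list(range(5))
out1, out2 = [0] * 5, [0] * 5
lshift_loop_generic(5, data, 1, [8], 0, out1)  # scalar as stride-0 operand
lshift_loop_scalar(5, data, 1, 8, out2)        # scalar special-cased
print(out1 == out2)  # True: both loops compute the same result
```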
-Travis

On 1/30/06, Travis Oliphant oliphant@ee.byu.edu wrote:
Are there other ways to get C-compilers to produce faster code on modern machines?
I would recommend taking a look at http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf if you have not seen it before. Although written by AMD, many of the recommendations apply to most modern CPUs. I've found Chapter 3 particularly informative. In fact, I've changed my coding habits after reading some of their recommendations (for example, "Use Array-Style Instead of Pointer-Style Code").
-- sasha

Sasha wrote:
I would recommend to take a look at http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf
Nice reference, thanks.
From that:
""" Copy Frequently Dereferenced Pointer Arguments to Local Variables:
Avoid frequently dereferencing pointer arguments inside a function. Since the compiler has no knowledge of whether aliasing exists between the pointers, such dereferencing cannot be optimized away by the compiler. This prevents data from being kept in registers and significantly increases memory traffic.
Note that many compilers have an “assume no aliasing” optimization switch. This allows the compiler to assume that two different pointers always have disjoint contents and does not require copying of pointer arguments to local variables. Otherwise, copy the data pointed to by the pointer arguments to local variables at the start of the function and if necessary copy them back at the end of the function. """
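As a loose Python analogue of that advice (illustrative only; the AMD note is about C pointer arguments, and the names below are made up for this sketch), copying a repeatedly looked-up value into a local before a hot loop cuts the per-iteration indirection in the same spirit:

```python
import math

def hypots_naive(pairs):
    """Re-resolve math.hypot (two lookups) on every iteration."""
    out = []
    for x, y in pairs:
        out.append(math.hypot(x, y))
    return out

def hypots_hoisted(pairs):
    """Copy the frequently dereferenced names to locals up front."""
    hypot = math.hypot     # one lookup, kept in a local
    out = []
    append = out.append    # likewise for the bound method
    for x, y in pairs:
        append(hypot(x, y))
    return out

pairs = [(3, 4), (5, 12)]
print(hypots_naive(pairs) == hypots_hoisted(pairs))  # True
```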
Which perhaps helps answer Travis' original question.
Did it make much difference in this case, Travis?
-Chris

Christopher Barker wrote:
Which perhaps helps answer Travis' original question.
Did it make much difference in this case, Travis?
Some difference in that case. For 10**6 elements, the relevant loops went from about 34 msec/loop (using the timeit module) to about 31 msec/loop, for a savings of about 3 msec/loop on my AMD platform.
-Travis
participants (5)
- Christopher Barker
- Gerard Vermeulen
- Sasha
- Travis Oliphant
- Travis Oliphant