performance problem
Hi Travis, thanks for your incredibly quick fixes. numpy is about 5 times slower than numarray (on my numarray-friendly bi-pentium):
>>> from timeit import Timer
>>> import numarray; print numarray.__version__
1.5.0
>>> import numpy; print numpy.__version__
0.9.5.2021
>>> t1 = Timer('a <<= 8', 'import numarray as NX; a = NX.ones(10**6, NX.UInt32)')
>>> t2 = Timer('a <<= 8', 'import numpy as NX; a = NX.ones(10**6, NX.UInt32)')
>>> t1.timeit(100)
0.21813011169433594
>>> t2.timeit(100)
1.1523458957672119
Numeric-23.1 is about as fast as numarray for inplace left shifts. Gerard
Gerard Vermeulen wrote:
Hi Travis, thanks for your incredibly quick fixes.
numpy is about 5 times slower than numarray (on my numarray-friendly bi-pentium):
This is a coercion issue. NumPy is exercising the most complicated section of the ufunc code, which requires buffered casting, because you are asking it to combine a uint32 array with a signed scalar (the only commensurable type is a signed int64, which is then coerced back into the unsigned array).

Look at the data-type of b = a << 8:

>>> (a << 8).dtype.name
'int64'

Try a <<= val, where val = uint32(8) is defined in the header. You should see more commensurate times.

We can of course discuss the appropriateness of the coercion rules. Basically, scalars don't cause coercion unless they are of a different kind, but as of now signed and unsigned integers are considered to be of different kinds. I think there is a valid point to be made that all scalar integers should be treated the same, since Python only has one way to enter an integer. Right now, this is not what's done, but it could be changed rather easily.

-Travis
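The workaround described here, shifting by an unsigned scalar so that no cross-kind (signed/unsigned) coercion is triggered, can be checked with a short sketch. This is written against a modern NumPy, where the exact promotion rules have since changed, but an in-place shift by a uint32 scalar still keeps the array's dtype:

```python
import numpy as np

# An unsigned 32-bit array, as in Gerard's benchmark (smaller here).
a = np.ones(10, dtype=np.uint32)

# Shifting in place by a matching unsigned scalar involves no
# cross-kind coercion, so no casting buffers are needed.
a <<= np.uint32(8)

print(a.dtype)  # the array stays uint32
print(a[0])     # 1 << 8 == 256
```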
Gerard Vermeulen wrote:
coercion issue snipped
In current SVN, I think I've improved on the current state by only calling a scalar a signed integer if it is actually negative (previously only its data-type was checked, and all Python integers get converted to the PyArray_LONG data-type, which is signed). This fixes the immediate problem, I think. What are opinions on this scalar-coercion rule? -Travis
On Mon, 30 Jan 2006 01:52:53 -0700 Travis Oliphant <oliphant.travis@ieee.org> wrote:
Gerard Vermeulen wrote:
coercion issue snipped
In current SVN, I think I've improved on the current state by only calling a scalar a signed integer if it is actually negative (previously only its data-type was checked, and all Python integers get converted to the PyArray_LONG data-type, which is signed).
This fixes the immediate problem, I think.
What are opinions on this scalar-coercion rule?
Hmm, this is a consequence of your rule:
>>> from numpy import *; core.__version__
'0.9.5.2024'
>>> a = arange(3, dtype=uint32)
>>> a - 3
array([4294967293, 4294967294, 4294967295], dtype=uint32)
>>> a + -3
array([-3, -2, -1], dtype=int64)
>>> (a - 3) == (a + -3)
array([False, False, False], dtype=bool)
Do you think that the speed-up justifies this? I don't think so. All my performance issues were discovered while writing demo examples that transfer data between a Python-wrapped C++ library and Numeric, numarray, or numpy. In that state of mind, it surprises me when a uint32 ndarray gets coerced to an int64 ndarray. I rather prefer numarray's approach, which raises an overflow error for the a + -3 above.
Agreed that sometimes a silent coercion is a good thing, but somebody who has started with a ubyte ndarray is likely to want a ubyte array in the end. I don't want to start a flame war; I'm happy to write a - uint32(3) in numpy-specific code. Gerard
Gerard Vermeulen wrote:
In current SVN, I think I've improved on the current state by only calling a scalar a signed integer if it is actually negative (previously only its data-type was checked, and all Python integers get converted to the PyArray_LONG data-type, which is signed).
This fixes the immediate problem, I think.
What are opinions on this scalar-coercion rule?
Hmm, this is a consequence of your rule:
>>> from numpy import *; core.__version__
'0.9.5.2024'
>>> a = arange(3, dtype=uint32)
>>> a - 3
array([4294967293, 4294967294, 4294967295], dtype=uint32)
>>> a + -3
array([-3, -2, -1], dtype=int64)
>>> (a - 3) == (a + -3)
array([False, False, False], dtype=bool)
Do you think that the speed-up justifies this? I don't think so.
It's still hard to say whether it justifies it or not. One way of writing a-3 causes automatic upcasting while the other way doesn't. That might be a good thing, depending on your point of view; I could see people needing both situations. These things are never as clear as we'd like them to be.

But I could also accept a rule that treated *all* integers as the same kind, in which case a-3 and a+(-3) would always return the same thing. I'm fine with it either way. So what are other opinions?

-Travis
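For the record, the uint32 side of the disputed example is plain modular arithmetic: subtracting 3 from 0, 1, 2 wraps around modulo 2**32. A minimal sketch against a modern NumPy (the dtype produced by a + -3 has varied across NumPy versions and promotion rules, so only the subtraction side is shown):

```python
import numpy as np

a = np.arange(3, dtype=np.uint32)

# 0 - 3 wraps around modulo 2**32 in unsigned arithmetic.
b = a - 3
print(b.dtype)             # uint32: the Python int 3 does not upcast the array
print(b[0] == 2**32 - 3)   # True
```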
Gerard Vermeulen wrote:
0.9.5.2021
>>> t1 = Timer('a <<= 8', 'import numarray as NX; a = NX.ones(10**6, NX.UInt32)')
>>> t2 = Timer('a <<= 8', 'import numpy as NX; a = NX.ones(10**6, NX.UInt32)')
>>> t1.timeit(100)
0.21813011169433594
>>> t2.timeit(100)
1.1523458957672119
While this slow-down was ultimately related to a coercion issue, I still wondered about the extra dereference in the 1-d loop when one of the inputs is a scalar. So, I added a patch that checks for that case and defines a different loop. It seemed to give a small performance boost on my system. I'm wondering if such special-case coding is wise in general. Are there other ways to get C compilers to produce faster code on modern machines? -Travis
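Measurements of this kind can still be reproduced with Python's timeit module, just as in Gerard's original report. A sketch (the absolute numbers are machine-dependent, and the 2006-era slow-down itself has long since been fixed):

```python
import timeit

# Same workload as the original benchmark: in-place left shift of a
# 10**6-element uint32 array, repeated 100 times.  Unsigned overflow
# simply wraps, so repeated shifting is safe.
setup = "import numpy as np; a = np.ones(10**6, dtype=np.uint32)"

t_scalar = timeit.timeit("a <<= 8", setup=setup, number=100)
t_typed = timeit.timeit("a <<= np.uint32(8)", setup=setup, number=100)

print("python-int shift: %.4f s" % t_scalar)
print("uint32 shift:     %.4f s" % t_typed)
```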
On 1/30/06, Travis Oliphant <oliphant@ee.byu.edu> wrote:
Are there other ways to get C-compilers to produce faster code on modern machines?
I would recommend taking a look at <http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf> if you have not seen it before. Although written by AMD, many of the recommendations apply to most modern CPUs. I've found Chapter 3 particularly informative. In fact, I've changed my coding habits after reading some of their recommendations (for example, "Use Array-Style Instead of Pointer-Style Code"). -- sasha
Sasha wrote:
I would recommend to take a look at <http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf>
Nice reference, thanks. From that:

"""
Copy Frequently Dereferenced Pointer Arguments to Local Variables

Avoid frequently dereferencing pointer arguments inside a function. Since the compiler has no knowledge of whether aliasing exists between the pointers, such dereferencing cannot be optimized away by the compiler. This prevents data from being kept in registers and significantly increases memory traffic.

Note that many compilers have an "assume no aliasing" optimization switch. This allows the compiler to assume that two different pointers always have disjoint contents and does not require copying of pointer arguments to local variables. Otherwise, copy the data pointed to by the pointer arguments to local variables at the start of the function and, if necessary, copy them back at the end of the function.
"""

Which perhaps helps answer Travis' original question. Did it make much difference in this case, Travis?

-Chris

--
Christopher Barker, Ph.D.
Oceanographer
NOAA/OR&R/HAZMAT
7600 Sand Point Way NE, Seattle, WA 98115
(206) 526-6959 voice / (206) 526-6329 fax / (206) 526-6317 main reception
Chris.Barker@noaa.gov
Christopher Barker wrote:
Which perhaps helps answer Travis' original question.
Did it make much difference in this case, Travis?
Some difference in that case. For 10**6 elements, the relevant loops went from about 34 msec/loop (measured with the timeit module) to about 31 msec/loop, a savings of about 3 msec/loop on my AMD platform. -Travis
participants (5)
- Christopher Barker
- Gerard Vermeulen
- Sasha
- Travis Oliphant
- Travis Oliphant