Native byteorder representation
Hi,
We have been bitten by a small glitch related to the representation of native byteorders. Here is an example exposing the problem:
>>> numpy.dtype('<i4').byteorder
'='
>>> numpy.dtype('>i4').newbyteorder('little').byteorder
'<'
[the example was run on a little-endian machine]
We thought that native byteorders were always represented by '=', and this is true when you create the dtype from scratch. But if you create a dtype with a different byteorder and then switch to the native one (in this case, 'little'), the representation of the byteorder changes to '<' instead of '='. We can live with this, but IMO it would be better if the final representation of native byteorders always read '='.
Thanks,
--
Francesc Altet, Cárabos Coop. V., http://www.carabos.com/
This is somewhat inconsistent. But I'm not sure it's worth changing. In the second case, you request a "little" byteorder data-type. Keeping this as '<' seems O.K. One could instead ask why the first example did not report a byte-order of "<" when that's what was explicitly asked for.
-Travis
On Fri, 02 Feb 2007 at 10:22 -0700, Travis Oliphant wrote:
> One could instead ask why the first example did not report a byte-order of "<" when that's what was explicitly asked for.
Well, just because of the same reason that numpy.dtype('<i4').byteorder returns a '=' instead of a '<' (the latter being explicitly set in the constructor). I think that returning a '=' whenever the byteorder is the same as that of the underlying machine is desirable because the user can quickly see whether her data is in native order or not.
Cheers,
--
Francesc Altet, Carabos Coop. V., www.carabos.com
"Be careful about using the following code -- I've only proven that it works, I haven't tested it." -- Donald Knuth
On Fri, 02 Feb 2007 at 19:11 +0100, Francesc Altet wrote:
> Well, just because of the same reason that numpy.dtype('<i4').byteorder returns a '=' instead of a '<' (the latter being explicitly set in the constructor).
Oops. I was confused about what you were saying here, sorry. Forget this.
> I think that returning a '=' whenever the byteorder is the same as that of the underlying machine is desirable because the user can quickly see whether her data is in native order or not.
I think that this is the only reason I can argue.
--
Francesc Altet
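The wish expressed above, seeing at a glance whether data is in native order, can be approximated from user code without changing numpy. The following is only a minimal sketch: `normalized_byteorder` is a hypothetical helper name, not a numpy function, and it relies on nothing beyond `sys.byteorder` and the existing `dtype.byteorder` and `dtype.isnative` attributes.

```python
import sys
import numpy as np

# Explicit byte-order code that corresponds to this machine's native order.
NATIVE_CODE = '<' if sys.byteorder == 'little' else '>'

def normalized_byteorder(dt):
    """Return '=' for any native-order dtype, however it was constructed.

    Hypothetical helper (not part of numpy): '|' (byte order not
    applicable) and the non-native code are passed through unchanged.
    """
    code = np.dtype(dt).byteorder
    if code in ('=', NATIVE_CODE):
        return '='
    return code

swapped = np.dtype('i4').newbyteorder('S')           # 'S' swaps the byte order
explicit_native = swapped.newbyteorder(NATIVE_CODE)  # switch back explicitly

print(normalized_byteorder(explicit_native))  # '=' whatever .byteorder reports
print(explicit_native.isnative)               # numpy's own native-order check
print(normalized_byteorder(swapped))          # the non-native code
```

For a plain yes/no answer, `dtype.isnative` alone may be enough; the helper only matters when a single-character code is wanted.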
On 2/3/07, Stephen Simmons <mail@stevesimmons.com> wrote:
> Does anyone know why there is an order of magnitude difference in the speed of numpy's array.sum() function depending on the axis of the matrix summed?
> To see this, import numpy and create a big array with two rows:
> >>> import numpy
> >>> a = numpy.ones([2,1000000], 'f4')
> Then using ipython's timeit function:
>                                                 Time (ms)
> sum(a)                                              20
> a.sum()                                              9
> a.sum(axis=1)                                        9
> a.sum(axis=0)                                      159
> numpy.dot(numpy.ones(a.shape[0], a.dtype), a)       15
> This last one using a dot product is functionally equivalent to a.sum(axis=0), suggesting that the slowdown is due to how indexing is implemented in array.sum().
I don't know how much time this would account for, but a.sum(0) has to create a much larger array than a.sum(1) does.
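The measurements quoted above can be reproduced outside ipython with the standard `timeit` module. This is a sketch under stated assumptions: the `candidates` dictionary and the repeat counts are illustrative choices, and the timings on a current numpy build will not necessarily resemble the 2007 figures. It also prints the output shapes, which shows the size difference mentioned in the reply.

```python
import timeit
import numpy as np

a = np.ones([2, 1000000], 'f4')
ones = np.ones(a.shape[0], a.dtype)

# The two reductions produce outputs of very different sizes.
print(a.sum(axis=0).shape)  # (1000000,)
print(a.sum(axis=1).shape)  # (2,)

# The dot-product form gives the same values as a.sum(axis=0).
assert np.allclose(np.dot(ones, a), a.sum(axis=0))

candidates = {
    'a.sum()': lambda: a.sum(),
    'a.sum(axis=1)': lambda: a.sum(axis=1),
    'a.sum(axis=0)': lambda: a.sum(axis=0),
    'np.dot(ones, a)': lambda: np.dot(ones, a),
}
for label, func in candidates.items():
    # Best of 3 repeats of 10 calls each, reported in ms per call.
    best = min(timeit.repeat(func, number=10, repeat=3)) / 10
    print(f'{label:18s} {best * 1e3:8.3f} ms')
```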
Keith Goodman wrote:
> I don't know how much time this would account for, but a.sum(0) has to create a much larger array than a.sum(1) does.
However, so do sum(a) and numpy.dot().
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On 2/3/07, Robert Kern <robert.kern@gmail.com> wrote:
> However, so do sum(a) and numpy.dot().
The speed difference across axis 0 and 1 is also seen in Octave and Matlab (but it is more like a factor of 5). But in those languages axis=0 is much faster. And numpy, if I remember, stores arrays in the opposite way from Octave (by row or column, I forget). So a lot of the speed difference could be in how the array is stored.
http://velveeta.che.wisc.edu/octave/lists/help-octave/2005/2195
http://velveeta.che.wisc.edu/octave/lists/help-octave/2005/1912
http://velveeta.che.wisc.edu/octave/lists/help-octave/2005/1897
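On the storage question: numpy arrays are row-major (C order) by default, whereas Octave and Matlab store arrays column-major (Fortran order). A small sketch of how to inspect this, using the array from the original post; the variable name `f` is just an illustration.

```python
import numpy as np

a = np.ones([2, 1000000], 'f4')   # numpy default: C (row-major) order
f = np.asfortranarray(a)          # same values, column-major order

# Strides are the byte steps needed to move one index along each axis.
print(a.flags['C_CONTIGUOUS'], a.strides)  # True (4000000, 4)
print(f.flags['F_CONTIGUOUS'], f.strides)  # True (4, 8)
```

In the C-ordered array the elements of one row are adjacent in memory, so a reduction along axis=1 walks contiguous data, while a reduction along axis=0 has to step 4000000 bytes between the two values it adds.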
On 2/3/07, Stephen Simmons <mail@stevesimmons.com> wrote:
> Does anyone know why there is an order of magnitude difference in the speed of numpy's array.sum() function depending on the axis of the matrix summed?
In this case it is expected. There are inner and outer loops; in the slow case the inner loop with its extra code is called 1000000 times, in the fast case, twice. On the other hand, note this:
In [10]: timeit a[0,:] + a[1,:]
100 loops, best of 3: 19.7 ms per loop
Which has only one loop. Caching could also be a problem, but in this case it is dominated by loop overhead.
Chuck
On 2/3/07, Charles R Harris <charlesr.harris@gmail.com> wrote:
> Which has only one loop. Caching could also be a problem, but in this case it is dominated by loop overhead.
PS, I think this indicates that the code would run faster in this case if it accumulated along the last axis, one at a time for each leading index. I suspect that the current implementation accumulates down the first axis, then repeats for each of the last indices. This shows that rearranging the way the accumulation is done could be a big gain, especially if the largest axis is chosen.
Chuck
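The rearrangement described in this PS can be illustrated at the Python level. This is only a sketch of the idea for a 2-D array with at least one row; `sum_axis0_by_rows` is a hypothetical name, and the loop says nothing about how numpy's C implementation is actually organised.

```python
import numpy as np

def sum_axis0_by_rows(a):
    """Sum a 2-D array over axis 0 by adding whole rows.

    Each step is one vectorised pass along the last (long) axis, so the
    Python-level loop runs a.shape[0] - 1 times instead of once per column.
    """
    out = a[0].copy()      # start from the first row
    for row in a[1:]:      # one accumulation per remaining leading index
        out += row
    return out

a = np.ones([2, 1000000], 'f4')
result = sum_axis0_by_rows(a)
print(np.array_equal(result, a.sum(axis=0)))  # True
```

For the two-row array this reduces to the single `a[0,:] + a[1,:]` addition timed above.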
Charles R Harris wrote:
> In this case it is expected. There are inner and outer loops; in the slow case the inner loop with its extra code is called 1000000 times, in the fast case, twice.
I agree that summing along the longer axis is most probably slower because it makes more passes through the inner loop. The question though is whether all of the inner loop's overhead is necessary. My counterexample using numpy.dot() suggests there's considerable scope for improvement, at least for certain common cases.
Stephen Simmons wrote:
> The question though is whether all of the inner loop's overhead is necessary. My counterexample using numpy.dot() suggests there's considerable scope for improvement, at least for certain common cases.
Well, yes. You most likely have an ATLAS-accelerated dot(). The ATLAS developers put a lot of work into making matrix products really fast. However, they did so at a cost: different architectures use different code. That's not really something we can do in the core of numpy without making numpy as difficult to build as ATLAS is.
--
Robert Kern
On 2/3/07, Robert Kern <robert.kern@gmail.com> wrote:
> Well, yes. You most likely have an ATLAS-accelerated dot(). The ATLAS developers put a lot of work into making matrix products really fast. However, they did so at a cost: different architectures use different code. That's not really something we can do in the core of numpy without making numpy as difficult to build as ATLAS is.
Maybe this argument could be inverted: maybe numpy could check if ATLAS is installed and automatically switch to the numpy.dot(numpy.ones(a.shape[0], a.dtype), a) variant that Stephen suggested. Of course -- as I see it -- the numpy.ones(...) part requires lots of extra memory. Maybe there are other downsides ... !?
-Sebastian
On 2/4/07, Sebastian Haase <haase@msg.ucsf.edu> wrote:
> Maybe this argument could be inverted: maybe numpy could check if ATLAS is installed and automatically switch to the numpy.dot(numpy.ones(a.shape[0], a.dtype), a) variant that Stephen suggested.
> Of course -- as I see it -- the numpy.ones(...) part requires lots of extra memory. Maybe there are other downsides ... !?
I use multiplication instead of sum in heavily used loops. I'm often able to predefine the ones outside the loop. In Octave I made my own sum functions---separate ones for axis 0 and 1---that use multiplication. Maybe it is better to make a new function rather than complicate the existing one.
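A numpy version of the approach described here might look like the following. It is a sketch only, assuming 2-D floating-point input: `sum0`, `sum1` and the optional `ones` argument are hypothetical names chosen for illustration, not existing numpy functions.

```python
import numpy as np

def sum0(a, ones=None):
    """Sum a 2-D array over axis 0 as a vector-matrix product."""
    if ones is None:
        ones = np.ones(a.shape[0], a.dtype)
    return np.dot(ones, a)

def sum1(a, ones=None):
    """Sum a 2-D array over axis 1 as a matrix-vector product."""
    if ones is None:
        ones = np.ones(a.shape[1], a.dtype)
    return np.dot(a, ones)

a = np.arange(12, dtype='f8').reshape(3, 4)

# Predefine the ones vector once, outside the heavily used loop.
ones0 = np.ones(a.shape[0], a.dtype)
for _ in range(3):
    col_totals = sum0(a, ones0)

print(col_totals)  # [12. 15. 18. 21.]
print(sum1(a))     # [ 6. 22. 38.]
```

Keeping these as separate functions leaves the existing sum() untouched, which is the trade-off suggested above.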
participants (7)
- Charles R Harris
- Francesc Altet
- Keith Goodman
- Robert Kern
- Sebastian Haase
- Stephen Simmons
- Travis Oliphant