Hi,
I've written a snippet of code that we could call scipy.dot, a drop-in replacement for numpy.dot. It's dead easy: it simply addresses the need to call the right BLAS function depending on the memory order of the arrays, C or F (i.e. the well-known slowness of np.dot(A, A.T)).
While this is not the scipy mailing list, I was wondering if this snippet would be relevant and/or useful to others (numpy folks, scipy folks), or could be integrated directly in numpy (so that we keep the nice A.dot(B) syntax). This bottleneck of temporary copies has been a problem for lots of users, and it seems everybody has their own snippet. The code is probably not written as it should be; I hope the community can help improve it! ;) The first FIXME is to make it work for arrays with dimensions other than 2.
Suggestions highly appreciated!
Thanks!
=== Code (also on http://pastebin.com/QrRk0kEf)
def dot(A, B, out=None):
    """A drop-in replacement for numpy.dot.

    Computes A.B optimized using an fblas call.
    Note: unlike numpy, the returned array is in F order.
    """
    import numpy as np
    import scipy.linalg as sp
    gemm = sp.get_blas_funcs('gemm', arrays=(A, B))
    if out is None:
        lda, x, y, ldb = A.shape + B.shape
        if x != y:
            raise ValueError("matrices are not aligned")
        # pick the "largest" of the two dtypes for the output
        dtype = np.max([arr.dtype for arr in (A, B)])
        out = np.empty((lda, ldb), dtype, order='F')
    if A.flags.c_contiguous and B.flags.c_contiguous:
        gemm(alpha=1., a=A.T, b=B.T, trans_a=True, trans_b=True,
             c=out, overwrite_c=True)
    if A.flags.c_contiguous and B.flags.f_contiguous:
        gemm(alpha=1., a=A.T, b=B, trans_a=True, c=out, overwrite_c=True)
    if A.flags.f_contiguous and B.flags.c_contiguous:
        gemm(alpha=1., a=A, b=B.T, trans_b=True, c=out, overwrite_c=True)
    if A.flags.f_contiguous and B.flags.f_contiguous:
        gemm(alpha=1., a=A, b=B, c=out, overwrite_c=True)
    return out
==
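A quick sanity check of the get_blas_funcs call the snippet relies on (a minimal sketch, independent of the dot() above; dtype promotion selects dgemm for a float32/float64 pair):

```python
import numpy as np
from scipy.linalg import get_blas_funcs

A = np.ones((4, 4), dtype=np.float32)
B = np.ones((4, 4), dtype=np.float64)

# get_blas_funcs picks the BLAS routine matching the arrays' common
# dtype (dgemm here, since float64 wins over float32).
gemm = get_blas_funcs('gemm', (A, B))
res = gemm(alpha=1.0, a=A, b=B)

print(res[0, 0])  # 4.0 -- each entry is a dot product of two all-ones rows
```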
Timing (EPD, MKL):
In [15]: A = np.array(np.random.randn(1000, 1000), 'f')
In [16]: %timeit np.dot(A, A)
100 loops, best of 3: 7.19 ms per loop

In [17]: %timeit np.dot(A.T, A.T)
10 loops, best of 3: 27.7 ms per loop

In [18]: %timeit np.dot(A, A.T)
100 loops, best of 3: 18.3 ms per loop

In [19]: %timeit np.dot(A.T, A)
100 loops, best of 3: 18.7 ms per loop

In [20]: %timeit dot(A, A)
100 loops, best of 3: 7.16 ms per loop

In [21]: %timeit dot(A.T, A.T)
100 loops, best of 3: 6.67 ms per loop

In [22]: %timeit dot(A, A.T)
100 loops, best of 3: 6.79 ms per loop

In [23]: %timeit dot(A.T, A)
100 loops, best of 3: 7.02 ms per loop
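The asymmetry the snippet targets can be reproduced with a self-contained timing sketch (absolute numbers are machine- and BLAS-dependent; on a numpy build that already avoids the copy, the two times come out close):

```python
import timeit
import numpy as np

A = np.array(np.random.randn(500, 500), 'f')

# On an unpatched numpy, np.dot(A, A.T) first copies the F-ordered A.T
# into a C-ordered temporary before calling sgemm; that copy is the
# overhead being measured here.
t_cc = min(timeit.repeat(lambda: np.dot(A, A), number=20, repeat=3))
t_ct = min(timeit.repeat(lambda: np.dot(A, A.T), number=20, repeat=3))
print(t_cc, t_ct)
```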
On Wed, Nov 7, 2012 at 10:41 PM, Nicolas SCHEFFER scheffer.nicolas@gmail.com wrote:
I think everyone would be very happy to see numpy.dot modified to do this automatically. But adding a scipy.dot IMHO would be fixing things in the wrong place and just create extra confusion.
Is it possible to avoid changing the default output order from C to F? (E.g. by transposing everything relative to what you have now?) That seems like a change that would be good to avoid if it's easy.
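One way to do that with the same fblas gemm wrapper (an untested sketch, not the snippet's actual code): since (A.B).T == B.T . A.T, write B.T . A.T into out.T, an F-ordered view of a C-ordered buffer:

```python
import numpy as np
from scipy.linalg import get_blas_funcs

A = np.asarray(np.random.randn(3, 4), order='C')
B = np.asarray(np.random.randn(4, 5), order='C')
gemm = get_blas_funcs('gemm', (A, B))

# C-ordered output buffer; its transpose is an F-contiguous view,
# which is exactly what the Fortran gemm wants to write into.
out = np.empty((A.shape[0], B.shape[1]), dtype=A.dtype, order='C')

# (A.B).T == B.T . A.T; B.T and A.T are F-contiguous views of the
# C-ordered inputs, so no temporary copies are made.
gemm(alpha=1.0, a=B.T, b=A.T, c=out.T, overwrite_c=True)

print(out.flags.c_contiguous)  # True
```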
-n
On Thu, Nov 08, 2012 at 11:28:21AM +0000, Nathaniel Smith wrote:
I think everyone would be very happy to see numpy.dot modified to do this automatically. But adding a scipy.dot IMHO would be fixing things in the wrong place and just create extra confusion.
I am not sure I agree: numpy is often compiled without LAPACK support, as it is not necessary. On the other hand, scipy is always compiled with LAPACK. Thus this makes more sense in scipy.
Is it possible to avoid changing the default output order from C to F?
+1
G
On 11/08/2012 01:07 PM, Gael Varoquaux wrote:
On Thu, Nov 08, 2012 at 11:28:21AM +0000, Nathaniel Smith wrote:
I think everyone would be very happy to see numpy.dot modified to do this automatically. But adding a scipy.dot IMHO would be fixing things in the wrong place and just create extra confusion.
I am not sure I agree: numpy is often compiled without LAPACK support, as it is not necessary. On the other hand, scipy is always compiled with LAPACK. Thus this makes more sense in scipy.
Well, numpy.dot already contains multiple fallback cases for when it is compiled with or without BLAS. So I'm +1 on just making this an improvement to numpy.dot. I don't think there is ever a time when you would not want to use this (provided the output-order issue is fixed), and it doesn't make sense not to let old code take advantage of the speed improvement.
BTW, something this doesn't fix is that you still have to write np.dot(x.conjugate().T, x) in the complex case, which needlessly copies the data, since LAPACK can do the conjugation itself.
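For reference, the fblas wrappers can already request that: trans_a=2 maps to the BLAS 'C' (conjugate transpose) option, so a Gram matrix of a complex x needs no explicit conjugated copy (a sketch, assuming the scipy gemm wrapper):

```python
import numpy as np
from scipy.linalg import get_blas_funcs

x = np.asfortranarray(np.random.randn(50, 3) + 1j * np.random.randn(50, 3))
gemm = get_blas_funcs('gemm', (x, x))

# trans_a=2 asks the BLAS for op(a) = conjugate-transpose, so
# x.conjugate().T is never materialized on the Python side.
gram = gemm(alpha=1.0, a=x, b=x, trans_a=2)

print(gram.shape)  # (3, 3)
```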
DS
G

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Thu, Nov 8, 2012 at 12:12 PM, Dag Sverre Seljebotn d.s.seljebotn@astro.uio.no wrote:
Well, numpy.dot already contains multiple fallback cases for when it is compiled with or without BLAS. So I'm +1 on just making this an improvement to numpy.dot. I don't think there is ever a time when you would not want to use this (provided the output-order issue is fixed), and it doesn't make sense not to let old code take advantage of the speed improvement.
Indeed, there is no reason not to make this available in NumPy.
Nicolas, can you prepare a patch for numpy?
David
On Thu, Nov 8, 2012 at 7:06 AM, David Cournapeau cournape@gmail.com wrote:
Indeed, there is no reason not to make this available in NumPy.
Nicolas, can you prepare a patch for numpy ?
+1, I agree, this should be a fix in numpy, not scipy.
Be Well Anthony
Hi,
I also think it should go into numpy.dot and that the output order should not be changed.
A new point: what about the additional overhead for small ndarrays? To remove it, I would suggest putting this code into the C function that does the actual work (at least, from memory, it is a C function, not a Python one).
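A rough way to quantify that per-call overhead (a sketch; the wrapper below is only a stand-in for the snippet's Python-level dispatch, and the numbers vary by machine):

```python
import timeit
import numpy as np

a = np.ones((2, 2))

def wrapped_dot(x, y):
    # Mimics the snippet's dispatch cost: a few flag lookups before
    # delegating to the real product.
    if x.flags.c_contiguous and y.flags.c_contiguous:
        return np.dot(x, y)
    return np.dot(x, y)

t_raw = timeit.timeit(lambda: np.dot(a, a), number=100000)
t_wrap = timeit.timeit(lambda: wrapped_dot(a, a), number=100000)
print(t_raw, t_wrap)  # the extra Python work shows up on 2x2 inputs
```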
HTH
Fred
Thanks for all the responses folks. This is indeed a nice problem to solve.
Few points:

I. Change the order from 'F' to 'C': I'll look into it.

II. Integration with scipy / numpy: opinions are diverging here. Let's wait a bit to get more responses on what people think. One thing though: I'd need the same functionality as get_blas_funcs in numpy. Since numpy does not require LAPACK, which functions can I get?

III. Complex arrays: I unfortunately don't have enough knowledge here. If someone could propose a fix, that'd be great.

IV. C: writing this in C sounds like a good idea. I'm not sure I'd be the right person to do it, though.

V. Patch in numpy: I'd love to do that, and to learn to do it as a byproduct. Let's make sure we agree that this can go in numpy first, and that all the FIXMEs can be fixed. Although I guess we can resolve the FIXMEs using git.
Let me know how you'd like to proceed,
Thanks!
FIXMEs:
- Fix for ndim != 2
- Fix for dtype == np.complex*
- Fix order of output array
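For the ndim != 2 FIXME, here is one way to reduce the general case to a single 2-D product, following np.dot's rule (sum over the last axis of A and the second-to-last of B). The helper name is made up, and np.dot itself stands in for the 2-D kernel (a sketch; assumes both operands have ndim >= 2):

```python
import numpy as np

def dot_nd(A, B):
    # np.dot contracts A's last axis with B's second-to-last axis and
    # returns shape A.shape[:-1] + B.shape[:-2] + B.shape[-1:].
    if A.ndim == B.ndim == 2:
        return np.dot(A, B)
    A2 = A.reshape(-1, A.shape[-1])                      # flatten leading axes
    B2 = np.moveaxis(B, -2, 0).reshape(B.shape[-2], -1)  # contraction axis first
    out = np.dot(A2, B2)                                 # one big 2-D GEMM
    return out.reshape(A.shape[:-1] + B.shape[:-2] + B.shape[-1:])

A = np.random.randn(2, 3, 4)
B = np.random.randn(5, 4, 6)
print(dot_nd(A, B).shape)  # (2, 3, 5, 6)
```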
I've made the necessary changes to get the proper order for the output array. Also a pep8 pass and some tests (the FIXMEs are in failing tests): http://pastebin.com/M8TfbURi
-n
Well, hinted by what Frédéric said, I looked at the C-level dot function. Quite verbose!
But starting at line 757, we can see that it shouldn't be too much work to fix that bug (there is even a comment there stating just that): https://github.com/numpy/numpy/blob/master/numpy/core/blasdot/_dotblas.c#L75... I now think that would be the cleanest approach.

This would only work for gemm, though. I don't know what the benefit is for gemv, for instance, but we should make that kind of change everywhere we can. The evil PyArray_Copy is there twice, and that's what we want to get rid of.
I'm not sure, but it looks to me that removing the copy and doing the following would do the work:

    Order = CblasRowMajor;
    Trans1 = CblasNoTrans;
    Trans2 = CblasNoTrans;
    if (!PyArray_ISCONTIGUOUS(ap1)) {
        Trans1 = CblasTrans;
    }
    if (!PyArray_ISCONTIGUOUS(ap2)) {
        Trans2 = CblasTrans;
    }

It might be too easy to be true.
Hey,
On Thu, 2012-11-08 at 14:44 -0800, Nicolas SCHEFFER wrote:
Sounds nice, though don't forget that the array may also be neither C- nor F-contiguous, in which case you need a copy in any case. So it would probably be more like:
    if (PyArray_IS_C_CONTIGUOUS(ap1)) {
        Trans1 = CblasNoTrans;
    }
    else if (PyArray_IS_F_CONTIGUOUS(ap1)) {
        Trans1 = CblasTrans;
    }
    else {
        Trans1 = CblasNoTrans;
        PyObject *new = PyArray_Copy(ap1);
        Py_DECREF(ap1);
        ap1 = (PyArrayObject *)new;
    }
Regards,
Sebastian
On Fri, 2012-11-09 at 00:24 +0100, Sebastian Berg wrote:
Well, of course I forgot error checking there, and maybe you need to set some of the other parameters differently, but it looks like it's probably that easy, and I am sure everyone will welcome a PR with such changes.
Regards,
Sebastian
Thanks Sebastian, I didn't think of that.
Well I went ahead and tried the change, and it's indeed straightforward.
I've run some tests, among which: nosetests numpy/numpy/core/tests/test_blasdot.py, and it looks OK. I'm assuming this is good news.

I've copy-pasted the diff below, but I have it in my branch and can create a PR if we agree on it. I still cannot believe it's that easy (well, this has been bugging me a while... ;)). So I wouldn't mind waiting a day or two to see reactions on the list before moving ahead.
diff --git a/numpy/core/blasdot/_dotblas.c b/numpy/core/blasdot/_dotblas.c
index c73dd6a..2b4be7c 100644
--- a/numpy/core/blasdot/_dotblas.c
+++ b/numpy/core/blasdot/_dotblas.c
@@ -770,7 +770,7 @@ dotblas_matrixproduct(PyObject *NPY_UNUSED(dummy), PyObject *args, PyObject* kwa
      * using appropriate values of Order, Trans1, and Trans2.
      */

-    if (!PyArray_ISCONTIGUOUS(ap2)) {
+    if (!PyArray_IS_C_CONTIGUOUS(ap2) && !PyArray_IS_F_CONTIGUOUS(ap2)) {
         PyObject *new = PyArray_Copy(ap2);

         Py_DECREF(ap2);
@@ -779,7 +779,7 @@ dotblas_matrixproduct(PyObject *NPY_UNUSED(dummy), PyObject *args, PyObject* kwa
             goto fail;
         }
     }
-    if (!PyArray_ISCONTIGUOUS(ap1)) {
+    if (!PyArray_IS_C_CONTIGUOUS(ap1) && !PyArray_IS_F_CONTIGUOUS(ap1)) {
         PyObject *new = PyArray_Copy(ap1);

         Py_DECREF(ap1);
@@ -800,6 +800,19 @@ dotblas_matrixproduct(PyObject *NPY_UNUSED(dummy), PyObject *args, PyObject* kwa
     lda = (PyArray_DIM(ap1, 1) > 1 ? PyArray_DIM(ap1, 1) : 1);
     ldb = (PyArray_DIM(ap2, 1) > 1 ? PyArray_DIM(ap2, 1) : 1);
     ldc = (PyArray_DIM(ret, 1) > 1 ? PyArray_DIM(ret, 1) : 1);
+
+    /*
+     * Avoid temporary copies for arrays in Fortran order
+     */
+    if (PyArray_IS_F_CONTIGUOUS(ap1)) {
+        Trans1 = CblasTrans;
+        lda = (PyArray_DIM(ap1, 0) > 1 ? PyArray_DIM(ap1, 0) : 1);
+    }
+    if (PyArray_IS_F_CONTIGUOUS(ap2)) {
+        Trans2 = CblasTrans;
+        ldb = (PyArray_DIM(ap2, 0) > 1 ? PyArray_DIM(ap2, 0) : 1);
+    }
+
     if (typenum == NPY_DOUBLE) {
         cblas_dgemm(Order, Trans1, Trans2, L, N, M,
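A quick pure-Python check of the invariant the patch has to preserve: identical values and C-ordered output for every combination of operand memory orders (this runs on any numpy; on an unpatched build it merely exercises the copy path):

```python
import numpy as np

A = np.random.randn(60, 60)
ref = np.dot(np.ascontiguousarray(A), np.ascontiguousarray(A))

for a in (np.ascontiguousarray(A), np.asfortranarray(A)):
    for b in (np.ascontiguousarray(A), np.asfortranarray(A)):
        r = np.dot(a, b)
        assert np.allclose(r, ref)   # same values regardless of order
        assert r.flags.c_contiguous  # output order must stay C
print("ok")
```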
On Thu, Nov 8, 2012 at 3:58 PM, Sebastian Berg sebastian@sipsolutions.net wrote:
On Fri, 2012-11-09 at 00:24 +0100, Sebastian Berg wrote:
Hey,
On Thu, 2012-11-08 at 14:44 -0800, Nicolas SCHEFFER wrote:
Well, hinted by what Fabien said, I looked at the C level dot function. Quite verbose!
But starting line 757, we can see that it shouldn't be too much work to fix that bug (well there is even a comment there that states just that) https://github.com/numpy/numpy/blob/master/numpy/core/blasdot/_dotblas.c#L75... I now think that should be the cleanest.
This would only work for gemm though. I don't know what the benefit is for gemv, for instance, but we should make that kind of change everywhere we can. The evil PyArray_Copy is there twice, and that's what we want to get rid of.
I'm not sure, but it looks to me that removing the copy and doing the following would do the work:

    Order = CblasRowMajor;
    Trans1 = CblasNoTrans;
    Trans2 = CblasNoTrans;
    if (!PyArray_ISCONTIGUOUS(ap1)) {
        Trans1 = CblasTrans;
    }
    if (!PyArray_ISCONTIGUOUS(ap2)) {
        Trans2 = CblasTrans;
    }

might be too easy to be true.
Sounds nice, though don't forget that the array may also be neither C- or F-Contiguous, in which case you need a copy in any case. So it would probably be more like:
    if (PyArray_IS_C_CONTIGUOUS(ap1)) {
        Trans1 = CblasNoTrans;
    }
    else if (PyArray_IS_F_CONTIGUOUS(ap1)) {
        Trans1 = CblasTrans;
    }
    else {
        Trans1 = CblasNoTrans;
        PyObject *new = PyArray_Copy(ap1);
        Py_DECREF(ap1);
        ap1 = (PyArrayObject *)new;
    }
Well, of course I forgot error checking there, and maybe you need to set some of the other parameters differently, but it looks like it's probably that easy, and I am sure everyone will welcome a PR with such changes.
Regards,
Sebastian
On Thu, Nov 8, 2012 at 12:06 PM, Nicolas SCHEFFER scheffer.nicolas@gmail.com wrote:
I've made the necessary changes to get the proper order for the output array. Also, a pass of pep8 and some tests (fixmes are in failing tests) http://pastebin.com/M8TfbURi
-n
On Thu, Nov 8, 2012 at 11:38 AM, Nicolas SCHEFFER scheffer.nicolas@gmail.com wrote:
Thanks for all the responses folks. This is indeed a nice problem to solve.
Few points:
I. Change the order from 'F' to 'C': I'll look into it.
II. Integration with scipy / numpy: opinions are diverging here. Let's wait a bit to get more responses on what people think. One thing though: I'd need the same functionality as get_blas_funcs in numpy. Since numpy does not require lapack, what functions can I get?
III. Complex arrays: I unfortunately don't have enough knowledge here. If someone could propose a fix, that'd be great.
IV. C: Writing this in C sounds like a good idea. I'm not sure I'd be the right person to do this though.
V. Patch in numpy: I'd love to do that and learn to do it as a byproduct. Let's make sure we agree this can go in numpy first and that all FIXMEs can be fixed. Although I guess we can resolve fixmes using git.
Let me know how you'd like to proceed,
Thanks!
FIXMEs:
- Fix for ndim != 2
- Fix for dtype == np.complex*
- Fix order of output array
On Thu, Nov 8, 2012 at 9:42 AM, Frédéric Bastien nouiz@nouiz.org wrote:
Hi,
I also think it should go into numpy.dot and that the output order should not be changed.
A new point: what about the additional overhead for small ndarrays? To remove this, I would suggest putting this code into the C function that does the actual work (at least, from memory, it is a C function, not a Python one).
HTH
Fred
On Thu, Nov 8, 2012 at 12:29 PM, Anthony Scopatz scopatz@gmail.com wrote:
> On Thu, Nov 8, 2012 at 7:06 AM, David Cournapeau cournape@gmail.com wrote:
>> On Thu, Nov 8, 2012 at 12:12 PM, Dag Sverre Seljebotn d.s.seljebotn@astro.uio.no wrote:
>> > On 11/08/2012 01:07 PM, Gael Varoquaux wrote:
>> >> On Thu, Nov 08, 2012 at 11:28:21AM +0000, Nathaniel Smith wrote:
>> >>> I think everyone would be very happy to see numpy.dot modified to do this automatically. But adding a scipy.dot IMHO would be fixing things in the wrong place and just create extra confusion.
>> >>
>> >> I am not sure I agree: numpy is often compiled without lapack support, as it is not necessary. On the other hand scipy is always compiled with lapack. Thus this makes more sense in scipy.
>> >
>> > Well, numpy.dot already contains multiple fallback cases for when it is compiled with BLAS and not. So I'm +1 on just making this an improvement on numpy.dot. I don't think there's a time when you would not want to use this (provided the output order issue is fixed), and it doesn't make sense to not have old code take advantage of the speed improvement.
>>
>> Indeed, there is no reason not to make this available in NumPy.
>>
>> Nicolas, can you prepare a patch for numpy?
>
> +1, I agree, this should be a fix in numpy, not scipy.
>
> Be Well
> Anthony
Hi,
I suspect the current tests are not enough. You need to test all the combinations of the 3 inputs with these strides:
- c-contiguous
- f-contiguous
- something else, like strided.
Also, try matrices with a shape of 1 in each dimension. Not all BLAS libraries accept the strides that numpy uses in those cases. Also, not all BLAS versions accept the same stuff, so if this isn't in the current version, there will probably be some adjustments later on that side. Which BLAS do you use? I think ATLAS was one that was causing problems.
When we did this in Theano, it was more complicated than this diff... But much of the code is boilerplate.
Fred
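[Editor's note: Fred's checklist translates directly into a parametrized sweep. A sketch of what such a test could look like, checked against an einsum reference (the `make_layout` helper and the shape list are illustrative, not from the thread):]

import itertools
import numpy as np

def make_layout(a, layout):
    # "C"/"F": contiguous copy; "strided": a non-contiguous view onto
    # every other element of a larger buffer.
    if layout in ("C", "F"):
        return np.asarray(a, order=layout)
    buf = np.zeros((2 * a.shape[0], 2 * a.shape[1]), dtype=a.dtype)
    buf[::2, ::2] = a
    return buf[::2, ::2]

rng = np.random.RandomState(0)
layouts = ["C", "F", "strided"]
# include degenerate shape-1 dimensions, which trip up some BLAS builds
shapes = [(3, 4, 2), (1, 3, 2), (3, 1, 2), (3, 4, 1), (1, 1, 1)]

for (m, k, n), la, lb in itertools.product(shapes, layouts, layouts):
    a = make_layout(rng.randn(m, k), la)
    b = make_layout(rng.randn(k, n), lb)
    # reference result that never goes through the BLAS fast path
    expected = np.einsum('ik,kj->ij', a, b)
    np.testing.assert_allclose(np.dot(a, b), expected)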
Fred,
Thanks for the advice. The code will only affect the part in _dotblas.c where gemm is called. There are tons of checks before that which make sure both matrices are of ndim 2. We should check, though, whether we can do these tricks in other parts of the function.
Otherwise:
- I've built against ATLAS 3.10.
- I'm happy to add a couple more tests for C- and F-contiguous inputs. I'm not sure how to get the third type (strided), would you have an example?
The following test, for instance, checks integrity against multiarray.dot, which I believe is the default when numpy is not compiled with BLAS. Dot is a hard function to test imho, so if anybody has ideas on what kind of tests they'd like to see, please let me know.
If that's ok I might now be able to:
- Check for more bugs; I need to dig a bit more into the gemm call and make sure everything is ok.
- Create an issue on github and link to this discussion.
- Make a commit in a separate branch.
- Move forward like that.
==
import numpy as np
from numpy.testing import assert_almost_equal


def test_dot_regression():
    """Test numpy dot by comparing with multiarray dot"""
    np.random.seed(7)
    a = np.random.randn(3, 3)
    b = np.random.randn(3, 2)
    c = np.random.randn(2, 3)

    _dot = np.core.multiarray.dot

    assert_almost_equal(np.dot(a, a), _dot(a, a))
    assert_almost_equal(np.dot(b, c), _dot(b, c))
    assert_almost_equal(np.dot(b.T, c.T), _dot(b.T, c.T))

    assert_almost_equal(np.dot(a.T, a), _dot(a.T, a))
    assert_almost_equal(np.dot(a, a.T), _dot(a, a.T))
    assert_almost_equal(np.dot(a.T, a.T), _dot(a.T, a.T))
On Fri, Nov 9, 2012 at 6:18 AM, Nicolas SCHEFFER scheffer.nicolas@gmail.com wrote:
Fred,
Thanks for the advice. The code will only affect the part in _dotblas.c where gemm is called. There's tons of check before that make sure both matrices are of ndim 2. We should check though if we can do these tricks in other parts of the function.
Otherwise:
- I've built against ATLAS 3.10
- I'm happy to add a couple more test for C and F-contiguous. I'm not
sure how to get the third type (strided), would you have an example?
def with_memory_order(a, order):
    assert order in ("C", "F", "discontig")
    assert a.ndim == 2
    if order in ("C", "F"):
        return np.asarray(a, order=order)
    else:
        buf = np.empty((a.shape[0] * 2, a.shape[1] * 2), dtype=a.dtype)
        buf[::2, ::2] = a
        # This returns a view onto every other element of 'buf':
        result = buf[::2, ::2]
        assert not result.flags.c_contiguous and not result.flags.f_contiguous
        return result
The following test for instance checks integrity against multiarray.dot, which I believe is default when not compiled with BLAS. Dot is a hard function to test imho, so if anybody has ideas on what kind of test they'd like to see, please let me know.
If that's ok I might now be able to:
- Check for more bugs, I need to dig a bit more in the gemm call, make
sure everything is ok.
- Create an issue on github and link to this discussion
- Make a commit in a seperate branch
- Move forward like that.
==
import numpy as np
from numpy.testing import assert_almost_equal


def test_dot_regression():
    """Test numpy dot by comparing with multiarray dot"""
    np.random.seed(7)
    a = np.random.randn(3, 3)
    b = np.random.randn(3, 2)
    c = np.random.randn(2, 3)

    _dot = np.core.multiarray.dot
    assert_almost_equal(np.dot(a, a), _dot(a, a))
    assert_almost_equal(np.dot(b, c), _dot(b, c))
    assert_almost_equal(np.dot(b.T, c.T), _dot(b.T, c.T))
    assert_almost_equal(np.dot(a.T, a), _dot(a.T, a))
    assert_almost_equal(np.dot(a, a.T), _dot(a, a.T))
    assert_almost_equal(np.dot(a.T, a.T), _dot(a.T, a.T))
You should check that the result is C-contiguous in all cases too.
for a_order in ("C", "F", "discontig"):
    for b_order in ("C", "F", "discontig"):
        this_a = with_memory_order(a, a_order)
        this_b = with_memory_order(b, b_order)
        result = np.dot(this_a, this_b)
        assert_almost_equal(result, expected)
        assert result.flags.c_contiguous
You could also wrap the above in yet another loop to try a few different combinations of a and b matrices (perhaps after sticking the code into a utility function, like run_dot_tests(a, b, expected), so the indentation doesn't get out of hand ;-)). Then you can easily test some of the edge cases, like Nx1 matrices.
-n
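[Editor's note: Nathaniel's suggestion might look like the sketch below. `run_dot_tests` is his hypothetical name; the layout helper repeats his `with_memory_order` from the message above so the snippet is self-contained, and the shape list is illustrative:]

import numpy as np
from numpy.testing import assert_almost_equal

def with_memory_order(a, order):
    # "C"/"F" return contiguous copies; "discontig" returns a view onto
    # every other element of a larger buffer (neither C- nor F-contiguous).
    if order in ("C", "F"):
        return np.asarray(a, order=order)
    buf = np.empty((a.shape[0] * 2, a.shape[1] * 2), dtype=a.dtype)
    buf[::2, ::2] = a
    return buf[::2, ::2]

def run_dot_tests(a, b, expected):
    # All nine layout combinations; the result must match the reference
    # and come back C-contiguous regardless of the input layouts.
    for a_order in ("C", "F", "discontig"):
        for b_order in ("C", "F", "discontig"):
            result = np.dot(with_memory_order(a, a_order),
                            with_memory_order(b, b_order))
            assert_almost_equal(result, expected)
            assert result.flags.c_contiguous

# A few operand shapes, including the Nx1 edge cases.
rng = np.random.RandomState(7)
for m, k, n in [(3, 3, 3), (3, 2, 4), (4, 1, 5)]:
    a = rng.randn(m, k)
    b = rng.randn(k, n)
    run_dot_tests(a, b, np.einsum('ik,kj->ij', a, b))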
On Thu, Nov 8, 2012 at 5:34 PM, Frédéric Bastien nouiz@nouiz.org wrote:
Hi,
I suspect the current tests are not enought. You need to test all the combination for the 3 inputs with thoses strides:
c-contiguous f-contiguous something else like strided.
Also, try with matrix with shape of 1 in each dimensions. Not all blas libraries accept the strides that numpy use in that cases. Also, not all blas version accept the same stuff, so if this isn't in the current version, there will be probably some adjustment later on that side. What blas do you use? I think ATLAS was one that was causing problem.
When we did this in Theano, it was more complicated then this diff... But much of the code is boillerplate code.
Fred
On Thu, Nov 8, 2012 at 8:03 PM, Nicolas SCHEFFER scheffer.nicolas@gmail.com wrote:
Thanks Sebastien, didn't think of that.
Well I went ahead and tried the change, and it's indeed straightforward.
I've run some tests, among which: nosetests numpy/numpy/core/tests/test_blasdot.py and it looks ok. I'm assuming this is good news.
I've copy-pasting the diff below, but I have that in my branch and can create a PR if we agree on it. I still cannot believe it's that easy (well this has been bugging me a while... ;)) So I wouldn't mind waiting a day or two to see reactions on the list before moving ahead.
diff --git a/numpy/core/blasdot/_dotblas.c b/numpy/core/blasdot/_dotblas.c index c73dd6a..2b4be7c 100644 --- a/numpy/core/blasdot/_dotblas.c +++ b/numpy/core/blasdot/_dotblas.c @@ -770,7 +770,7 @@ dotblas_matrixproduct(PyObject *NPY_UNUSED(dummy), PyObject *args, PyObject* kwa * using appropriate values of Order, Trans1, and Trans2. */
if (!PyArray_ISCONTIGUOUS(ap2)) {
if (!PyArray_IS_C_CONTIGUOUS(ap2) &&
!PyArray_IS_F_CONTIGUOUS(ap2)) { PyObject *new = PyArray_Copy(ap2);
Py_DECREF(ap2);
@@ -779,7 +779,7 @@ dotblas_matrixproduct(PyObject *NPY_UNUSED(dummy), PyObject *args, PyObject* kwa goto fail; } }
if (!PyArray_ISCONTIGUOUS(ap1)) {
if (!PyArray_IS_C_CONTIGUOUS(ap1) &&
!PyArray_IS_F_CONTIGUOUS(ap1)) { PyObject *new = PyArray_Copy(ap1);
Py_DECREF(ap1);
@@ -800,6 +800,19 @@ dotblas_matrixproduct(PyObject *NPY_UNUSED(dummy), PyObject *args, PyObject* kwa lda = (PyArray_DIM(ap1, 1) > 1 ? PyArray_DIM(ap1, 1) : 1); ldb = (PyArray_DIM(ap2, 1) > 1 ? PyArray_DIM(ap2, 1) : 1); ldc = (PyArray_DIM(ret, 1) > 1 ? PyArray_DIM(ret, 1) : 1);
/*
* Avoid temporary copies for arrays in Fortran order
*/
if (PyArray_IS_F_CONTIGUOUS(ap1)) {
Trans1 = CblasTrans;
lda = (PyArray_DIM(ap1, 0) > 1 ? PyArray_DIM(ap1, 0) : 1);
}
if (PyArray_IS_F_CONTIGUOUS(ap2)) {
Trans2 = CblasTrans;
ldb = (PyArray_DIM(ap2, 0) > 1 ? PyArray_DIM(ap2, 0) : 1);
}
if (typenum == NPY_DOUBLE) { cblas_dgemm(Order, Trans1, Trans2, L, N, M,
On Thu, Nov 8, 2012 at 3:58 PM, Sebastian Berg sebastian@sipsolutions.net wrote:
On Fri, 2012-11-09 at 00:24 +0100, Sebastian Berg wrote:
Hey,
On Thu, 2012-11-08 at 14:44 -0800, Nicolas SCHEFFER wrote:
Well, hinted by what Fabien said, I looked at the C level dot function. Quite verbose!
But starting line 757, we can see that it shouldn't be too much work to fix that bug (well there is even a comment there that states just that)
https://github.com/numpy/numpy/blob/master/numpy/core/blasdot/_dotblas.c#L75... I now think that should be the cleanest.
This would only work for gemm though. I don't know what the benefit is for gemv for instance, but we should make that kind of changes everywhere we can. The evil PyArray_Copy is there twice and that's what we want to get rid of.
I'm not sure, but it looks to me that removing the copy and doing the following would do the work: Order = CblasRowMajor; Trans1 = CblasNoTrans; Trans2 = CblasNoTrans; if (!PyArray_ISCONTIGUOUS(ap1)) { Trans1 = CblasTrans; } if (!PyArray_ISCONTIGUOUS(ap2)) { Trans2 = CblasTrans; } might be too easy to be true.
Sounds nice, though don't forget that the array may also be neither C- or F-Contiguous, in which case you need a copy in any case. So it would probably be more like:
if (PyArray_IS_C_CONTIGUOUS(ap1)) { Trans1 = CblasNoTrans; } else if (PyArray_IS_F_CONTIGUOUS(ap1)) { Trans1 = CblasTrans; } else { Trans1 = CblasNoTrans; PyObject *new = PyArray_Copy(ap1); Py_DECREF(ap1); ap1 = (PyArrayObject *)new; }
Well, of course I forgot error checking there, and maybe you need to set some of the other parameters differently, but it looks like its probably that easy, and I am sure everyone will welcome a PR with such changes.
Regards,
Sebastian
On Thu, Nov 8, 2012 at 12:06 PM, Nicolas SCHEFFER scheffer.nicolas@gmail.com wrote:

I've made the necessary changes to get the proper order for the output array. Also, a pass of pep8 and some tests (fixmes are in failing tests): http://pastebin.com/M8TfbURi

-n

On Thu, Nov 8, 2012 at 11:38 AM, Nicolas SCHEFFER scheffer.nicolas@gmail.com wrote:

Thanks for all the responses folks. This is indeed a nice problem to solve.

Few points:
I. Change the order from 'F' to 'C': I'll look into it.
II. Integration with scipy / numpy: opinions are diverging here. Let's wait a bit to get more responses on what people think. One thing though: I'd need the same functionality as get_blas_funcs in numpy. Since numpy does not require lapack, what functions can I get?
III. Complex arrays: I unfortunately don't have enough knowledge here. If someone could propose a fix, that'd be great.
IV. C: Writing this in C sounds like a good idea. I'm not sure I'd be the right person to do this though.
V. Patch in numpy: I'd love to do that and learn to do it as a byproduct. Let's make sure we agree this can go in numpy first and that all FIXMEs can be fixed. Although I guess we can resolve fixmes using git.

Let me know how you'd like to proceed,

Thanks!

FIXMEs:
- Fix for ndim != 2
- Fix for dtype == np.complex*
- Fix order of output array

On Thu, Nov 8, 2012 at 9:42 AM, Frédéric Bastien nouiz@nouiz.org wrote:

Hi,

I also think it should go into numpy.dot and that the output order should not be changed.

A new point: what about the additional overhead for small ndarrays? To remove this, I would suggest putting this code into the C function that does the actual work (at least, from memory, it is a C function, not a Python one).

HTH

Fred

On Thu, Nov 8, 2012 at 12:29 PM, Anthony Scopatz scopatz@gmail.com wrote:

On Thu, Nov 8, 2012 at 7:06 AM, David Cournapeau cournape@gmail.com wrote:

On Thu, Nov 8, 2012 at 12:12 PM, Dag Sverre Seljebotn d.s.seljebotn@astro.uio.no wrote:

On 11/08/2012 01:07 PM, Gael Varoquaux wrote:

On Thu, Nov 08, 2012 at 11:28:21AM +0000, Nathaniel Smith wrote:

I think everyone would be very happy to see numpy.dot modified to do this automatically. But adding a scipy.dot IMHO would be fixing things in the wrong place and just create extra confusion.

I am not sure I agree: numpy is often compiled without lapack support, as it is not necessary. On the other hand scipy is always compiled with lapack. Thus this makes more sense in scipy.

Well, numpy.dot already contains multiple fallback cases for when it is compiled with BLAS and not. So I'm +1 on just making this an improvement on numpy.dot. I don't think there's a time when you would not want to use this (provided the output order issue is fixed), and it doesn't make sense to not have old code take advantage of the speed improvement.

Indeed, there is no reason not to make this available in NumPy.

Nicolas, can you prepare a patch for numpy?

+1, I agree, this should be a fix in numpy, not scipy.

Be Well
Anthony

David
NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Fri, Nov 9, 2012 at 3:32 AM, Nathaniel Smith njs@pobox.com wrote:
On Fri, Nov 9, 2012 at 6:18 AM, Nicolas SCHEFFER scheffer.nicolas@gmail.com wrote:
Fred,
Thanks for the advice. The code will only affect the part in _dotblas.c where gemm is called. There are tons of checks before that which make sure both matrices are of ndim 2. We should check, though, whether we can do these tricks in other parts of the function.
Otherwise:
- I've built against ATLAS 3.10
- I'm happy to add a couple more tests for C- and F-contiguous arrays. I'm not sure how to get the third type (strided); would you have an example?
def with_memory_order(a, order):
    assert order in ("C", "F", "discontig")
    assert a.ndim == 2
    if order in ("C", "F"):
        return np.asarray(a, order=order)
    else:
        buf = np.empty((a.shape[0] * 2, a.shape[1] * 2), dtype=a.dtype)
        buf[::2, ::2] = a
        # This returns a view onto every other element of 'buf':
        result = buf[::2, ::2]
        assert not result.flags.c_contiguous and not result.flags.f_contiguous
        return result
The following test, for instance, checks integrity against multiarray.dot, which I believe is the default when numpy is not compiled with BLAS. Dot is a hard function to test imho, so if anybody has ideas on what kind of tests they'd like to see, please let me know.
If that's ok I might now be able to:
- Check for more bugs; I need to dig a bit more into the gemm call to make sure everything is ok.
- Create an issue on github and link to this discussion
- Make a commit in a separate branch
- Move forward like that.
==
import numpy as np
from time import time
from numpy.testing import assert_almost_equal
def test_dot_regression():
    """Test numpy dot by comparing with multiarray dot."""
    np.random.seed(7)
    a = np.random.randn(3, 3)
    b = np.random.randn(3, 2)
    c = np.random.randn(2, 3)

    _dot = np.core.multiarray.dot
    assert_almost_equal(np.dot(a, a), _dot(a, a))
    assert_almost_equal(np.dot(b, c), _dot(b, c))
    assert_almost_equal(np.dot(b.T, c.T), _dot(b.T, c.T))
    assert_almost_equal(np.dot(a.T, a), _dot(a.T, a))
    assert_almost_equal(np.dot(a, a.T), _dot(a, a.T))
    assert_almost_equal(np.dot(a.T, a.T), _dot(a.T, a.T))
You should check that the result is C-contiguous in all cases too.
for a_order in ("C", "F", "discontig"):
    for b_order in ("C", "F", "discontig"):
        this_a = with_memory_order(a, a_order)
        this_b = with_memory_order(b, b_order)
        result = np.dot(this_a, this_b)
        assert_almost_equal(result, expected)
        assert result.flags.c_contiguous
You could also wrap the above in yet another loop to try a few different combinations of a and b matrices (perhaps after sticking the code into a utility function, like run_dot_tests(a, b, expected), so the indentation doesn't get out of hand ;-)). Then you can easily test some of the edge cases, like Nx1 matrices.
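A sketch of that run_dot_tests utility (the helper and the function name are just the suggestion above, not an existing numpy API):

```python
import numpy as np
from numpy.testing import assert_almost_equal

def with_memory_order(a, order):
    # Helper from earlier in the thread: return 'a' with the requested
    # memory layout; "discontig" is a strided view into a larger buffer.
    assert order in ("C", "F", "discontig")
    assert a.ndim == 2
    if order in ("C", "F"):
        return np.asarray(a, order=order)
    buf = np.empty((a.shape[0] * 2, a.shape[1] * 2), dtype=a.dtype)
    buf[::2, ::2] = a
    return buf[::2, ::2]  # view onto every other element of 'buf'

def run_dot_tests(a, b, expected):
    # Exercise every combination of input layouts; the result must match
    # 'expected' and come back C-contiguous in all cases.
    for a_order in ("C", "F", "discontig"):
        for b_order in ("C", "F", "discontig"):
            result = np.dot(with_memory_order(a, a_order),
                            with_memory_order(b, b_order))
            assert_almost_equal(result, expected)
            assert result.flags.c_contiguous

np.random.seed(7)
a = np.random.randn(3, 3)
b = np.random.randn(3, 2)
run_dot_tests(a, b, np.dot(a, b))  # expected computed once up front
```

Edge cases like Nx1 matrices then just become more (a, b) pairs fed to run_dot_tests.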
I agree that tests are needed for the Nx1 and variant cases. I have seen BLAS errors raised with some BLAS versions.
You also need to test with the output provided, so there are 3 loops:
for a_order in ("C", "F", "discontig", "neg"):
    for b_order in ("C", "F", "discontig", "neg"):
        for c_order in ("C", "F", "discontig", "neg"):
I also added the stride type "neg"; I'm not sure if it is needed, but it covers other corner cases: neg => result = buf[::-1, ::-1]
I just looked again at our code and there is another constraint: that the strides are a multiple of the elemsize. Theano does not support unaligned arrays, but numpy does, so this needs a test too. You can make an unaligned array like this:
dtype = "b1,f4"
a = numpy.empty(10000, dtype=dtype)['f1']
I just saw the strides problems that affect only some BLAS versions. I think the best explanation is our code:
/* create appropriate strides for malformed matrices that are row or column
 * vectors, or empty matrices.
 * In that case, the value of the stride does not really matter, but
 * some versions of BLAS insist that:
 *  - they are not smaller than the number of elements in the array,
 *  - they are not 0.
 */
sx_0 = (Nx[0] > 1) ? Sx[0]/type_size : (Nx[1] + 1);
sx_1 = (Nx[1] > 1) ? Sx[1]/type_size : (Nx[0] + 1);
sy_0 = (Ny[0] > 1) ? Sy[0]/type_size : (Ny[1] + 1);
sy_1 = (Ny[1] > 1) ? Sy[1]/type_size : (Ny[0] + 1);
sz_0 = (Nz[0] > 1) ? Sz[0]/type_size : (Nz[1] + 1);
sz_1 = (Nz[1] > 1) ? Sz[1]/type_size : (Nz[0] + 1);
So this calls for tests with empty matrices too.
HTH
Fred
On Fri, Nov 9, 2012 at 2:45 PM, Frédéric Bastien nouiz@nouiz.org wrote:
I just looked again at our code and there is another constraint: that the strides are a multiple of the elemsize. Theano does not support unaligned arrays, but numpy does, so this needs a test too. You can make an unaligned array like this:
dtype = "b1,f4"
a = numpy.empty(10000, dtype=dtype)['f1']
I think you're mixing up two issues here... requiring that the strides match the itemsize is part of the requirement for C- and F-contiguity; the example code you give here produces a non-contiguous array, so it's already checked for by the function we're talking about.
However, there can be contiguous arrays that are not aligned, like:
In [25]: a = np.empty(100, dtype="i1")[1:-3].view(np.float32)
In [26]: a.flags.c_contiguous
Out[26]: True

In [27]: a.flags.aligned
Out[27]: False
I suspect the np.dot code that Nicolas is looking at already checks for the ALIGNED flag and makes a copy if necessary, but it would be good to have an array like this in the tests to be sure.
-n
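A test along those lines might look like this (buffer sizes are arbitrary; the point is a C-contiguous 2-D view starting at a one-byte offset):

```python
import numpy as np
from numpy.testing import assert_almost_equal

# Build a 2-D array that is C-contiguous but NOT aligned, by viewing an
# int8 buffer at an offset of one byte:
buf = np.empty(97, dtype="i1")
a = buf[1:].view(np.float32).reshape(4, 6)   # 96 bytes -> 24 float32
assert a.flags.c_contiguous
assert not a.flags.aligned

# np.dot must handle such inputs (copying internally if needed):
a[...] = np.random.randn(4, 6)
assert_almost_equal(np.dot(a, a.T), np.dot(a.copy(), a.copy().T))
```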
On Fri, Nov 9, 2012 at 10:20 AM, Nathaniel Smith njs@pobox.com wrote:
You are right, my test wasn't good and your example is good.
Nicolas, I hope I don't demotivate you with all these details; this needs to be done. When dealing with BLAS libs, the devil is in the details... and not all libs accept the same inputs... we hit a few of those. I'm just passing on our experience to make the implementation easier and quicker to stabilize.
Fred
Hi, It's really good to see this work being done.
It would be great if this could somehow also be put into the np.einsum function, which currently doesn't even use BLAS and is consequently substantially slower than the current np.dot (timings at http://mail.scipy.org/pipermail/numpy-discussion/2012-October/064259.html).
I really do think that the Einstein summation convention is a really clear way of doing array operations, and its use should be encouraged.
--George
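For concreteness, the einsum spelling of a plain matrix product, which is equivalent to np.dot but is currently evaluated by einsum's own C loops rather than BLAS:

```python
import numpy as np

a = np.random.randn(3, 4)
b = np.random.randn(4, 2)

# 'ij,jk->ik' is exactly the matrix product in Einstein notation;
# routing this case to gemm would give einsum the same BLAS speed.
c = np.einsum('ij,jk->ik', a, b)
assert np.allclose(c, np.dot(a, b))
```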
On Thu, Nov 8, 2012 at 5:34 PM, Frédéric Bastien nouiz@nouiz.org wrote:
Hi,
I suspect the current tests are not enough. You need to test all the combinations of the 3 inputs with these strides:
- c-contiguous
- f-contiguous
- something else, like strided.
Also, try with matrices with a shape of 1 in each dimension. Not all BLAS libraries accept the strides that numpy uses in those cases. Also, not all BLAS versions accept the same stuff, so if this isn't in the current version, there will probably be some adjustments later on that side. Which BLAS do you use? I think ATLAS was one that was causing problems.
When we did this in Theano, it was more complicated than this diff... But much of the code is boilerplate code.
Fred
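Fred's shape-of-1 and empty-matrix warnings can be turned into a concrete loop over degenerate shapes (shapes chosen arbitrarily for illustration; einsum serves as a non-BLAS reference):

```python
import numpy as np
from numpy.testing import assert_almost_equal

# Row/column vectors and empty matrices are the cases where some BLAS
# versions reject the strides numpy passes in.
shapes = [(1, 1), (1, 4), (4, 1), (0, 3), (3, 0)]
for m, k in shapes:
    for k2, n in shapes:
        if k != k2:
            continue  # only conformable products
        a = np.random.randn(m, k)
        b = np.random.randn(k, n)
        # einsum does not call BLAS, so it is a handy reference here:
        assert_almost_equal(np.dot(a, b), np.einsum('ij,jk->ik', a, b))
```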
On Thu, Nov 8, 2012 at 8:03 PM, Nicolas SCHEFFER scheffer.nicolas@gmail.com wrote:
Thanks Sebastian, I didn't think of that.
Well I went ahead and tried the change, and it's indeed straightforward.
I've run some tests, among which: nosetests numpy/numpy/core/tests/test_blasdot.py and it looks ok. I'm assuming this is good news.
I'm copy-pasting the diff below, but I have it in my branch and can create a PR if we agree on it. I still cannot believe it's that easy (well, this has been bugging me for a while... ;)), so I wouldn't mind waiting a day or two to see reactions on the list before moving ahead.
diff --git a/numpy/core/blasdot/_dotblas.c b/numpy/core/blasdot/_dotblas.c
index c73dd6a..2b4be7c 100644
--- a/numpy/core/blasdot/_dotblas.c
+++ b/numpy/core/blasdot/_dotblas.c
@@ -770,7 +770,7 @@ dotblas_matrixproduct(PyObject *NPY_UNUSED(dummy), PyObject *args, PyObject* kwa
      * using appropriate values of Order, Trans1, and Trans2.
      */
-    if (!PyArray_ISCONTIGUOUS(ap2)) {
+    if (!PyArray_IS_C_CONTIGUOUS(ap2) &&
+        !PyArray_IS_F_CONTIGUOUS(ap2)) {
         PyObject *new = PyArray_Copy(ap2);
         Py_DECREF(ap2);
@@ -779,7 +779,7 @@ dotblas_matrixproduct(PyObject *NPY_UNUSED(dummy), PyObject *args, PyObject* kwa
             goto fail;
         }
     }
-    if (!PyArray_ISCONTIGUOUS(ap1)) {
+    if (!PyArray_IS_C_CONTIGUOUS(ap1) &&
+        !PyArray_IS_F_CONTIGUOUS(ap1)) {
         PyObject *new = PyArray_Copy(ap1);
         Py_DECREF(ap1);
@@ -800,6 +800,19 @@ dotblas_matrixproduct(PyObject *NPY_UNUSED(dummy), PyObject *args, PyObject* kwa
     lda = (PyArray_DIM(ap1, 1) > 1 ? PyArray_DIM(ap1, 1) : 1);
     ldb = (PyArray_DIM(ap2, 1) > 1 ? PyArray_DIM(ap2, 1) : 1);
     ldc = (PyArray_DIM(ret, 1) > 1 ? PyArray_DIM(ret, 1) : 1);
+
+    /*
+     * Avoid temporary copies for arrays in Fortran order
+     */
+    if (PyArray_IS_F_CONTIGUOUS(ap1)) {
+        Trans1 = CblasTrans;
+        lda = (PyArray_DIM(ap1, 0) > 1 ? PyArray_DIM(ap1, 0) : 1);
+    }
+    if (PyArray_IS_F_CONTIGUOUS(ap2)) {
+        Trans2 = CblasTrans;
+        ldb = (PyArray_DIM(ap2, 0) > 1 ? PyArray_DIM(ap2, 0) : 1);
+    }
     if (typenum == NPY_DOUBLE) {
         cblas_dgemm(Order, Trans1, Trans2, L, N, M,
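What the added hunk exploits, illustrated at the numpy level (this is just the identity, not the C code itself): an F-contiguous operand is a zero-copy transpose view of a C-contiguous buffer, so gemm can be handed that buffer plus a CblasTrans flag instead of a temporary copy.

```python
import numpy as np

A = np.asfortranarray(np.random.randn(30, 20))
B = np.random.randn(20, 10)

assert A.flags.f_contiguous
assert A.T.flags.c_contiguous       # the transpose is a free view...
assert np.shares_memory(A, A.T)     # ...no data is moved

# gemm(Trans1=CblasTrans, a=A.T, b=B) computes (A.T).T @ B == A @ B:
assert np.allclose(np.dot(A.T.T, B), np.dot(A, B))
```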
On Thu, Nov 8, 2012 at 3:58 PM, Sebastian Berg sebastian@sipsolutions.net wrote:
On Fri, 2012-11-09 at 00:24 +0100, Sebastian Berg wrote:
Hey,
On Thu, 2012-11-08 at 14:44 -0800, Nicolas SCHEFFER wrote:
Well, hinted by what Fabien said, I looked at the C-level dot function. Quite verbose! But starting at line 757, we can see that it shouldn't be too much work to fix that bug (well, there is even a comment there that states just that):
https://github.com/numpy/numpy/blob/master/numpy/core/blasdot/_dotblas.c#L75...
I now think that should be the cleanest.
This would only work for gemm, though. I don't know what the benefit is for gemv, for instance, but we should make that kind of change everywhere we can. The evil PyArray_Copy is there twice, and that's what we want to get rid of.
I'm not sure, but it looks to me that removing the copy and doing the following would do the work:

Order = CblasRowMajor;
Trans1 = CblasNoTrans;
Trans2 = CblasNoTrans;
if (!PyArray_ISCONTIGUOUS(ap1)) {
    Trans1 = CblasTrans;
}
if (!PyArray_ISCONTIGUOUS(ap2)) {
    Trans2 = CblasTrans;
}

might be too easy to be true.
Sounds nice, though don't forget that the array may also be neither C- nor F-contiguous, in which case you need a copy in any case. So it would probably be more like:
if (PyArray_IS_C_CONTIGUOUS(ap1)) {
    Trans1 = CblasNoTrans;
}
else if (PyArray_IS_F_CONTIGUOUS(ap1)) {
    Trans1 = CblasTrans;
}
else {
    Trans1 = CblasNoTrans;
    PyObject *new = PyArray_Copy(ap1);
    Py_DECREF(ap1);
    ap1 = (PyArrayObject *)new;
}
On Thu, Nov 08, 2012 at 11:29:19AM -0600, Anthony Scopatz wrote:
Indeed, there is no reason not to make this available in NumPy.
+1, I agree, this should be a fix in numpy, not scipy.
I agree.
My point was a bit of a side issue: given a user's computer, I have no guarantee that numpy will be compiled with an optimized blas. In fact, I have often seen, on computing clusters where people tend to compile stuff manually, numpy built using the embedded lapack_lite. On the other hand, scipy is necessarily compiled with a blas. This is why, as a developer, I always write code that uses scipy.linalg rather than numpy.linalg. However, for 'dot', I have no way of ensuring that the blas used to compile scipy is used.
Cheers,
Gaël
On 9 Nov 2012 08:48, "Gael Varoquaux" gael.varoquaux@normalesup.org wrote:
But what if someone compiles numpy against an optimized blas (mkl, say) and then compiles SciPy against the reference blas? What do you do then!? ;-)
I mean, this would be dumb. But not really dumber than installing an optimized blas for SciPy and then configuring numpy to *not* use it...
As a higher-level point, having a scipy.dot would be a serious violation of There's Only One Way To Do It. I think that if people who are writing code against numpy/scipy have to care about our difficulties with build systems, then that's a bug in numpy/scipy and we should do a better job rather than pushing the responsibility off on our users to figure out the best combination of functions to use on each end-user's machine.
(Honestly I feel that the existence of scipy.linalg in the first place is sort of a low-level embarrassment that we should try to move away from in the long run. At least as far as the public API goes, and for all of the near-duplicate functions that are in both numpy and scipy, like qr.)
-n
On Fri, Nov 09, 2012 at 03:12:42PM +0000, Nathaniel Smith wrote:
But what if someone compiles numpy against an optimized blas (mkl, say) and then compiles SciPy against the reference blas? What do you do then!? ;-)
This could happen. But the converse happens very often. What happens is that users (e.g. on a shared computing resource) ask for a scientific python environment. The administrator then installs the packages starting from the most basic one to the most advanced one, thus starting with numpy, which can very well build without any external blas. When he gets to scipy he hits the problem that the build system does not properly detect the blas, and he solves that problem.
Also, it used to be that on the major linux distributions, numpy would not be built with an optimized lapack because numpy was in the 'base' set of packages, but lapack was not. On the contrary, scipy, being in the 'contrib' set, could depend on lapack. I just checked, and this has been fixed in the major distributions (Fedora, Debian, Ubuntu).
Now we can discuss with such problems should not happen, and put the blame on the users/administrators, the fact is that they happen often. I keep seeing environments in which np.linalg is unreasonnably slow.
Gael
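For administrators who want to sanity-check such an install, a rough heuristic is to time a matmul and look at the throughput (a sketch, not an official diagnostic; `numpy.show_config()` is real, but the idea that a reference BLAS lands far below an optimized one on the same machine is the only assumption here):

```python
import time
import numpy as np

def estimate_dot_gflops(n=500, repeats=3):
    """Rough GFLOP/s estimate for np.dot on an n x n float32 matmul.
    A reference (unoptimized) BLAS typically lands far below an
    optimized one (ATLAS, MKL, OpenBLAS) on the same machine."""
    a = np.asarray(np.random.randn(n, n), dtype=np.float32)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        np.dot(a, a)
        best = min(best, time.perf_counter() - t0)
    return 2.0 * n ** 3 / best / 1e9  # a matmul costs ~2*n^3 flops

if __name__ == "__main__":
    print("np.dot: ~%.1f GFLOP/s" % estimate_dot_gflops())
    np.show_config()  # reports which BLAS/LAPACK numpy was built against
```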
On Fri, Nov 9, 2012 at 4:25 PM, Gael Varoquaux gael.varoquaux@normalesup.org wrote:
If this is something that's been a problem for you, maybe we should start another thread on things we could do to fix it directly? Improve build instructions, advertise build systems that set up the whole environment (and thus do the right thing), make numpy's setup.py scream and yell if blas isn't available...?
-n
I too encourage users to use scipy.linalg for speed and robustness (hence calling this scipy.dot), but it just brings so much confusion! When using the scipy + numpy ecosystem, you'd almost want everything to be done with scipy so that you get the best implementation in all cases: scipy.zeros(), scipy.array(), scipy.dot(), scipy.linalg.inv().
Anyway, that is indeed for another thread. The confusion we'd like to fix here is that users shouldn't have to understand the C/F contiguity concepts to get maximum speed from np.dot().
To summarize:
- The python snippet I posted is still valid and can speed up your code if you can change all your dot() calls.
- The change in dotblas.c is a bit more problematic because it's very core. I'm having issues replicating the timings right now: I'm getting better timing for a.dot(a.T) than for a.dot(a). There might be a bug.
It's a pain to test because I cannot do the test in a single python session. I'm going to try to integrate most of your suggestions, though I cannot guarantee I'll have time to do them all.
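For readers following along, the contiguity flags in question are easy to inspect; a minimal illustration:

```python
import numpy as np

a = np.ones((3, 4))            # freshly created arrays are C-ordered
print(a.flags.c_contiguous)    # True
print(a.T.flags.f_contiguous)  # True: a transpose is an F-ordered view
# np.dot used to copy such F-ordered views into temporary C-ordered
# buffers before calling BLAS; dispatching to gemm with the right
# trans_a/trans_b flags avoids those copies entirely.
```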
-nicolas
OK, comparing apples to apples. I'm puzzled by my observations and would need input from you guys.
Using ATLAS 3.10, numpy with and without my changes, I'm getting these timings and comparisons.
#
# I. Generate matrices using regular dot
#
big = np.array(np.random.randn(2000, 2000), 'f')
np.savez('out', big=big, none=big.dot(big), both=big.T.dot(big.T), left=big.T.dot(big), right=big.dot(big.T))
#
# II. Timings with regular dot
#
In [3]: %timeit np.dot(big, big)
10 loops, best of 3: 138 ms per loop
In [4]: %timeit np.dot(big, big.T)
10 loops, best of 3: 166 ms per loop
In [5]: %timeit np.dot(big.T, big.T)
10 loops, best of 3: 193 ms per loop
In [6]: %timeit np.dot(big.T, big)
10 loops, best of 3: 165 ms per loop
#
# III. I load these arrays and time again with the "fast" dot
#
In [21]: %timeit np.dot(big, big)
10 loops, best of 3: 138 ms per loop
In [22]: %timeit np.dot(big.T, big)
10 loops, best of 3: 104 ms per loop
In [23]: %timeit np.dot(big.T, big.T)
10 loops, best of 3: 138 ms per loop
In [24]: %timeit np.dot(big, big.T)
10 loops, best of 3: 102 ms per loop
- A'.A': great!
- A.A' becomes faster than A.A !?!
#
# IV. MSE on differences
#
In [25]: np.sqrt(((arr['none'] - none)**2).sum())
Out[25]: 0.0
In [26]: np.sqrt(((arr['both'] - both)**2).sum())
Out[26]: 0.0
In [27]: np.sqrt(((arr['left'] - left)**2).sum())
Out[27]: 0.015900515
In [28]: np.sqrt(((arr['right'] - right)**2).sum())
Out[28]: 0.015331409
#
# Conclusion
#
While the MSEs are small, I'm wondering whether:
- It's a bug: the results should be exactly the same
- It's a feature: BLAS is taking shortcuts when you have A.A'. The difference is not significant. Quick: PR that asap!
I don't have enough expertise to answer that...
Thanks much!
-nicolas
Hi,
A.A slower than A.A' is not a surprise for me. The latter is far more cache friendly than the former: everything follows cache lines, so it is faster than something that uses one element from each cache line. In fact this is exactly what "proves" that the new version is correct. Good job (if all the tests were made and still pass ;) )
Cheers,
Matthieu
Oh, about the differences. If there is something like cache blocking inside ATLAS (which would make sense), the multiply-adds (MADs) are not performed in exactly the same order, and this would lead to some differences. You need to compare the MSE divided by the sum of squared values to the machine precision.
Cheers,
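Matthieu's suggestion amounts to checking a normalized (relative) error against machine precision rather than the raw MSE; a sketch (`rel_error` is a hypothetical helper, not code from the thread):

```python
import numpy as np

def rel_error(x, y):
    """Frobenius-norm error of x relative to the magnitude of y."""
    return np.sqrt(((x - y) ** 2).sum() / (y ** 2).sum())

rng = np.random.default_rng(0)
big = rng.standard_normal((200, 200), dtype=np.float32)
r1 = big @ big.T                        # BLAS may block/reorder the MADs...
r2 = big @ np.ascontiguousarray(big.T)  # ...differently for this variant
# An absolute value of ~0.016 on a 2000x2000 float32 product is tiny once
# normalized; for float32, a relative error around 1e-7..1e-5 is expected.
assert rel_error(r1, r2) < 1e-4
```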
Matthieu
-- Information System Engineer, Ph.D. Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher Music band: http://liliejay.com/
In this case it's even possible the blas is being smart enough to notice that you've passed it the same pointer twice with the transpose switch on one of them, and is just skipping half the multiplies since the output is provably symmetric. Don't know if they actually do that but...
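For what it's worth, BLAS exposes that shortcut explicitly as the ?syrk family (symmetric rank-k update), which computes A.A' while filling only one triangle of the output; a sketch using scipy's wrappers (assuming scipy.linalg.blas is available, and mirroring the triangle by hand afterwards):

```python
import numpy as np
from scipy.linalg.blas import ssyrk  # single-precision symmetric rank-k update

a = np.asarray(np.random.randn(4, 3), dtype=np.float32, order='F')
c = ssyrk(alpha=1.0, a=a)             # a @ a.T, upper triangle filled
full = c + c.T - np.diag(np.diag(c))  # mirror to recover the full matrix
assert np.allclose(full, a @ a.T, atol=1e-5)
```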
-n
Well, I have proof we're doing things right! And it also shows the power of MKL...
I checked with EPD and my python snippet (see also my first email), and all 4 timings are the same... However, if this behavior is the right one, then my python snippet should show the same trend. But it doesn't.
So I thought that indeed MKL could be tricking me, so I built scipy and timed again with the same config (ATLAS 3.10). Calling the 'dot' function I put on pastebin as dot, here are the timings:
In [8]: %timeit dot(big, big)
10 loops, best of 3: 133 ms per loop
In [9]: %timeit dot(big.T, big.T)
10 loops, best of 3: 133 ms per loop
In [10]: %timeit dot(big.T, big)
10 loops, best of 3: 97.9 ms per loop
In [11]: %timeit dot(big, big.T)
10 loops, best of 3: 96.4 ms per loop
So you guys were right: it's cache friendly. We're doing the right thing, and the differences are most likely not significant (or the snippet is wrong too)!
In [14]: np.sqrt(((dot(big, big.T) - np.dot(big, big.T))**2).sum())
Out[14]: 0.015331409
In [15]: np.sqrt(((dot(big, big) - np.dot(big, big))**2).sum())
Out[15]: 0.0
All this to say that we are getting a 2x speed improvement on a.dot(a.T), which is awesome.
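For readers who can't reach the pastebin, here is a cleaned-up sketch of the gemm-dispatch idea (a reconstruction, not the exact posted code; note the result comes back F-ordered, and sliced/non-contiguous inputs fall back to np.dot):

```python
import numpy as np
from scipy.linalg import get_blas_funcs

def fast_dot(A, B):
    """2-D matrix product dispatched to BLAS gemm without forcing
    temporary copies for transposed (C-ordered) operands; the result
    comes back in Fortran order."""
    if A.ndim != 2 or B.ndim != 2:
        return np.dot(A, B)           # only the 2-D case is handled here
    if A.shape[1] != B.shape[0]:
        raise ValueError("matrices are not aligned")
    gemm, = get_blas_funcs(('gemm',), (A, B))
    if A.flags.c_contiguous and B.flags.c_contiguous:
        return gemm(1.0, A.T, B.T, trans_a=True, trans_b=True)
    if A.flags.c_contiguous and B.flags.f_contiguous:
        return gemm(1.0, A.T, B, trans_a=True)
    if A.flags.f_contiguous and B.flags.c_contiguous:
        return gemm(1.0, A, B.T, trans_b=True)
    if A.flags.f_contiguous and B.flags.f_contiguous:
        return gemm(1.0, A, B)
    return np.dot(A, B)               # non-contiguous operands: punt to numpy

A = np.random.randn(5, 4)
B = np.random.randn(4, 3)
assert np.allclose(fast_dot(A, B), np.dot(A, B))
assert np.allclose(fast_dot(np.asfortranarray(A), B), np.dot(A, B))
```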
-nicolas
Nath: same trend if arrays are different.
On Fri, Nov 9, 2012 at 3:05 PM, Nathaniel Smith njs@pobox.com wrote:
In this case it's even possible the blas is being smart enough to notice that you've passed it the same pointer twice with the transpose switch on one of them, and is just skipping half the multiplies since the output is provably symmetric. Don't know if they actually do that but...
-n
On 9 Nov 2012 22:57, "Matthieu Brucher" matthieu.brucher@gmail.com wrote:
Hi,
A.A slower than A.A' is not a surprise for me. The latter is far more cache friendly that the former. Everything follows cache lines, so it is faster than something that will use one element from each cache line. In fact it is exactly what "proves" that the new version is correct. Good job (if all the tests were made and still pass ;) )
Cheers,
Matthieu
2012/11/9 Nicolas SCHEFFER scheffer.nicolas@gmail.com
Ok: comparing apples to apples. I'm clueless on my observations and would need input from you guys.
Using ATLAS 3.10, numpy with and without my changes, I'm getting these timings and comparisons.
# #I. Generate matrices using regular dot: # big = np.array(np.random.randn(2000, 2000), 'f'); np.savez('out', big=big, none=big.dot(big), both=big.T.dot(big.T), left=big.T.dot(big), right=big.dot(big.T))"
# #II. Timings with regular dot # In [3]: %timeit np.dot(big, big) 10 loops, best of 3: 138 ms per loop
In [4]: %timeit np.dot(big, big.T) 10 loops, best of 3: 166 ms per loop
In [5]: %timeit np.dot(big.T, big.T) 10 loops, best of 3: 193 ms per loop
In [6]: %timeit np.dot(big.T, big) 10 loops, best of 3: 165 ms per loop
# #III. I load these arrays and time again with the "fast" dot # In [21]: %timeit np.dot(big, big) 10 loops, best of 3: 138 ms per loop
In [22]: %timeit np.dot(big.T, big) 10 loops, best of 3: 104 ms per loop
In [23]: %timeit np.dot(big.T, big.T) 10 loops, best of 3: 138 ms per loop
In [24]: %timeit np.dot(big, big.T) 10 loops, best of 3: 102 ms per loop
- A'.A': great!
- A.A' becomes faster than A.A !?!
# #IV. MSE on differences # In [25]: np.sqrt(((arr['none'] - none)**2).sum()) Out[25]: 0.0
In [26]: np.sqrt(((arr['both'] - both)**2).sum()) Out[26]: 0.0
In [27]: np.sqrt(((arr['left'] - left)**2).sum()) Out[27]: 0.015900515
In [28]: np.sqrt(((arr['right'] - right)**2).sum()) Out[28]: 0.015331409
# # CCl # While the MSE are small, I'm wondering whether:
- It's a bug: it should be exactly the same
- It's a feature: BLAS is taking shortcuts when you have A.A'. The
difference is not significant. Quick: PR that asap!
I don't have enough expertise to answer that...
Thanks much!
-nicolas On Fri, Nov 9, 2012 at 2:13 PM, Nicolas SCHEFFER scheffer.nicolas@gmail.com wrote:
I too encourage users to use scipy.linalg for speed and robustness (hence calling this scipy.dot), but it just brings so much confusion! When using the scipy + numpy ecosystem, you'd almost want everything be done with scipy so that you get the best implementation in all cases: scipy.zeros(), scipy.array(), scipy.dot(), scipy.linalg.inv().
Anyway this is indeed for another thread, the confusion we'd like to fix here is that users shouldn't have to understand the C/F contiguous concepts to get the maximum speed for np.dot()
To summarize:
- The python snippet I posted is still valid and can speed up your
code if you can change all your dot() calls.
- The change in dotblas.c is a bit more problematic because it's very
core. I'm having issues right now to replicate the timings, I've got better timing for a.dot(a.T) than for a.dot(a). There might be a bug.
It's a pain to test because I cannot do the test in a single python session. I'm going to try to integrate most of your suggestions, I cannot guarantee I'll have time to do them all though.
-nicolas On Fri, Nov 9, 2012 at 8:56 AM, Nathaniel Smith njs@pobox.com wrote:
On Fri, Nov 9, 2012 at 4:25 PM, Gael Varoquaux gael.varoquaux@normalesup.org wrote:
On Fri, Nov 09, 2012 at 03:12:42PM +0000, Nathaniel Smith wrote: > But what if someone compiles numpy against an optimized blas (mkl, > say) and then compiles SciPy against the reference blas? What do you > do then!? ;-)
This could happen. But the converse happens very often. What happens is that users (eg on shared computing resource) ask for a scientific python environment. The administrator than installs the package starting from the most basic one, to the most advanced one, thus starting with numpy that can very well build without any external blas. When he gets to scipy he hits the problem that the build system does not detect properly the blas, and he solves that problem.
Also, it used to be that on the major linux distributions, numpy would not be built with an optimized lapack because numpy was in the 'base' set of packages, but lapack was not. On the contrary, scipy, being in the 'contrib' set, could depend on lapack. I just checked, and this has been fixed in the major distributions (Fedora, Debian, Ubuntu).
Now we can discuss whether such problems should happen and put the blame on the users/administrators, but the fact is that they happen often. I keep seeing environments in which np.linalg is unreasonably slow.
If this is something that's been a problem for you, maybe we should start another thread on things we could do to fix it directly? Improve build instructions, advertise build systems that set up the whole environment (and thus do the right thing), make numpy's setup.py scream and yell if blas isn't available...?
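As an aside (my own suggestion, not from the thread), the quickest way for a user to spot one of these slow environments is to ask numpy what it was built against and to time a mid-sized dot:

```python
import numpy as np
import time

# Print the BLAS/LAPACK that numpy was built against; a reference
# (unoptimized) BLAS here usually explains an "unreasonably slow"
# np.linalg.
np.show_config()

# Crude smoke test: with an optimized BLAS this takes milliseconds.
A = np.random.randn(500, 500)
t0 = time.perf_counter()
A.dot(A)
print("500x500 dot: %.1f ms" % ((time.perf_counter() - t0) * 1e3))
```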
-n
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
-- Information System Engineer, Ph.D. Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher Music band: http://liliejay.com/
On 11/09/2012 11:57 PM, Matthieu Brucher wrote:
Hi,
A.A being slower than A.A' is not a surprise for me. The latter is far more cache friendly than the former: everything follows cache lines, so it is faster than something that uses one element from each cache line. In fact it is exactly what "proves" that the new version is correct. Good job (if all the tests were made and still pass ;) )
Cache lines shouldn't matter much with a decent BLAS?
http://dl.acm.org/citation.cfm?id=1356053
(Googling for "Anatomy of High-Performance Matrix Multiplication" will turn up a preprint outside the paywall; Google won't give me a pasteable URL for such a long search result.)
Dag Sverre
Cheers,
Matthieu
2012/11/9 Nicolas SCHEFFER <scheffer.nicolas@gmail.com>
Ok: comparing apples to apples. I'm clueless on my observations and would need input from you guys.

Using ATLAS 3.10, numpy with and without my changes, I'm getting these timings and comparisons.

#
# I. Generate matrices using regular dot:
#
big = np.array(np.random.randn(2000, 2000), 'f')
np.savez('out', big=big, none=big.dot(big), both=big.T.dot(big.T), left=big.T.dot(big), right=big.dot(big.T))

#
# II. Timings with regular dot
#
In [3]: %timeit np.dot(big, big)
10 loops, best of 3: 138 ms per loop
In [4]: %timeit np.dot(big, big.T)
10 loops, best of 3: 166 ms per loop
In [5]: %timeit np.dot(big.T, big.T)
10 loops, best of 3: 193 ms per loop
In [6]: %timeit np.dot(big.T, big)
10 loops, best of 3: 165 ms per loop

#
# III. I load these arrays and time again with the "fast" dot
#
In [21]: %timeit np.dot(big, big)
10 loops, best of 3: 138 ms per loop
In [22]: %timeit np.dot(big.T, big)
10 loops, best of 3: 104 ms per loop
In [23]: %timeit np.dot(big.T, big.T)
10 loops, best of 3: 138 ms per loop
In [24]: %timeit np.dot(big, big.T)
10 loops, best of 3: 102 ms per loop

1. A'.A': great!
2. A.A' becomes faster than A.A !?!

#
# IV. MSE on differences
#
In [25]: np.sqrt(((arr['none'] - none)**2).sum())
Out[25]: 0.0
In [26]: np.sqrt(((arr['both'] - both)**2).sum())
Out[26]: 0.0
In [27]: np.sqrt(((arr['left'] - left)**2).sum())
Out[27]: 0.015900515
In [28]: np.sqrt(((arr['right'] - right)**2).sum())
Out[28]: 0.015331409

#
# CCl
#
While the MSEs are small, I'm wondering whether:
- It's a bug: it should be exactly the same
- It's a feature: BLAS is taking shortcuts when you have A.A'. The difference is not significant. Quick: PR that asap!

I don't have enough expertise to answer that...

Thanks much!
-nicolas
Note also that OpenBLAS claims performance as good as MKL's on Sandy Bridge processors.
https://github.com/xianyi/OpenBLAS/wiki/faq#wiki-sandybridge_perf
George Nurser.
I've pushed my code to a branch here https://github.com/leschef/numpy/tree/faster_dot with the commit https://github.com/leschef/numpy/commit/ea037770e03f23aca1a06274a1a8e8bf0e0e...
Let me know if that's enough to create a pull request.
Thanks,
-nicolas
On Mon, Nov 12, 2012 at 9:08 PM, Nicolas SCHEFFER scheffer.nicolas@gmail.com wrote:
I've pushed my code to a branch here https://github.com/leschef/numpy/tree/faster_dot with the commit https://github.com/leschef/numpy/commit/ea037770e03f23aca1a06274a1a8e8bf0e0e...
Let me know if that's enough to create a pull request.
"Pull request" basically means "this code is good enough to look at" -- the name is a bit misleading. It just creates a discussion thread where it's easy to look at the change, comment on pieces, etc., and you can continue to update the code after you start the pull request.
-n
Yep exactly.
I just want to make sure that we've talked enough about the principle first (i.e. goals and technical approach), and that the code is indeed good enough to look at.
I get from your answer that it is, so I went ahead: https://github.com/numpy/numpy/pull/2730
Thanks
-nicolas
On Fri, 2012-11-09 at 14:52 -0800, Nicolas SCHEFFER wrote:
Ok: comparing apples to apples. I'm clueless on my observations and would need input from you guys.
Using ATLAS 3.10, numpy with and without my changes, I'm getting these timings and comparisons.
#
# I. Generate matrices using regular dot:
#
big = np.array(np.random.randn(2000, 2000), 'f')
np.savez('out', big=big, none=big.dot(big), both=big.T.dot(big.T), left=big.T.dot(big), right=big.dot(big.T))
#
# II. Timings with regular dot
#
In [3]: %timeit np.dot(big, big)
10 loops, best of 3: 138 ms per loop
In [4]: %timeit np.dot(big, big.T) 10 loops, best of 3: 166 ms per loop
In [5]: %timeit np.dot(big.T, big.T) 10 loops, best of 3: 193 ms per loop
In [6]: %timeit np.dot(big.T, big) 10 loops, best of 3: 165 ms per loop
#
# III. I load these arrays and time again with the "fast" dot
#
In [21]: %timeit np.dot(big, big)
10 loops, best of 3: 138 ms per loop
In [22]: %timeit np.dot(big.T, big) 10 loops, best of 3: 104 ms per loop
In [23]: %timeit np.dot(big.T, big.T) 10 loops, best of 3: 138 ms per loop
In [24]: %timeit np.dot(big, big.T) 10 loops, best of 3: 102 ms per loop
- A'.A': great!
- A.A' becomes faster than A.A !?!
#
# IV. MSE on differences
#
In [25]: np.sqrt(((arr['none'] - none)**2).sum())
Out[25]: 0.0
In [26]: np.sqrt(((arr['both'] - both)**2).sum()) Out[26]: 0.0
In [27]: np.sqrt(((arr['left'] - left)**2).sum()) Out[27]: 0.015900515
In [28]: np.sqrt(((arr['right'] - right)**2).sum()) Out[28]: 0.015331409
#
# CCl
#
While the MSEs are small, I'm wondering whether:
- It's a bug: it should be exactly the same
- It's a feature: BLAS is taking shortcuts when you have A.A'. The difference is not significant. Quick: PR that asap!
I think it's a feature, because memory can be accessed in a more regular fashion when one array is Fortran-contiguous and the other C-contiguous. I.e., for A.B', fetching the first row of A is ideal, since it sits behind itself in memory (C order), and it is multiplied with the first column of B', which, being transposed, is in Fortran order and thus also one contiguous chunk in memory.
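On the bug-or-feature question for the small MSEs: a quick check I ran myself (not from the thread) suggests they are consistent with float32 rounding, since different gemm code paths accumulate the same dot products in different orders. Comparing a float32 product against a float64 reference shows the discrepancy is down at rounding-noise level:

```python
import numpy as np

# Compare a float32 gemm result against a float64 reference; the
# relative error should be a small multiple of float32 eps (~1.2e-7),
# i.e. rounding noise rather than a wrong answer.
rng = np.random.default_rng(0)
big = rng.standard_normal((200, 200)).astype(np.float32)

ref = big.T.astype(np.float64) @ big.astype(np.float64)  # float64 reference
f32 = (big.T @ big).astype(np.float64)                   # float32 result

rel_err = np.abs(f32 - ref).max() / np.abs(ref).max()
print("max relative error vs float64:", rel_err)
assert rel_err < 1e-3  # rounding noise, not a bug
```

The ~0.016 Frobenius norms reported for 2000x2000 float32 matrices are in the same ballpark as what such accumulation-order differences predict.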
I don't have enough expertise to answer that...