Oh, about the differences. If there is something like cache blocking inside
ATLAS (which would make sense), the MAD are not in exactly the same order
and this would lead to some differences. You need to compare the MSE/sum of
values squared to the machine precision.
Cheers,
2012/11/9 Matthieu Brucher
Hi,
A.A slower than A.A' is not a surprise for me. The latter is far more cache friendly that the former. Everything follows cache lines, so it is faster than something that will use one element from each cache line. In fact it is exactly what "proves" that the new version is correct. Good job (if all the tests were made and still pass ;) )
Cheers,
Matthieu
2012/11/9 Nicolas SCHEFFER
Ok: comparing apples to apples. I'm clueless on my observations and would need input from you guys.
Using ATLAS 3.10, numpy with and without my changes, I'm getting these timings and comparisons.
# #I. Generate matrices using regular dot: # big = np.array(np.random.randn(2000, 2000), 'f'); np.savez('out', big=big, none=big.dot(big), both=big.T.dot(big.T), left=big.T.dot(big), right=big.dot(big.T))"
# #II. Timings with regular dot # In [3]: %timeit np.dot(big, big) 10 loops, best of 3: 138 ms per loop
In [4]: %timeit np.dot(big, big.T) 10 loops, best of 3: 166 ms per loop
In [5]: %timeit np.dot(big.T, big.T) 10 loops, best of 3: 193 ms per loop
In [6]: %timeit np.dot(big.T, big) 10 loops, best of 3: 165 ms per loop
# #III. I load these arrays and time again with the "fast" dot # In [21]: %timeit np.dot(big, big) 10 loops, best of 3: 138 ms per loop
In [22]: %timeit np.dot(big.T, big) 10 loops, best of 3: 104 ms per loop
In [23]: %timeit np.dot(big.T, big.T) 10 loops, best of 3: 138 ms per loop
In [24]: %timeit np.dot(big, big.T) 10 loops, best of 3: 102 ms per loop
1. A'.A': great! 2. A.A' becomes faster than A.A !?!
# #IV. MSE on differences # In [25]: np.sqrt(((arr['none'] - none)**2).sum()) Out[25]: 0.0
In [26]: np.sqrt(((arr['both'] - both)**2).sum()) Out[26]: 0.0
In [27]: np.sqrt(((arr['left'] - left)**2).sum()) Out[27]: 0.015900515
In [28]: np.sqrt(((arr['right'] - right)**2).sum()) Out[28]: 0.015331409
# # CCl # While the MSE are small, I'm wondering whether: - It's a bug: it should be exactly the same - It's a feature: BLAS is taking shortcuts when you have A.A'. The difference is not significant. Quick: PR that asap!
I don't have enough expertise to answer that...
Thanks much!
I too encourage users to use scipy.linalg for speed and robustness (hence calling this scipy.dot), but it just brings so much confusion! When using the scipy + numpy ecosystem, you'd almost want everything be done with scipy so that you get the best implementation in all cases: scipy.zeros(), scipy.array(), scipy.dot(), scipy.linalg.inv().
Anyway this is indeed for another thread, the confusion we'd like to fix here is that users shouldn't have to understand the C/F contiguous concepts to get the maximum speed for np.dot()
To summarize: - The python snippet I posted is still valid and can speed up your code if you can change all your dot() calls. - The change in dotblas.c is a bit more problematic because it's very core. I'm having issues right now to replicate the timings, I've got better timing for a.dot(a.T) than for a.dot(a). There might be a bug.
It's a pain to test because I cannot do the test in a single python session. I'm going to try to integrate most of your suggestions, I cannot guarantee I'll have time to do them all though.
-nicolas On Fri, Nov 9, 2012 at 8:56 AM, Nathaniel Smith
wrote: On Fri, Nov 9, 2012 at 4:25 PM, Gael Varoquaux
wrote: On Fri, Nov 09, 2012 at 03:12:42PM +0000, Nathaniel Smith wrote:
But what if someone compiles numpy against an optimized blas (mkl, say) and then compiles SciPy against the reference blas? What do you do then!? ;-)
This could happen. But the converse happens very often. What happens is that users (eg on shared computing resource) ask for a scientific
-nicolas On Fri, Nov 9, 2012 at 2:13 PM, Nicolas SCHEFFER
wrote: python environment. The administrator than installs the package starting from the most basic one, to the most advanced one, thus starting with numpy that can very well build without any external blas. When he gets to scipy he hits the problem that the build system does not detect properly the blas, and he solves that problem.
Also, it used to be that on the major linux distributions, numpy would not be build with an optimize lapack because numpy was in the 'base' set of packages, but not lapack. On the contrary, scipy being in the 'contrib' set, it could depend on lapack. I just checked, and this has been fixed in the major distributions (Fedora, Debian, Ubuntu).
Now we can discuss with such problems should not happen, and put the blame on the users/administrators, the fact is that they happen often. I keep seeing environments in which np.linalg is unreasonnably slow.
If this is something that's been a problem for you, maybe we should start another thread on things we could do to fix it directly? Improve build instructions, advertise build systems that set up the whole environment (and thus do the right thing), make numpy's setup.py scream and yell if blas isn't available...?
-n _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
-- Information System Engineer, Ph.D. Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher Music band: http://liliejay.com/
-- Information System Engineer, Ph.D. Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher Music band: http://liliejay.com/