
Hi list,
Suppose I have array a with dimensions (d1, d3) and array b with dimensions (d2, d3). I want to compute array c with dimensions (d1, d2) holding the squared euclidian norms of vectors in a and b with size d3.
My first take was to use a python level loop:
from numpy import * c = array([sum((a_i - b) ** 2, axis=1) for a_i in a])
But this is too slow and allocate a useless temporary list of python references.
To avoid the python level loop I then tried to use broadcasting as follows:
c = sum((a[:,newaxis,:] - b) ** 2, axis=2)
But this build a useless and huge (d1, d2, d3) temporary array that does not fit in memory for large values of d1, d2 and d3...
Do you have any better idea? I would like to simulate a runtime behavior similar to:
c = dot(a, b.T)
but for for squared euclidian norms instead of dotproducts.
I can always write a the code in C and wrap it with ctypes but I wondered whether this is possible only with numpy.

Hi Olivier
2008/12/4 Olivier Grisel olivier.grisel@ensta.org:
To avoid the python level loop I then tried to use broadcasting as follows:
c = sum((a[:,newaxis,:] - b) ** 2, axis=2)
But this build a useless and huge (d1, d2, d3) temporary array that does not fit in memory for large values of d1, d2 and d3...
Does numpy.lib.broadcast_arrays do what you need?
Regards Stéfan

2008/12/4 Stéfan van der Walt stefan@sun.ac.za:
Hi Olivier
2008/12/4 Olivier Grisel olivier.grisel@ensta.org:
To avoid the python level loop I then tried to use broadcasting as follows:
c = sum((a[:,newaxis,:] - b) ** 2, axis=2)
But this build a useless and huge (d1, d2, d3) temporary array that does not fit in memory for large values of d1, d2 and d3...
Does numpy.lib.broadcast_arrays do what you need?
That looks exactly what I am looking for. Apparently this is new in 1.2 since I cannot find it in the 1.1 version of my system.
Thanks,

On Thu, Dec 4, 2008 at 8:26 AM, Olivier Grisel olivier.grisel@ensta.orgwrote:
Hi list,
Suppose I have array a with dimensions (d1, d3) and array b with dimensions (d2, d3). I want to compute array c with dimensions (d1, d2) holding the squared euclidian norms of vectors in a and b with size d3.
Just to clarify the problem a bit, it looks like you want to compute the squared euclidean distance between every vector in a and every vector in b, i.e., a distance matrix. Is that correct? Also, how big are d1,d2,d3?
If you *are* looking to compute the distance matrix I suspect your end goal is something beyond that. Could you describe what you are trying to do? I could be that scipy.spatial or scipy.cluster are what you should look at.
<snip>
Chuck

2008/12/4 Charles R Harris charlesr.harris@gmail.com:
On Thu, Dec 4, 2008 at 8:26 AM, Olivier Grisel olivier.grisel@ensta.org wrote:
Hi list,
Suppose I have array a with dimensions (d1, d3) and array b with dimensions (d2, d3). I want to compute array c with dimensions (d1, d2) holding the squared euclidian norms of vectors in a and b with size d3.
Just to clarify the problem a bit, it looks like you want to compute the squared euclidean distance between every vector in a and every vector in b, i.e., a distance matrix. Is that correct? Also, how big are d1,d2,d3?
I would target d1 >> d2 ~ d3 with d1 as large as possible to fit in memory and d2 and d3 in the order of a couple hundreds or thousands for a start.
If you *are* looking to compute the distance matrix I suspect your end goal is something beyond that. Could you describe what you are trying to do?
My end goal it to compute the activation of an array of Radial Basis Function units where the activation of unit with center b_j for data vector a_i is given by:
f(a_i, b_j) = exp(-||a_i - bj|| ** 2 / (2 * sigma))
The end goal is to have building blocks of various parameterized array of homogeneous units (linear, sigmoid and RBF) along with their gradient in parameter space so as too build various machine learning algorithms such as multi layer perceptrons with various training strategies such as Stochastic Gradient Descent. That code might be integrated into the Modular Data Processing (MPD toolkit) project [1] at some point.
The current stat of the python code is here:
http://www.bitbucket.org/ogrisel/oglab/src/186eab341408/simdkernel/src/simdk...
You can find an SSE optimized C implementation wrapped with ctypes here:
http://www.bitbucket.org/ogrisel/oglab/src/186eab341408/simdkernel/src/simdk... http://www.bitbucket.org/ogrisel/oglab/src/186eab341408/simdkernel/src/simdk...
It could be that scipy.spatial or scipy.cluster are what you should look at.
I'll have a look at those, thanks for the pointer.
[1] http://mdp-toolkit.sourceforge.net/
participants (3)
-
Charles R Harris
-
Olivier Grisel
-
Stéfan van der Walt