Going through some of the recent threads on similar problems, I'm trying to discern which is "best." I have X is T x i, theta is T x i x j. I want a T by j array that contains X[t].T.dot(theta[t]) along axis 0. Say, T,i,j = 166,7,3 X = np.random.random((T,i)) theta = np.random.random((T,i,j)) # Using sum F1 = (X[:,:,None] * theta).sum(1) # Using a loop F2 = np.zeros((T,j)) for t in range(T): for j_idx in range(j): for i_idx in range(7): F2[t,j_idx] += X[t,i_idx] * theta[t,i_idx,j_idx] # One way with dot, keeps an extra T index F3 = np.squeeze(np.dot(X[:,None,:],theta)).reshape(-1,3)[::167] # the above is way fast than F3_2 = np.dot(X,theta).reshape(-1,3)[::167] # zipped arrays F4 = np.asarray([np.dot(x,theta_i) for x,theta_i in zip(X,theta)]) It seems that F3 is the fastest way to do this given T,i,j, but it seems inefficient to compute things I don't want. Does it make sense that F3 is faster than F3_2? I'm using generic Ubuntu ATLAS on this machine. Is there a more efficient way to do this? I'm not coming up with one. Skipper