Basically, what you want to do is a histogram. Numpy has that functionality built in. However: the version built in to numpy is about as suboptimal as yours. The problem is the unnecessary sorting of the data. In principle, a histogram does not need any sorting, so it could be done in strictly O(N). I don't see any simply way of doing so efficiently in numpy without resorting to a costly loop. We had a discussion about this a while ago but I cannot remember any resolution. The proper thing to do would probably be a C routine. If performance is crucial to you, I would suggest using weave. A minimal example as starting point is given below. Greetings, Norbert --------------------------- #!/usr/bin/env python import numpy import scipy.weave import scipy.weave.converters def histogram(data, nR): res = numpy.zeros(nR,int) dataflat = data.flatten() Ndata = len(dataflat) scipy.weave.inline(""" for(int i=0;i<Ndata;i++) { int d = dataflat(i); if(d < nR && d >= 0) { res(d) += 1; } } """, arg_names = ["dataflat","res","Ndata","nR"], type_converters = scipy.weave.converters.blitz, ) return res a = numpy.array([0,3,1,2,3,5,2,3,2,4,5,1,3,4,2]) print a print histogram(a,4) --------------------------- Robin wrote:
Hello,
As a converting MATLAB user I am bringing over some of my code. A core function I need is to sample the probability distribution of a given set of data. Sometimes these sets are large so I would like to make it as efficient as possible. (the data are integers representing members of a discrete space)
In MATLAB the best way I found was the "diff of find of diff" trick which resulted in the completely vectorised solution (below). Does it make sense to translate this into numpy? I don't have a feel yet for what is fast/slow - are the functions below built in and so quick (I thought diff might not be).
Is there a better/more pythonic way to do it?
-------- function Pr=prob(data, nR)
Pr = zeros(nR,1); % diff of find of diff trick for counting number of elements temp = sort(data(data>0)); % want vector excluding P(0) dtemp = diff([temp;max(temp)+1]); count = diff(find([1;dtemp])); indx = temp(dtemp>0); Pr(indx)= count ./ numel(data); % probability --------
Thanks
Robin ------------------------------------------------------------------------
_______________________________________________ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion