[Numpy-discussion] Histograms via indirect index arrays

Thu Mar 30 06:14:17 EST 2006

Sorry, Travis, I missed your answer for a week...

I confess, my efficiency concerns about the histogram routine are to
some extent, of theoretical nature. However, given the widespread use of
the function and considering the simplicity of the solution, I think it
is still worth thinking about it:

The current histogram routine is based on a "sort" of the incoming data.
Quicksort is usually O(N log N), worst case O(N**2). Doing the histogram
by simply counting is a simple O(N) procedure. The absolute cost of the
quicksort algorithm is low, but the simple counting should be even
lower, if done efficiently in C.

What probably weighs heavier though, is the memory usage: sorting can
either be done in place, destroying the original data, or requires a
copy. Counting could be done even on an iterator without storing all the
data.

Lacking a comparable implementation of the counting algorithm, I have
not done any actual speed comparisons. Why nobody ever found this
inefficiency, I cannot say. Probably nobody ever had a real comparison
how fast the histogram could be. Unless you are dealing with really huge
data, the difference will not show up.

Anyway, as I said, a solution should be simple to code once it is
decided how to do it. Simply recoding the histogram routine itself in C
would be the minimal solution. Implementing a more flexible counting
routine that can be used in other contexts as well would certainly be
more desirable, but I have no clear vision what that should look like.

Greetings,
Norbert

Travis Oliphant wrote:

> Norbert Nemec wrote:
>
>> I have a very much related problem: Not only that the idea described by
>> Mads Ipsen does not work, but I could generally find no efficient way to
>> do a "counting" of elements in an array, as it is needed for a
>> histogram.
>>   
>
> This may be something we are lacking. 
>
> It depends on what you mean by efficient, I suppose.  The sorting
> algorithms are very fast, and the histogram routines that are
> scattered all over the place (including the histogram function in
> numpy) has been in use for a long time, and you are the first person
> to complain of its efficiency.  That isn't to say your complaint may
> not be valid, it's just that for most people the speed has been
> sufficient.
>
>> What would instead be needed is a function that simply gives the count
>> of occurances of given values in a given array:
>>   
>
> I presume you are talking of "integer" arrays,  since
> equality-comparison of floating-point values is usually not very
> helpful so most histograms on floating-point values are given in terms
> of bins.  Thus, the searchsorted function uses bins for it's
> "counting" operation.
>
>>  
>>
>>>>> [4,5,2,3,2,1,4].count([0,1,2,3,4,5])
>>>>>         
>>>>
>> [0,1,2,1,1,2]
>>
>>   
>
>
> A count function for integer arrays could certainly be written using a
> C-loop.  But, I would first just use histogram([4,5,2,3,2,1,4],
> [0,1,2,3,4,5]) and make sure that's really too slow, before worrying
> about it too much.
>
> Also, I according to the above function, the right answer is:
>
> [0, 1, 2, 1, 2, 1]
>
>
> Best,
>
>
> -Travis
>
>
>
>
> -------------------------------------------------------
> This SF.Net email is sponsored by xPML, a groundbreaking scripting
> language
> that extends applications into web and mobile media. Attend the live
> webcast
> and join the prime developer group breaking into this new coding
> territory!
> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/numpy-discussion
>
>