[Numpy-discussion] Histograms via indirect index arrays

Thu Mar 30 08:53:05 EST 2006

Norbert Nemec wrote:

>Sorry, Travis, I missed your answer for a week...
>
>I confess, my efficiency concerns about the histogram routine are to
>some extent, of theoretical nature. However, given the widespread use of
>the function and considering the simplicity of the solution, I think it
>is still worth thinking about it:
>
>The current histogram routine is based on a "sort" of the incoming data.
>Quicksort is usually O(N log N), worst case O(N**2). Doing the histogram
>by simply counting is a simple O(N) procedure. The absolute cost of the
>quicksort algorithm is low, but the simple counting should be even
>lower, if done efficiently in C.
>  
>

I'm not so sure about this analysis: histogram allows aribtrarily spaced 
bins, so I think the actual comparison is O(N log N) versus O(N log M) 
where M is the number of bins.   The number of bins will typically be 
much lower than N it's true, but the log will tend to wash that out 
some. Take, for example, 1024 items into 32 bins. Then log N is 10 
versus log M being 5. This difference is small enough that probably the 
differences in constant factors will dominate.

>What probably weighs heavier though, is the memory usage: sorting can
>either be done in place, destroying the original data, or requires a
>copy. 
>
I wouldn't worry about the memory usage itself so much. We all have 
oodles of RAM and virtual memory these days anyway. In addition, you'll 
probably end up copying discontiguous data anyway, which one ends up 
with rather frequently.  However, the counting algorithm should be more 
cache friendly than the sort based on. I find that operations on 
matrices that don't fit in the cache are dominated by memory access 
time, not CPU time. On my machine that happens somewhere between 10k and 
100k elements. For arrays with 100k or so elements, the counting 
algorithm might be a big win.

>Counting could be done even on an iterator without storing all the
>data.
>  
>
This is true. However, if you go through the iterator interface, it will 
be much slower than directly addressing the array memory. Such a thing 
might be cool to have somewhere (itertools or something), but probably 
isn't appropriate for numpy.

>Lacking a comparable implementation of the counting algorithm, I have
>not done any actual speed comparisons. Why nobody ever found this
>inefficiency, I cannot say. Probably nobody ever had a real comparison
>how fast the histogram could be. Unless you are dealing with really huge
>data, the difference will not show up.
>  
>
This is the crux of the issue I think. The implementaion of histogram 
looks fine for what it does. For reasonably large arrays, the overhead 
of it being implemented in Python will be negligible. Without someone 
who actually needs something different, a new implementation is liable 
to wander off into the realm of pointless microoptimization. So, my 
advice is to leave it alone till it actually causes you problems.

>Anyway, as I said, a solution should be simple to code once it is
>decided how to do it. Simply recoding the histogram routine itself in C
>would be the minimal solution. Implementing a more flexible counting
>routine that can be used in other contexts as well would certainly be
>more desirable, but I have no clear vision what that should look like.
>  
>
Well if you were going to do this. I would code a drop in replacement 
for the current histogram. No iterators (they belong in a different 
library). Variable spaces bins allowed, etc. It doesn't sound like we 
need a different histogram function and the more overlapping functions 
that there are, the less each individual one gets used and tested and 
the more likely that bugs and bitrot set in. What I'd acutally do is 
implent something that behaved like this:

def histogram_core(data, bins):
    # I hope I grabbed the right stuff out histogram_core
    # bins is an actual list of bins edges as per histogram
    # which could really us a docstring
    n = sort(a).searchsorted(bins)
    n = concatenate([n, [len(a)]])
    n = n[1:]-n[:-1]
    return n

Except written in C and using a counting algorithm. Then the histogram 
can be just a thin wrapper over histogram_core. Then test it with a 
bunch of different array sizes and bin counts and see how it does. My 
guess is that the sort based algoritm will do better in some cases 
(arrays with 10k or fewer elements and lots of bins) while the counting 
algorithm does better in others (few bins or 100k or more elements). But 
we'll see. Or maybe we won't.

Regards,

-tim

>Greetings,
>Norbert
>
>
>
>Travis Oliphant wrote:
>
>  
>
>>Norbert Nemec wrote:
>>
>>    
>>
>>>I have a very much related problem: Not only that the idea described by
>>>Mads Ipsen does not work, but I could generally find no efficient way to
>>>do a "counting" of elements in an array, as it is needed for a
>>>histogram.
>>>  
>>>      
>>>
>>This may be something we are lacking. 
>>
>>It depends on what you mean by efficient, I suppose.  The sorting
>>algorithms are very fast, and the histogram routines that are
>>scattered all over the place (including the histogram function in
>>numpy) has been in use for a long time, and you are the first person
>>to complain of its efficiency.  That isn't to say your complaint may
>>not be valid, it's just that for most people the speed has been
>>sufficient.
>>
>>    
>>
>>>What would instead be needed is a function that simply gives the count
>>>of occurances of given values in a given array:
>>>  
>>>      
>>>
>>I presume you are talking of "integer" arrays,  since
>>equality-comparison of floating-point values is usually not very
>>helpful so most histograms on floating-point values are given in terms
>>of bins.  Thus, the searchsorted function uses bins for it's
>>"counting" operation.
>>
>>    
>>
>>> 
>>>
>>>      
>>>
>>>>>>[4,5,2,3,2,1,4].count([0,1,2,3,4,5])
>>>>>>        
>>>>>>            
>>>>>>
>>>[0,1,2,1,1,2]
>>>
>>>  
>>>      
>>>
>>A count function for integer arrays could certainly be written using a
>>C-loop.  But, I would first just use histogram([4,5,2,3,2,1,4],
>>[0,1,2,3,4,5]) and make sure that's really too slow, before worrying
>>about it too much.
>>
>>Also, I according to the above function, the right answer is:
>>
>>[0, 1, 2, 1, 2, 1]
>>
>>
>>Best,
>>
>>
>>-Travis
>>
>>
>>
>>
>>-------------------------------------------------------
>>This SF.Net email is sponsored by xPML, a groundbreaking scripting
>>language
>>that extends applications into web and mobile media. Attend the live
>>webcast
>>and join the prime developer group breaking into this new coding
>>territory!
>>http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
>>_______________________________________________
>>Numpy-discussion mailing list
>>Numpy-discussion at lists.sourceforge.net
>>https://lists.sourceforge.net/lists/listinfo/numpy-discussion
>>
>>
>>    
>>
>
>
>
>-------------------------------------------------------
>This SF.Net email is sponsored by xPML, a groundbreaking scripting language
>that extends applications into web and mobile media. Attend the live webcast
>and join the prime developer group breaking into this new coding territory!
>http://sel.as-us.falkag.net/sel?cmd=lnk&kid=110944&bid=241720&dat=121642
>_______________________________________________
>Numpy-discussion mailing list
>Numpy-discussion at lists.sourceforge.net
>https://lists.sourceforge.net/lists/listinfo/numpy-discussion
>
>
>  
>