[Numpy-discussion] Histograms of extremely large data sets

Thu Dec 14 15:15:00 EST 2006

I've attached the newest version of my benchmarking code, and added one 
more evenly spaced version that works on contiguous arrays.  Turns out 
that, the numpy iterator interface overhead is noticeably slower when we 
are using faster algorithms.  The fastest (but least flexible) version 
is 5x faster than Rick's.  So, we're still not at 10x that IDL gives you.

eric

C:\eric\code\histogram>c:\Python24\python.exe histogram_speed.py
type: uint8
millions of elements: 10.0
sec (C indexing based): 0.944789631084 10000000
sec (numpy iteration based): 1.04402933892 10000000
sec (rick's pure python): 0.703124279762 10000000
sec (evenly spaced): 0.231293921434 10000000
sec (evenly spaced): 0.139521643114 10000000

Summary:
                case    sec     speed-up
  weave_1d_arbitrary    0.944790        0.744213
  weave_nd_arbitrary    1.044029        0.673472
     ricks_arbitrary    0.703124        1.000000
       weave_nd_even    0.231294        3.039960
       weave_1d_even    0.139522        5.039536

eric jones wrote:
> I just noticed a bug in this code.  "PyArray_ITER_NEXT(iter);" should be moved out of the if statement.
>
> eric
>
> eric jones wrote:
>   
>> Rick White wrote:
>>     
>>> Just so we don't get too smug about the speed, if I do this in IDL 
>>> on  the same machine it is 10 times faster (0.28 seconds instead of 
>>> 4  seconds).  I'm sure the IDL version uses the much faster approach 
>>> of  just sweeping through the array once, incrementing counts in the  
>>> appropriate bins.  It only handles equal-sized bins, so it is not as  
>>> general as the numpy version -- but equal-sized bins is a very 
>>> common  case.  I'd still like to see a C version of histogram (which 
>>> I guess  would need to be a ufunc) go into the core numpy.
>>>   
>>>       
>> Yes, this gets rid of the search, and indices can just be caluclated 
>> from offsets.  I've attached a modified weaved histogram that takes 
>> this approach.  Running the snippet below on my machine takes .118 sec 
>> for the evenly binned weave algorithm and 0.385 sec for Rick's 
>> algorithm on 5 million elements.  That is close to 4x  faster (but not 
>> 10x...), so there is indeed some speed to be gained for the common 
>> special case.  I don't know if the code I wrote has a 2x gain left in 
>> it, but I've spent zero time optimizing it.  I'd bet it can be 
>> improved substantially.
>>
>> eric
>>
>> ### test_weave_even_histogram.py
>>
>> from numpy import arange, product, sum, zeros, uint8
>> from numpy.random import randint
>>
>> import weave_even_histogram
>>
>> import time
>>
>> shape = 1000,1000,5
>> size = product(shape)
>> data = randint(0,256,size).astype(uint8)
>> bins = arange(256+1)
>>
>> print 'type:', data.dtype
>> print 'millions of elements:', size/1e6
>>
>> bin_start = 0
>> bin_size = 1
>> bin_count = 256
>> t1 = time.clock()
>> res = weave_even_histogram.histogram(data, bin_start, bin_size, 
>> bin_count)
>> t2 = time.clock()
>> print 'sec (evenly spaced):', t2-t1, sum(res)
>> print res
>>
>>
>>     
>>>                     Rick
>>> _______________________________________________
>>> Numpy-discussion mailing list
>>> Numpy-discussion at scipy.org
>>> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>>>
>>>   
>>>       
>> ------------------------------------------------------------------------
>>
>> from numpy import array, zeros, asarray, sort, int32
>> from scipy import weave
>> from typed_array_converter import converters
>>
>> def histogram(ary, bin_start, bin_size, bin_count):
>>     
>>     ary = asarray(ary)
>>     
>>     # Create an array to hold the histogram count results.
>>     results = zeros(bin_count,dtype=int32)
>>     
>>     # The C++ code that actually does the histogramming.    
>>     code = """
>>            PyArrayIterObject *iter = (PyArrayIterObject*)PyArray_IterNew(py_ary);
>>            
>>            while(iter->index < iter->size)
>>            {
>>            
>>                //////////////////////////////////////////////////////////
>>                // binary search
>>                //////////////////////////////////////////////////////////
>>                
>>                // This requires an update to weave 
>>                ary_data_type value = *((ary_data_type*)iter->dataptr);
>>                if (value>=bin_start)
>>                {
>>                    int bin_index = (int)((value-bin_start)/bin_size);
>>                
>>                    //////////////////////////////////////////////////////////
>>                    // Bin counter increment
>>                    //////////////////////////////////////////////////////////
>>     
>>                    // If the value was found, increment the counter for that bin.
>>                    if (bin_index < bin_count)
>>                    {
>>                        results[bin_index]++;
>>                    }    
>>                    PyArray_ITER_NEXT(iter);
>>                }    
>>            }
>>            """
>>     weave.inline(code, ['ary', 'bin_start', 'bin_size','bin_count', 'results'], 
>>                  type_converters=converters, 
>>                  compiler='gcc')
>>                  
>>     return results
>>   
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> Numpy-discussion mailing list
>> Numpy-discussion at scipy.org
>> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>>   
>>     
>
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion at scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>
>   

-------------- next part --------------
A non-text attachment was scrubbed...
Name: histogram.zip
Type: application/x-zip-compressed
Size: 5823 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20061214/b1ccb958/attachment.bin>