[Numpy-discussion] Status of np.bincount

Robert Elsner mlist at re-factory.de
Thu May 3 09:50:30 EDT 2012


On 03.05.2012 15:45, Robert Kern wrote:
> On Thu, May 3, 2012 at 2:24 PM, Robert Elsner <mlist at re-factory.de> wrote:
>> Hello Everybody,
>>
>> Is there any news on the status of np.bincount with respect to "big"
>> numbers? It seems I have just been bitten by #225. Is there an efficient
>> way around it? I found the np.histogram function painfully slow.
>>
>> Below is a simple script that demonstrates bincount failing with a
>> MemoryError on big numbers:
>>
>> import numpy as np
>>
>> # A single value of 30e9 makes bincount try to allocate an output
>> # array with about 30e9 + 1 slots, which raises a MemoryError.
>> x = np.array((30e9,)).astype(int)
>> np.bincount(x)
>>
>>
>> Any good idea how to work around it? My arrays contain around 50M
>> entries in the range from 0 to 30e9, and I would like to have them
>> bincounted...
> 
> You need a sparse data structure, then. Are you sure you even have duplicates?
> 
> Anyway, I won't work out all of the details, but let me sketch
> something that might get you your answers. First, sort your array.
> Then use np.not_equal(x[:-1], x[1:]) as a mask on np.arange(1, len(x))
> to find the indices where each sorted value changes over to the next.
> The np.diff() of that should give you the size of each run. Use
> np.unique to get the sorted unique values to match up with those sizes.
> 
> Fixing all of the off-by-one errors and dealing with the boundary
> conditions correctly is left as an exercise for the reader.
> 

Hmm... I suspect that this mail was meant to end up in the thread about
sparse array data?
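
For reference, a worked-out version of the sketch quoted above might look
like the following. This is a minimal, illustrative sketch only:
sparse_bincount is just a name chosen here, not a NumPy function, and the
boundary handling is exactly the part Robert left as an exercise.

import numpy as np

def sparse_bincount(x):
    # Sort so that equal values end up adjacent.
    s = np.sort(x)
    # Indices (into the sorted array) where the value changes
    # from one run of equal values to the next.
    boundaries = np.arange(1, len(s))[np.not_equal(s[:-1], s[1:])]
    # Bracket the change points with 0 and len(s) so that np.diff()
    # yields the length of every run, including the first and the last.
    edges = np.concatenate(([0], boundaries, [len(s)]))
    counts = np.diff(edges)
    values = s[edges[:-1]]  # one representative element per run
    return values, counts

values, counts = sparse_bincount(np.array([3, 30000000000, 3, 7]))
# values -> [3, 7, 30000000000], counts -> [2, 1, 1]

The (values, counts) pair is effectively the sparse data structure
suggested above: only values that actually occur get a slot, so the
full 0..30e9 range is never materialized.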
