[Numpy-discussion] Status of np.bincount

Robert Kern robert.kern at gmail.com
Thu May 3 09:57:30 EDT 2012


On Thu, May 3, 2012 at 2:50 PM, Robert Elsner <mlist at re-factory.de> wrote:
>
> Am 03.05.2012 15:45, schrieb Robert Kern:
>> On Thu, May 3, 2012 at 2:24 PM, Robert Elsner <mlist at re-factory.de> wrote:
>>> Hello Everybody,
>>>
>>> is there any news on the status of np.bincount with respect to "big"
>>> numbers? It seems I have just been bitten by #225. Is there an efficient
>>> way around it? I found the np.histogram function painfully slow.
>>>
>>> Below is a simple script that demonstrates bincount failing with a
>>> memory error on big numbers:
>>>
>>> import numpy as np
>>>
>>> x = np.array((30e9,)).astype(int)
>>> np.bincount(x)
>>>
>>>
>>> Any good idea how to work around it? My arrays contain around 50M
>>> entries in the range from 0 to 30e9, and I would like to have them
>>> bincounted...
>>
>> You need a sparse data structure, then. Are you sure you even have duplicates?
>>
>> Anyway, I won't work out all of the details, but let me sketch
>> something that might get you your answers. First, sort your array.
>> Then use np.not_equal(x[:-1], x[1:]) as a mask on np.arange(1, len(x))
>> to find the indices where each sorted value changes over to the next.
>> The np.diff() of that should give you the size of each run. Use
>> np.unique to get the sorted unique values to match up with those sizes.
>>
>> Fixing all of the off-by-one errors and dealing with the boundary
>> conditions correctly is left as an exercise for the reader.
>>
>
> ?? I suspect this mail was meant for the thread about sparse
> array data?

No, I am responding to you.
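
To make that concrete, here is a minimal, untested sketch of the
approach outlined above, using illustrative data (the boundary
handling is the part that was left as an exercise):

import numpy as np

# Illustrative data: values far too large for bincount's dense output.
x = np.array([7, int(30e9), 7, 42, int(30e9)], dtype=np.int64)

# Sort so that equal values form contiguous runs.
x = np.sort(x)

# Indices where the sorted value changes over to the next.
changes = np.nonzero(np.not_equal(x[:-1], x[1:]))[0] + 1

# Run boundaries: start of the array, each change point, the end.
bounds = np.concatenate(([0], changes, [len(x)]))

# np.diff of the boundaries gives the count of each run, and
# np.unique gives the sorted unique values the counts belong to.
counts = np.diff(bounds)
values = np.unique(x)

print(values)   # [          7          42 30000000000]
print(counts)   # [2 1 2]

Memory use is then proportional to the number of distinct values
rather than to the maximum value, which is what makes this workable
for values up to 30e9.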

-- 
Robert Kern
