[Numpy-discussion] Status of np.bincount

Robert Kern robert.kern at gmail.com
Thu May 3 09:45:44 EDT 2012


On Thu, May 3, 2012 at 2:24 PM, Robert Elsner <mlist at re-factory.de> wrote:
> Hello Everybody,
>
> is there any news on the status of np.bincount with respect to "big"
> numbers? It seems I have just been bitten by #225. Is there an efficient
> way around it? I found the np.histogram function painfully slow.
>
> Below is a simple script that demonstrates bincount failing with a
> memory error on big numbers:
>
> import numpy as np
>
> x = np.array((30e9,)).astype(int)  # a single value of 30e9
> np.bincount(x)  # tries to allocate one bin per integer up to 30e9 -> MemoryError
>
>
> Any good ideas on how to work around it? My arrays contain around 50M
> entries in the range from 0 to 30e9, and I would like to have them
> bincounted...

You need a sparse data structure, then. Are you sure you even have duplicates?
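For instance, a plain dict-based Counter from the standard library stores
a count only for the values that actually occur, so the huge 0..30e9 value
range costs nothing. A minimal illustration (added here for concreteness;
the sample data is made up, not from the original message):

from collections import Counter

# Only observed values get an entry; bins for absent values are never allocated.
x = [7, 30 * 10**9, 7, 42]
counts = Counter(x)
print(counts[7], counts[42], counts[30 * 10**9])  # -> 2 1 1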

Anyway, I won't work out all of the details, but let me sketch
something that might get you your answers. First, sort your array.
Then use np.not_equal(x[:-1], x[1:]) as a mask on np.arange(1, len(x))
to find the indices where each sorted value changes over to the next.
The np.diff() of that should give you the size of each run. Use
np.unique to get the sorted unique values to match up with those sizes.

Fixing all of the off-by-one errors and dealing with the boundary
conditions correctly is left as an exercise for the reader.
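To make the sketch concrete, here is one way those details might be
filled in. This is a minimal example added for illustration (the sample
data and variable names are made up), not code from the original message:

import numpy as np

# Small stand-in for the 50M-entry array described above.
x = np.array([7, 30 * 10**9, 7, 42, 42, 42], dtype=np.int64)

s = np.sort(x)
# Indices where the sorted value changes over to the next one.
change = np.arange(1, len(s))[np.not_equal(s[:-1], s[1:])]
# Bracket with 0 and len(s) so np.diff yields the length of every run,
# including the first and the last.
edges = np.concatenate(([0], change, [len(s)]))
counts = np.diff(edges)
values = np.unique(x)  # sorted unique values, aligned with counts
print(dict(zip(values.tolist(), counts.tolist())))
# -> {7: 2, 42: 3, 30000000000: 1}

Memory use is proportional to the number of entries rather than to the
largest value, which is what makes this workable where bincount is not.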

-- 
Robert Kern


