[Numpy-discussion] Status of np.bincount

Tony Yu tsyu80 at gmail.com
Thu May 3 12:51:30 EDT 2012


On Thu, May 3, 2012 at 9:57 AM, Robert Kern <robert.kern at gmail.com> wrote:

> On Thu, May 3, 2012 at 2:50 PM, Robert Elsner <mlist at re-factory.de> wrote:
> >
> > On 03.05.2012 15:45, Robert Kern wrote:
> >> On Thu, May 3, 2012 at 2:24 PM, Robert Elsner <mlist at re-factory.de> wrote:
> >>> Hello Everybody,
> >>>
> >>> is there any news on the status of np.bincount with respect to "big"
> >>> numbers? It seems I have just been bitten by #225. Is there an efficient
> >>> way around it? I found the np.histogram function painfully slow.
> >>>
> >>> Below is a simple script that demonstrates bincount failing with a
> >>> memory error on big numbers:
> >>>
> >>> import numpy as np
> >>>
> >>> x = np.array((30e9,)).astype(int)
> >>> np.bincount(x)
> >>>
> >>>
> >>> Any good ideas on how to work around it? My arrays contain roughly 50M
> >>> entries in the range from 0 to 30e9, and I would like to have them
> >>> bincounted...
> >>
> >> You need a sparse data structure, then. Are you sure you even have
> >> duplicates?
> >>
> >> Anyways, I won't work out all of the details, but let me sketch
> >> something that might get you your answers. First, sort your array.
> >> Then use np.not_equal(x[:-1], x[1:]) as a mask on np.arange(1,len(x))
> >> to find the indices where each sorted value changes over to the next.
> >> The np.diff() of that should give you the size of each. Use np.unique
> >> to get the sorted unique values to match up with those sizes.
> >>
> >> Fixing all of the off-by-one errors and dealing with the boundary
> >> conditions correctly is left as an exercise for the reader.
> >>
> >
> > ?? I suspect that this mail was meant to end up in the thread about
> > sparse array data?
>
> No, I am responding to you.
>
>
Hi Robert (Elsner),

Just to expand a bit on Robert Kern's explanation: Your problem is only
partly related to Ticket #225 <http://projects.scipy.org/numpy/ticket/225>.
Even if that is fixed, you won't be able to call `bincount` with an array
containing `30e9` unless you implement something using sparse arrays,
because `bincount` wants to return an array that's `30e9 + 1` in length. At
8 bytes per count on a 64-bit build, that is roughly 240 GB, which isn't
going to happen.
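
For illustration, here is a minimal sketch of the sort-based approach Robert
Kern outlines above, with the boundary cases filled in. The function name
`sparse_bincount` and the choice to return a (values, counts) pair are my own
assumptions, not anything NumPy provides under that name. Memory use scales
with the number of entries, not with the largest value.

```python
import numpy as np

def sparse_bincount(x):
    """Return (values, counts) for a 1-D array of non-negative integers.

    Hypothetical helper: counts only the values that actually occur,
    so no array of length max(x) + 1 is ever allocated.
    """
    x = np.sort(np.asarray(x, dtype=np.int64).ravel())
    if x.size == 0:
        return x, np.zeros(0, dtype=np.intp)
    # True wherever a sorted value differs from its predecessor.
    change = np.not_equal(x[:-1], x[1:])
    # Start index of every run of equal values; the first run starts at 0.
    starts = np.concatenate(([0], np.arange(1, len(x))[change]))
    # Run lengths: gaps between consecutive starts, closed off by len(x).
    counts = np.diff(np.concatenate((starts, [len(x)])))
    return x[starts], counts

# Example with a value far too large for np.bincount.
data = np.array([5, 30_000_000_000, 5, 7, 30_000_000_000, 5], dtype=np.int64)
values, counts = sparse_bincount(data)
print(values, counts)
```

NumPy 1.9 and later can do the same thing directly with
`np.unique(x, return_counts=True)`, which should give matching results.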

-Tony