[Numpy-discussion] Status of np.bincount
Wes McKinney
wesmckinn at gmail.com
Thu May 3 13:36:23 EDT 2012
On Thu, May 3, 2012 at 12:51 PM, Tony Yu <tsyu80 at gmail.com> wrote:
>
>
> On Thu, May 3, 2012 at 9:57 AM, Robert Kern <robert.kern at gmail.com> wrote:
>>
>> On Thu, May 3, 2012 at 2:50 PM, Robert Elsner <mlist at re-factory.de> wrote:
>> >
>> > Am 03.05.2012 15:45, schrieb Robert Kern:
>> >> On Thu, May 3, 2012 at 2:24 PM, Robert Elsner <mlist at re-factory.de>
>> >> wrote:
>> >>> Hello Everybody,
>> >>>
>> >>> is there any news on the status of np.bincount with respect to "big"
>> >>> numbers? It seems I have just been bitten by #225. Is there an
>> >>> efficient
>> >>> way around? I found the np.histogram function painfully slow.
>> >>>
>> >>> Below a simple script, that demonstrates bincount failing with a
>> >>> memory
>> >>> error on big numbers
>> >>>
>> >>> import numpy as np
>> >>>
>> >>> x = np.array((30e9,)).astype(int)
>> >>> np.bincount(x)
>> >>>
>> >>>
>> >>> Any good idea how to work around it? My arrays contain somewhere around
>> >>> 50M entries in the range from 0 to 30e9, and I would like to have them
>> >>> bincounted...
>> >>
>> >> You need a sparse data structure, then. Are you sure you even have
>> >> duplicates?
>> >>
>> >> Anyways, I won't work out all of the details, but let me sketch
>> >> something that might get you your answers. First, sort your array.
>> >> Then use np.not_equal(x[:-1], x[1:]) as a mask on np.arange(1,len(x))
>> >> to find the indices where each sorted value changes over to the next.
>> >> The np.diff() of that should give you the size of each. Use np.unique
>> >> to get the sorted unique values to match up with those sizes.
>> >>
>> >> Fixing all of the off-by-one errors and dealing with the boundary
>> >> conditions correctly is left as an exercise for the reader.
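A minimal sketch of the sort-and-diff approach Robert describes, with the boundary handling worked out (variable names and the toy input are illustrative, not from the thread):

```python
import numpy as np

# Sketch of the sort-based "sparse bincount" described above.
x = np.array([3, 30_000_000_000, 7, 3, 7, 3], dtype=np.int64)

s = np.sort(x)
# positions where the sorted value changes over to the next distinct value
change = np.flatnonzero(s[1:] != s[:-1]) + 1
# prepend 0 and append len(s) so np.diff yields each run's length
boundaries = np.concatenate(([0], change, [len(s)]))
counts = np.diff(boundaries)
values = s[boundaries[:-1]]  # the distinct values, in sorted order
# values -> [3, 7, 30000000000]; counts -> [3, 2, 1]
```

This needs O(n log n) time for the sort but only O(n) memory, independent of the magnitude of the values.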
>> >>
>> >
>> > ?? I suspect that this mail was meant to end up in the thread about
>> > sparse array data?
>>
>> No, I am responding to you.
>>
>
> Hi Robert (Elsner),
>
> Just to expand a bit on Robert Kern's explanation: Your problem is only
> partly related to Ticket #225. Even if that is fixed, you won't be able to
> call `bincount` with an array containing `30e9` unless you implement
> something using sparse arrays, because `bincount` wants to return an array
> that's `30e9 + 1` in length, which isn't going to happen.
>
> -Tony
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
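Tony's point about the output length can be made concrete with back-of-the-envelope arithmetic (8-byte counts assumed for illustration):

```python
# bincount returns one count per integer from 0 to max(x), so even with
# ticket #225 fixed, the dense result array alone would need roughly:
n_bins = int(30e9) + 1
bytes_needed = n_bins * 8       # assuming 8-byte integer counts
gigabytes = bytes_needed / 1e9  # ~240 GB just for the output
```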
hi Robert,
I suggest you try the value_counts instance method on pandas.Series (after
`import numpy as np` and `from pandas import Series`):
In [9]: ints = np.random.randint(0, 30e9, size=100000)
In [10]: all_ints = Series(ints.repeat(500))
In [11]: all_ints.value_counts()
Out[11]:
16420382874 500
7147863689 500
4019588415 500
17462388002 500
11956087699 500
14888898988 500
3811318398 500
6333517765 500
16077665866 500
17559759721 500
5898309082 500
25213150655 500
17877388690 500
3122117900 500
6242860212 500
...
6344689036 500
16817048573 500
16361777055 500
4376828961 500
15910505187 500
12051499627 500
23857610954 500
24557975709 500
28135006018 500
1661624653 500
6747702840 500
24601775145 500
7290769930 500
9417075109 500
12071596222 500
Length: 100000
This method uses a C hash table and takes about 1 second to compute
the bin counts for 50M entries and 100k unique values.
- Wes
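For readers without pandas handy, the same hash-table counting idea can be sketched with the standard library's collections.Counter; it is slower than the C implementation Wes mentions, but memory scales with the number of distinct values rather than their magnitude (the data below is synthetic, chosen so each value repeats exactly 5 times):

```python
from collections import Counter
import numpy as np

# Hash-based counting as a stand-in for Series.value_counts:
# 1000 distinct values spread over the 0..30e9 range, each repeated 5 times.
vals = (np.arange(1000, dtype=np.int64) * 29_999_999).repeat(5)
counts = Counter(vals.tolist())
# counts maps each distinct value to 5
```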