[Numpy-discussion] adding a cut function to numpy

Mon Apr 16 18:01:29 EDT 2012

On Mon, Apr 16, 2012 at 5:51 PM, Tony Yu <tsyu80 at gmail.com> wrote:
>
>
> On Mon, Apr 16, 2012 at 5:27 PM, Skipper Seabold <jsseabold at gmail.com>
> wrote:
>>
>> Hi,
>>
>> I have a pull request here [1] to add a cut function similar to R's
>> [2]. It seems there are often requests for similar functionality. It's
>> something I'm making use of for my own work and would like to use in
>> statsmodels and in generating instances of pandas' Factor class, but
>> is this generally something people would find useful to warrant its
>> inclusion in numpy? It will be even more useful I think with an enum
>> dtype in numpy.
>>
>> If you aren't familiar with cut, here's a potential use case: going
>> from a continuous to a categorical variable.
>>
>> Given a continuous variable
>>
>> [~/]
>> [8]: age = np.random.randint(15,70, size=100)
>>
>> [~/]
>> [9]: age
>> [9]:
>> array([58, 32, 20, 25, 34, 69, 52, 27, 20, 23, 51, 61, 39, 54, 39, 44, 27,
>>       17, 29, 18, 66, 25, 44, 21, 54, 32, 50, 60, 25, 41, 68, 25, 42, 69,
>>       50, 69, 24, 69, 69, 48, 30, 20, 18, 15, 50, 48, 44, 27, 57, 52, 40,
>>       27, 58, 45, 44, 32, 54, 19, 36, 32, 55, 17, 55, 15, 19, 29, 22, 25,
>>       36, 44, 29, 53, 37, 31, 51, 39, 21, 66, 25, 26, 20, 17, 41, 50, 27,
>>       23, 62, 69, 65, 34, 38, 61, 39, 34, 38, 35, 18, 36, 29, 26])
>>
>> Give me a variable where people are in age groups (lower bound is not
>> inclusive)
>>
>> [~/]
>> [10]: groups = [14, 25, 35, 45, 55, 70]
>>
>> [~/]
>> [11]: age_cat = np.cut(age, groups)
>>
>> [~/]
>> [12]: age_cat
>> [12]:
>> array([5, 2, 1, 1, 2, 5, 4, 2, 1, 1, 4, 5, 3, 4, 3, 3, 2, 1, 2, 1, 5, 1, 3,
>>       1, 4, 2, 4, 5, 1, 3, 5, 1, 3, 5, 4, 5, 1, 5, 5, 4, 2, 1, 1, 1, 4, 4,
>>       3, 2, 5, 4, 3, 2, 5, 3, 3, 2, 4, 1, 3, 2, 4, 1, 4, 1, 1, 2, 1, 1, 3,
>>       3, 2, 4, 3, 2, 4, 3, 1, 5, 1, 2, 1, 1, 3, 4, 2, 1, 5, 5, 5, 2, 3, 5,
>>       3, 2, 3, 2, 1, 3, 2, 2])
>>
>> Skipper
>>
>> [1] https://github.com/numpy/numpy/pull/248
>> [2] http://stat.ethz.ch/R-manual/R-devel/library/base/html/cut.html
>
>
> Is this the same as `np.searchsorted` (with reversed arguments)?
>
> In [292]: np.searchsorted(groups, age)
> Out[292]:
> array([5, 2, 1, 1, 2, 5, 4, 2, 1, 1, 4, 5, 3, 4, 3, 3, 2, 1, 2, 1, 5, 1, 3,
>        1, 4, 2, 4, 5, 1, 3, 5, 1, 3, 5, 4, 5, 1, 5, 5, 4, 2, 1, 1, 1, 4, 4,
>        3, 2, 5, 4, 3, 2, 5, 3, 3, 2, 4, 1, 3, 2, 4, 1, 4, 1, 1, 2, 1, 1, 3,
>        3, 2, 4, 3, 2, 4, 3, 1, 5, 1, 2, 1, 1, 3, 4, 2, 1, 5, 5, 5, 2, 3, 5,
>        3, 2, 3, 2, 1, 3, 2, 2])
>

That's news to me, and I don't know how I missed it. It looks like
there is overlap, but cut will also do equal-width binning when given
a number of bins instead of explicit edges:

[~/]
[21]: np.cut(age, 6)
[21]:
array([5, 2, 1, 2, 3, 6, 5, 2, 1, 1, 4, 6, 3, 5, 3, 4, 2, 1, 2, 1, 6, 2, 4,
       1, 5, 2, 4, 5, 2, 3, 6, 2, 3, 6, 4, 6, 1, 6, 6, 4, 2, 1, 1, 1, 4, 4,
       4, 2, 5, 5, 3, 2, 5, 4, 4, 2, 5, 1, 3, 2, 5, 1, 5, 1, 1, 2, 1, 2, 3,
       4, 2, 5, 3, 2, 4, 3, 1, 6, 2, 2, 1, 1, 3, 4, 2, 1, 6, 6, 6, 3, 3, 6,
       3, 3, 3, 3, 1, 3, 2, 2])
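
For comparison, equal-width binning can be approximated on top of
np.searchsorted. The following is only a rough sketch, not the code in
the pull request: equal_width_codes is a hypothetical helper, and the
0.1% widening of the outer edges is an assumption borrowed from the
behaviour R documents for cut. It assumes x is not constant.

```python
import numpy as np

def equal_width_codes(x, nbins):
    """Hypothetical helper: 1-based codes for nbins equal-width bins."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    # nbins bins need nbins + 1 edges spanning the data range.
    edges = np.linspace(lo, hi, nbins + 1)
    # Assumption: push the outer edges out by 0.1% of the range so the
    # minimum and maximum both fall strictly inside an interval.
    pad = (hi - lo) / 1000.0
    edges[0] -= pad
    edges[-1] += pad
    # side='left' makes each interval open below and closed above.
    return np.searchsorted(edges, x, side='left')

sample = np.array([15, 24, 25, 60, 69])
codes = equal_width_codes(sample, 6)
```

With six bins over 15..69 the interior edges fall at 24, 33, 42, 51 and
60, so 24 lands in the first bin and 25 in the second.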

and it explicitly handles the case of constant x:

[~/]
[26]: x = np.ones(100)*6

[~/]
[27]: np.cut(x, 5)
[27]:
array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3])
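
One way the constant case could be handled when building edges by hand
is sketched below. This is an assumption about a reasonable approach,
not a description of the pull request: constant_safe_edges is a
hypothetical name, and centring a narrow set of bins on the single
value is just one possible choice.

```python
import numpy as np

def constant_safe_edges(x, nbins):
    """Hypothetical helper: nbins + 1 edges, even when x is constant."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    if lo == hi:
        # Zero range: build a narrow window around the single value so
        # that it ends up in one of the middle bins.
        pad = abs(lo) / 1000.0 if lo != 0 else 1.0 / 1000.0
        return np.linspace(lo - pad, hi + pad, nbins + 1)
    edges = np.linspace(lo, hi, nbins + 1)
    # Non-constant data: widen only the outer edges by 0.1% of the range.
    pad = (hi - lo) / 1000.0
    edges[0] -= pad
    edges[-1] += pad
    return edges

x = np.ones(100) * 6
codes = np.searchsorted(constant_safe_edges(x, 5), x, side='left')
```

For x equal to 6 everywhere and five bins, every element falls in the
third bin under this scheme.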

I guess I could patch searchsorted. Thoughts?
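
To make the overlap concrete for the explicit-bins case, here is a
small check using only np.searchsorted. The variable names are
illustrative; the values are the bin edges and the first few ages from
the example earlier in the thread.

```python
import numpy as np

groups = [14, 25, 35, 45, 55, 70]
ages = np.array([58, 32, 20, 25, 34, 69])

# side='left' (the default): a value equal to an edge goes into the bin
# below it, i.e. intervals are (14, 25], (25, 35], ...
upper_inclusive = np.searchsorted(groups, ages, side='left')

# side='right': a value equal to an edge goes into the bin above it,
# i.e. intervals are [14, 25), [25, 35), ...
lower_inclusive = np.searchsorted(groups, ages, side='right')
```

The two results differ only for the age of 25, which sits exactly on an
edge.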

Skipper