[Numpy-discussion] Enum/Factor NEP (now with code)

Wed Jun 13 17:19:57 EDT 2012

On 6/13/12 1:54 PM, Wes McKinney wrote:
> OK, I need to spend some time on this as it will directly impact me.
> Random thoughts here.
>
> It looks like the levels can only be strings. This is too limited for
> my needs. Why not support all possible NumPy dtypes? In pandas world,
> the levels can be any unique Index object (note, I'm going to change
> the name of the Factor class to Categorical before 0.8.0 final per
> discussion with Nathaniel):

The for-discussion prototype currently supports only strings. I had 
mentioned integral levels in the NEP but wanted to get more feedback 
first. It looks like you are using intervals as levels in things like 
qcut? This would add some complexity. I can think of a couple of 
possible approaches; I will have to try a few of them out to see what 
would make the most sense.
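For reference, one way interval levels could be represented is as integer codes plus a list of interval labels. The following is only a sketch in plain NumPy, not part of the prototype and not how pandas implements qcut; the helper name `interval_codes` is hypothetical.

```python
import numpy as np

def interval_codes(values, quantiles):
    """Hypothetical helper: bin values into right-closed quantile intervals.

    Returns integer codes (one per value) and one string label per interval.
    """
    values = np.asarray(values, dtype=float)
    edges = np.quantile(values, quantiles)  # ascending bin edges
    # Searching only the interior edges with side='left' gives
    # right-closed bins: a value equal to an edge goes to the lower bin.
    codes = np.searchsorted(edges[1:-1], values, side='left')
    labels = ['(%g, %g]' % (lo, hi) for lo, hi in zip(edges[:-1], edges[1:])]
    return codes, labels

codes, labels = interval_codes(np.arange(1, 9), [0, 0.25, 0.5, 0.75, 1])
```

Note that the minimum value lands in the first bin even though its label reads as open on the left; a real implementation would need to decide how to handle that edge.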

> The API for constructing an enum/factor/categorical array from fixed
> levels and an array of labels seems somewhat weak to me. A very common
> scenario is to need to construct a factor from an array of integers
> with an associated array of levels:
>
>
> In [13]: labels
> Out[13]:
> array([6, 7, 3, 8, 8, 6, 7, 4, 8, 4, 2, 8, 8, 4, 8, 8, 1, 9, 5, 9, 6, 5, 7,
>         1, 6, 5, 2, 0, 4, 4, 1, 8, 6, 0, 1, 5, 9, 6, 0, 2, 1, 5, 8, 9, 6, 8,
>         0, 1, 9, 5, 8, 6, 3, 4, 3, 3, 8, 7, 8, 2, 9, 8, 9, 9, 5, 0, 5, 2, 1,
>         0, 2, 2, 0, 5, 4, 7, 6, 5, 0, 7, 3, 5, 6, 0, 6, 2, 5, 1, 5, 6, 3, 8,
>         7, 9, 7, 3, 3, 0, 4, 4])
>
> In [14]: levels
> Out[14]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>
> In [15]: Factor(labels, levels)
> Out[15]:
> Factor:
> array([6, 7, 3, 8, 8, 6, 7, 4, 8, 4, 2, 8, 8, 4, 8, 8, 1, 9, 5, 9, 6, 5, 7,
>         1, 6, 5, 2, 0, 4, 4, 1, 8, 6, 0, 1, 5, 9, 6, 0, 2, 1, 5, 8, 9, 6, 8,
>         0, 1, 9, 5, 8, 6, 3, 4, 3, 3, 8, 7, 8, 2, 9, 8, 9, 9, 5, 0, 5, 2, 1,
>         0, 2, 2, 0, 5, 4, 7, 6, 5, 0, 7, 3, 5, 6, 0, 6, 2, 5, 1, 5, 6, 3, 8,
>         7, 9, 7, 3, 3, 0, 4, 4])
> Levels (10): array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

I originally had a very similar interface in the NEP. I was persuaded by 
Mark that this would be redundant:

In [10]: levels = np.factor(['a', 'b', 'c'])
         # or: levels = np.factor_array(['a', 'b', 'c', 'a', 'b']).dtype
In [11]: np.array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'], levels)
Out[11]: array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'],
               dtype='factor({'c': 2, 'a': 0, 'b': 1})')

This could also be spelled even closer to your example as:

labels.astype(levels)

but I have not done much with casting yet, so this currently complains. 
However, would this satisfy your needs (modulo the separate question 
about more general integral or object levels)?
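To make the labels-plus-levels relationship concrete, here is a minimal sketch using only existing NumPy calls rather than the prototype's np.factor API. The data is illustrative, and the encoding step assumes the levels are sorted and that every value occurs in them.

```python
import numpy as np

# Illustrative data, not taken from the prototype.
levels = np.array(['a', 'b', 'c'])               # sorted, unique
labels = np.array([1, 1, 0, 2, 2, 0, 0, 0, 1])   # integer codes into levels

# Codes -> values: a plain fancy-indexing lookup.
values = levels[labels]

# Values -> codes: binary search against the sorted levels.
# Assumes every value is present in levels; an unknown value would
# silently receive a wrong code here.
codes = np.searchsorted(levels, values)
```

The round trip recovers the original labels, which is the property a cast such as labels.astype(levels) would presumably need to preserve.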

> What is the story for NA values (NaL?) in a factor array? I code them
> as -1 in the labels, though you could use INT32_MAX or something. This
> is very important in the context of groupby operations.

I am just using INT32_MIN at the moment.

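As a sketch of what a sentinel-coded NA means for downstream code such as group counts (plain NumPy on an array of integer codes, not the prototype's internals; the name `NA_CODE` is made up for illustration):

```python
import numpy as np

# Assumed sentinel: the smallest int32, standing in for a missing label.
NA_CODE = np.iinfo(np.int32).min

levels = np.array(['a', 'b', 'c'])
codes = np.array([1, NA_CODE, 0, 2, NA_CODE, 0], dtype=np.int32)

# The sentinel must be masked out before the codes are used as indices.
valid = codes != NA_CODE
counts = np.bincount(codes[valid], minlength=len(levels))
```

Whichever sentinel is chosen (-1 or INT32_MIN), any operation that indexes by code has to filter it first, which is why it matters for groupby.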
> Are the levels ordered (Nathaniel brought this up already looks like)?
> It doesn't look like it. That is also necessary. You also need to be

They currently compare based on their value:

In [20]: arr = np.array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'],
                        np.factor({'c':0, 'b':1, 'a':2}))
In [21]: arr
Out[21]: array(['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b'],
               dtype='factor({'c': 0, 'a': 2, 'b': 1})')
In [22]: arr.sort()
In [23]: arr
Out[23]: array(['c', 'c', 'b', 'b', 'b', 'a', 'a', 'a', 'a'],
               dtype='factor({'c': 0, 'a': 2, 'b': 1})')
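The same ordering can be reproduced with ordinary NumPy arrays, which may help when reasoning about what "compare based on their value" means. This is an emulation under the assumption that sorting happens on the integer codes; it does not use or describe the prototype's implementation.

```python
import numpy as np

# Same level-to-code mapping as in the session above.
mapping = {'c': 0, 'b': 1, 'a': 2}
data = ['b', 'b', 'a', 'c', 'c', 'a', 'a', 'a', 'b']

# Encode each value as its integer code.
codes = np.array([mapping[x] for x in data])

# Level names arranged so that names[code] gives the level.
names = np.array(sorted(mapping, key=mapping.get))

# Sort by code, then decode back to level names.
sorted_values = names[np.sort(codes)]
```

Sorting by code puts 'c' first and 'a' last, the reverse of the usual string order.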

> able to sort the levels (which is a relabeling, I have lots of code in
> use for this). In the context of groupby in pandas, when processing a
> key (array of values) to a factor to be used for aggregating some
> data, you have the option of returning an object that has the levels
> as observed in the data or sorting. Sorting can obviously be very
> expensive depending on the number of groups in the data
> (http://wesmckinney.com/blog/?p=437). Example:
>
> from pandas import DataFrame
> from pandas.util.testing import rands
> import numpy as np
>
> df = DataFrame({'key' : [rands(10) for _ in xrange(100000)] * 10,
>             'data' : np.random.randn(1000000)})
>
> In [32]: timeit df.groupby('key').sum()
> 1 loops, best of 3: 374 ms per loop
>
> In [33]: timeit df.groupby('key', sort=False).sum()
> 10 loops, best of 3: 185 ms per loop
>
> The "factorization time" for the `key` column dominates the runtime;
> the factor is computed once then reused if you keep the GroupBy object
> around:
>
> In [36]: timeit grouped.sum()
> 100 loops, best of 3: 6.05 ms per loop

Just some numbers for comparison. Factorization times:

In [41]: lets = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
In [42]: levels = np.factor(lets)
In [43]: data = [lets[int(x)] for x in np.random.randn(1000000)]
In [44]: %timeit np.array(data, levels)
10 loops, best of 3: 137 ms per loop

And retrieving group indices/summing:

In [8]: %timeit arr=='a'
1000 loops, best of 3: 1.52 ms per loop
In [10]: vals = np.random.randn(1000000)
In [20]: inds = [arr==x for x in lets]
In [23]: %timeit for ind in inds: vals[ind].sum()
10 loops, best of 3: 48.3 ms per loop

On my laptop your grouped.sum() took 22 ms, so this is off by roughly 
a factor of two. But we should compare it on the same hardware, and 
with the same level data types. There is no doubt room for 
improvement, though.

It would not be too bad to add some groupby functionality on top of 
this. I still need to add a mechanism for accessing and iterating over 
the levels.
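One possible starting point for groupby-style sums, assuming the integer codes behind the factor array were accessible (they are not exposed in the prototype yet), is np.bincount with weights. The sketch below uses synthetic codes and only checks that it agrees with the per-level boolean-mask loop timed above; no timing claim is implied.

```python
import numpy as np

rng = np.random.RandomState(0)
n_levels = 10
n = 100000

# Synthetic stand-ins for the factor's integer codes and the values to sum.
codes = rng.randint(0, n_levels, size=n)
vals = rng.randn(n)

# Approach timed above: one boolean mask per level.
mask_sums = np.array([vals[codes == k].sum() for k in range(n_levels)])

# Alternative: accumulate all group sums in a single pass over the codes.
bincount_sums = np.bincount(codes, weights=vals, minlength=n_levels)

agree = np.allclose(mask_sums, bincount_sums)
```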

> As another example of why ordered factors matter, consider a quantile
> cut (google for the "cut" function in R) function I wrote recently:
>
>
> In [40]: arr = Series(np.random.randn(1000000))
>
> In [41]: cats = qcut(arr, [0, 0.25, 0.5, 0.75, 1])
>
> In [43]: arr.groupby(cats).describe().unstack(0)
> Out[43]:
>         (-4.85, -0.673]  (-0.673, 0.00199]  (0.00199, 0.677]  (0.677, 4.914]
> count    250000.000000      250000.000000     250000.000000   250000.000000
> mean         -1.270623          -0.323092          0.326325        1.271519
> std           0.491317           0.193254          0.193044        0.490611
> min          -4.839798          -0.673224          0.001992        0.677177
> 25%          -1.533021          -0.487450          0.158736        0.888502
> 50%          -1.150136          -0.317501          0.320352        1.150480
> 75%          -0.887974          -0.155197          0.490456        1.534709
> max          -0.673224           0.001990          0.677176        4.913536
>
> If you don't have ordered levels, then the quantiles might come out in
> the wrong order depending on how the strings sort or fall out of the
> hash table.

We do have ordered levels. :) Now, there's currently no way to get a 
list of the levels, in order, but that should be trivial to add.
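For illustration, the relabeling between levels in observed order and levels in sorted order (the two groupby options mentioned in the quoted text) can be sketched with np.unique. This is plain NumPy on a small made-up array, not the prototype's API.

```python
import numpy as np

data = np.array(['b', 'c', 'b', 'a', 'c'])

# Sorted levels and matching codes in one call.
sorted_levels, sorted_codes = np.unique(data, return_inverse=True)

# Position of each sorted level's first appearance in the data.
_, first_idx = np.unique(data, return_index=True)

# Sorted-level positions arranged in order of first appearance.
order = np.argsort(first_idx)
observed_levels = sorted_levels[order]

# Relabel: remap[sorted_code] gives the code under observed ordering.
remap = np.empty_like(order)
remap[order] = np.arange(len(order))
observed_codes = remap[sorted_codes]
```

Both code arrays decode to the same values; only the numbering of the levels differs.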

> Nathaniel: my experience (see blog posting above for a bit more) is
> that khash really crushes PyDict for two reasons: you can use it with
> primitive types and avoid boxing, and secondly you can preallocate.
> Its memory footprint with large hashtables is also a fraction of
> PyDict. The Python memory allocator is not problematic-- if you create
> millions of Python objects expect the RAM usage of the Python process
> to balloon absurdly.
>
> Anyway, this is exciting work assuming we get the API right and
> hitting all the use cases. On top of all this I am _very_ performance
> sensitive so you'll have to be pretty aggressive with benchmarking
> things. I have concerns about ceding control over critical
> functionality that I need for pandas (which has become a large and
> very important library these days for a lot of people), but as long as
> the pieces in NumPy are suitably mature and robust for me to switch to
> them eventually that would be great.
>
> I'll do my best to stay involved in the discussion, though I'm
> juggling a lot of things these days (e.g. I have the PyData book
> deadline approaching like a freight train).
>
> - Wes