Hello,
I am playing with adding an enum dtype to numpy (to get my feet wet in numpy really). I have looked at https://github.com/martinling/numpy_quaternion and I feel comfortable with my understanding of adding a simple type to numpy in technical terms.
I am mostly a C programmer and have programmed in Python but not at the level where my code could be considered "pretty" or maybe even "pythonic". I know enums from C and have browsed around a few python enum implementations online. Most of them use hash tables or lists to associate names to numbers - these approaches just feel "heavy" to me.
What would be a proper "numpy approach" to this? I am looking mostly for direction and advice as I would like to do the work myself :-)
Any input appreciated :-)
Ognen
On Tue, Jan 3, 2012 at 12:46 PM, Ognen Duzlevski <ognen@enthought.com> wrote:
What would be a proper "numpy approach" to this? I am looking mostly for direction and advice as I would like to do the work myself :-)
Any input appreciated :-) Ognen _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
You should use a hash table internally in my opinion. I've started using khash from klib (https://github.com/attractivechaos/klib) which has excellent memory usage (more than 50% less than Python dict with large hash tables) and good performance characteristics. With the enum dtype you can avoid reference counting with primitive types, not sure about object dtype. If enum arrays are mutable this will be very tricky. - Wes
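As a rough illustration of the hash-table approach suggested here, the sketch below uses a plain Python dict as a stand-in for khash. The function name `factorize` and its return layout are assumptions made for this example, not an existing NumPy API.

```python
import numpy as np

def factorize(values):
    """Encode labels as integer codes using a hash table (a dict here)."""
    table = {}        # label -> integer code (the hash table)
    categories = []   # integer code -> label (needed to print names back)
    codes = np.empty(len(values), dtype=np.int32)
    for i, value in enumerate(values):
        code = table.get(value)
        if code is None:
            # First time this label is seen: assign the next free code.
            code = len(categories)
            table[value] = code
            categories.append(value)
        codes[i] = code
    return codes, categories

codes, categories = factorize(["red", "green", "red", "blue", "green"])
print(codes.tolist())  # [0, 1, 0, 2, 1]
print(categories)      # ['red', 'green', 'blue']
```

Only the integer codes end up in the array; the hash table is needed while encoding, and the list is needed to go from codes back to names.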
On 1/3/2012 10:46 AM, Ognen Duzlevski wrote:
What would be a proper "numpy approach" to this? I am looking mostly for direction and advice as I would like to do the work myself :-)
Does "enumerate" (http://docs.python.org/library/functions.html#enumerate) work for you?
On Tue, Jan 3, 2012 at 1:06 PM, Jim Vickroy <jim.vickroy@noaa.gov> wrote:
Does "enumerate" (http://docs.python.org/library/functions.html#enumerate) work for you?
That's not exactly what he means. The R lingo for this concept is "factor" or, somewhat more commonly, "categorical variable": http://stat.ethz.ch/R-manual/R-patched/library/base/html/factor.html FWIW R's factor type is implemented using hash tables. I do the same in pandas. - Wes
On Tue, Jan 3, 2012 at 12:46 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
That's not exactly what he means. The R lingo for this concept is "factor" or a bit more common "categorical variable":
http://stat.ethz.ch/R-manual/R-patched/library/base/html/factor.html
FWIW R's factor type is implemented using hash tables. I do the same in pandas.
Wes, You are right, "categorical variable" is what I am after. Thanks for the pointer, I will go the klib route you suggested and see what comes out. I may be "old fashioned" a bit in the sense that adding dependencies on external libraries is something I am reluctant to do - this is why I said using hashes may have felt a bit "heavy". But that may be my shortcoming :-) Ognen
On 01/03/2012 06:46 PM, Ognen Duzlevski wrote:
I am mostly a C programmer and have programmed in Python but not at the level where my code could be considered "pretty" or maybe even "pythonic". I know enums from C and have browsed around a few python enum implementations online. Most of them use hash tables or lists to associate names to numbers - these approaches just feel "heavy" to me.
If you want the enum values to be stored efficiently (using 1, 2 or 4-byte integers), and want a mapping between string names and such integers, then you need to map between them somehow, right? I.e., when printing the repr() of each element, you at least need a list in order to go from enum values to names (and that doesn't feel 'heavy' to me -- it's the minimal possible solution for the job!)
It's unclear whether you mean heavy on the CPU, in the API, in the C code, or whatever, so difficult to give more feedback.
As far as the API goes, you could probably do something like:
    colors = np.enum(['red', 'green', 'blue'])
    arr = np.asarray([colors.red, colors.red, colors.red, colors.blue])
    assert arr[0] == colors.red
    assert np.all(arr.view(np.int8) == [0, 0, 0, 2])
So the strings are only needed in the API in the constructor of the enum type. They are needed there though.
Dag Sverre
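Since `np.enum` is a proposed API and does not exist in NumPy, the following is only a sketch of how the same usage could be emulated today with plain int8 codes. The class name `EnumType` and its `decode` method are hypothetical names introduced for this example.

```python
import numpy as np

class EnumType:
    """Hypothetical stand-in for the proposed np.enum (not a real NumPy API)."""

    def __init__(self, names):
        # The list is the minimal structure needed to go from codes to names.
        self.names = list(names)
        for code, name in enumerate(self.names):
            # Expose each name as an attribute holding its 1-byte integer code.
            setattr(self, name, np.int8(code))

    def decode(self, codes):
        """Map an array of integer codes back to their names."""
        return [self.names[c] for c in codes]

colors = EnumType(['red', 'green', 'blue'])
arr = np.asarray([colors.red, colors.red, colors.red, colors.blue])

assert arr.dtype == np.int8          # values are stored as 1-byte integers
assert arr[0] == colors.red
assert np.all(arr == [0, 0, 0, 2])
print(colors.decode(arr))            # ['red', 'red', 'red', 'blue']
```

The strings appear only in the constructor and when decoding; the array itself holds nothing but small integers.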
On Tue, Jan 3, 2012 at 9:46 AM, Ognen Duzlevski <ognen@enthought.com> wrote:
Hello,
I am playing with adding an enum dtype to numpy (to get my feet wet in numpy really). I have looked at the https://github.com/martinling/numpy_quaternion and I feel comfortable with my understanding of adding a simple type to numpy in technical terms.
Hi Ognen,
I'm in the middle of an intercontinental move, so I can't help much, but I'd also love to see a proper enum/categorical type in numpy, so here are a few notes:
- I wrote a simple cython implementation of this last year, which might be useful -- code attached.
- The barrier I ran into, which you'll surely run into as well, is a flaw in the ufunc API in numpy. Currently, ufunc inner loops do not have any way to access the dtype of the array they are being called on. For most dtypes, this isn't an issue -- the inner loop for adding together int32s knows that it is being called on an array of int32s; it doesn't need to see the dtype to figure that out. But with enums, each array has a different set of possible categories, and these will be attached to the dtype object somehow. So if you want to do, say, equality comparison between an enum-array and a string-array:
    np.enumarray(["a", "b", "c"]) == ["a", "c", "b"] -> np.array([True, False, False])
  ...you can't actually make this work in current numpy. The solution is that the ufunc API needs to be changed to make dtypes somehow available to inner loops. (Probably by passing a pointer to the array object, like all the PyArray_ArrFuncs do.)
  See this thread: http://mail.scipy.org/pipermail/numpy-discussion/2010-August/052401.html
- Both the statistical folk (pandas, statsmodels) and the hdf5 folk (pytables, h5py) have reasons to want better enum support. (Maybe there are other use cases too -- anyone I'm forgetting?) You should make sure to talk to both groups to make sure what you come up with will work for them.
Cheers,
-- Nathaniel
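To make the dtype problem concrete, here is a small Python-level sketch of what such a comparison has to do. It does not touch the ufunc machinery; the variable names `categories` and `codes` are assumptions standing in for data that a real enum dtype would carry.

```python
import numpy as np

# In a real enum dtype, the category table would be attached to the dtype.
categories = np.array(["a", "b", "c"])
# The integer codes are all that the array's data buffer contains.
codes = np.array([0, 1, 2], dtype=np.int8)

other = np.array(["a", "c", "b"])

# The comparison must decode the codes through the category table first.
# An inner loop that only sees the raw int8 buffer cannot do this step,
# because it has no way to reach the table stored on the dtype.
result = categories[codes] == other
print(result.tolist())  # [True, False, False]
```

The decoding step `categories[codes]` is the part that needs access to the dtype, which is why the comparison cannot be written as an ordinary inner loop over the integer data alone.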
Nathaniel,
On Tue, Jan 3, 2012 at 2:02 PM, Nathaniel Smith <njs@pobox.com> wrote:
- The barrier I ran into, which you'll surely run into as well, is a flaw in the ufunc API in numpy. Currently, ufunc inner loops do not have any way to access the dtype of the array they are being called on.
Thanks! The above input is exactly what I was looking for (in addition to my original question). This "corner case" knowledge is good to have ;) Ognen
A categorical type (or enum type) is an important dtype to add to NumPy. It would be very nice if the option existed to make the categorical dtype "dynamic" in that the categories can grow as more data is added or inserted into the array. This would effectively allow binning of data on insertion into the array. The option would need to exist to have both "fixed" and "dynamic" dtypes because there are important use-cases for both. -Travis
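The difference between "fixed" and "dynamic" categories can be sketched in a few lines of Python. The class `Categories` and its `dynamic` flag are hypothetical names chosen for this illustration; nothing here reflects an actual NumPy design.

```python
import numpy as np

class Categories:
    """Hypothetical category table; `dynamic` controls whether it may grow."""

    def __init__(self, names=(), dynamic=False):
        self.names = list(names)
        self.index = {name: i for i, name in enumerate(self.names)}
        self.dynamic = dynamic

    def encode(self, value):
        code = self.index.get(value)
        if code is None:
            if not self.dynamic:
                # Fixed tables reject values outside the declared categories.
                raise ValueError(f"unknown category: {value!r}")
            # Dynamic tables grow: the new value gets the next free code.
            code = len(self.names)
            self.index[value] = code
            self.names.append(value)
        return code

    def encode_all(self, values):
        return np.array([self.encode(v) for v in values], dtype=np.int32)

dynamic = Categories(dynamic=True)
print(dynamic.encode_all(["low", "high", "low", "mid"]).tolist())  # [0, 1, 0, 2]
print(dynamic.names)                                               # ['low', 'high', 'mid']

fixed = Categories(["low", "high"])
try:
    fixed.encode("mid")
except ValueError as exc:
    print(exc)  # unknown category: 'mid'
```

With the dynamic table, data is binned as it is inserted; with the fixed table, an unexpected value is reported as an error instead.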
participants (6)
- Dag Sverre Seljebotn
- Jim Vickroy
- Nathaniel Smith
- Ognen Duzlevski
- Travis Oliphant
- Wes McKinney