Hi all,
I have started working on a NEP for adding an enumerated type to NumPy. It is on my GitHub:
https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum.rst
It is still very rough, and incomplete in places. But I would like to get feedback sooner rather than later in order to refine it. In particular there are a few questions inline in the document that I would like input on. Any comments, suggestions, questions, concerns, etc. are very welcome.
Thanks,
Bryan
On Fri, Mar 9, 2012 at 4:55 PM, Bryan Van de Ven <bryanv@continuum.io> wrote:
Hi all,
I have started working on a NEP for adding an enumerated type to NumPy. It is on my GitHub:
https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum.rst
It is still very rough, and incomplete in places. But I would like to get feedback sooner rather than later in order to refine it. In particular there are a few questions inline in the document that I would like input on. Any comments, suggestions, questions, concerns, etc. are very welcome.
Hi Bryan,
That's excellent, an enumerated type would be very useful. From a quick read, though, what I'd really like to see is some discussion of the goals here -- like some example situations where you see these being used, and the problems they're intended to solve? Because for example, C "enums" are designed to solve a completely different problem than something like an R "factor", and off the top of my head I don't know how well either maps onto HDF5 enumerated types. Another example is that I can't tell from the document what the motivation for having both "open" and "closed" enums is?
(Also, general question: is there some technical advantage to being able to represent more complicated dtypes as strings, that justifies making up these mini-languages like "enum:uint16[A, B, C, D, E:128]"? It can't be necessary for pickling or anything, right, since AFAICT there's already no string representation for structured dtypes? It just seems like it'd be simpler and more elegant to use some Python syntax like 'dtype(Enum(["a", "b", "c"], storage=np.uint16))' instead of writing a tiny one-off parser and wedging what's really a data structure into a string, but I may be missing something.)
-- Nathaniel
Hi, On Sat, Mar 10, 2012 at 3:25 AM, Bryan Van de Ven <bryanv@continuum.io> wrote:
Hi all,
I have started working on a NEP for adding an enumerated type to NumPy. It is on my GitHub:
https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum.rst
It is still very rough, and incomplete in places. But I would like to get feedback sooner rather than later in order to refine it. In particular there are a few questions inline in the document that I would like input on. Any comments, suggestions, questions, concerns, etc. are very welcome.
"t = np.dtype('enum', map=(n,v))" ^ Is this supposed to be indicating 'this is an enum with values ranging between n and v'? It could be a bit more clear. Is it possible to partially define an enum? That is, give the maximum and minimum values, and only some of the enumeration value:name mappings? For example, an enum where 0 means 'n/a', +n means 'Type A Object #(n-1)' and -n means 'Type B Object #(abs(n) - 1)'. I just want to map the non-scalar values, while having a way to avoid treating valid scalar values (eg +64) as out-of-range. Example of what I mean: "t = np.dtype('enum[N_A:0]', range = (-127, 127))" (defined values being printed as a string, undefined being printed as a number.) David
On Fri, Mar 9, 2012 at 5:48 PM, David Gowers (kampu) <00ai99@gmail.com> wrote:
Hi,
On Sat, Mar 10, 2012 at 3:25 AM, Bryan Van de Ven <bryanv@continuum.io> wrote:
Hi all,
I have started working on a NEP for adding an enumerated type to NumPy. It is on my GitHub:
https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum.rst
It is still very rough, and incomplete in places. But I would like to get feedback sooner rather than later in order to refine it. In particular there are a few questions inline in the document that I would like input on. Any comments, suggestions, questions, concerns, etc. are very welcome.
"t = np.dtype('enum', map=(n,v))"
^ Is this supposed to be indicating 'this is an enum with values ranging between n and v'? It could be a bit more clear.
Is it possible to partially define an enum? That is, give the maximum and minimum values, and only some of the enumeration value:name mappings? For example, an enum where 0 means 'n/a', +n means 'Type A Object #(n-1)' and -n means 'Type B Object #(abs(n) - 1)'. I just want to map the non-scalar values, while having a way to avoid treating valid scalar values (eg +64) as out-of-range. Example of what I mean:
"t = np.dtype('enum[N_A:0]', range = (-127, 127))" (defined values being printed as a string, undefined being printed as a number.)
David _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
I'll have to think about this (a little brain dump here). I have many use cases in pandas where this would be useful which are basically direct translations of R's factor data type. Note that R always coerces the levels (the unique values) AFAICT to string type. However, mapping back to a well-dtyped array is important, too. So the temptation might be to do something like this:
    ndarray: dtype storage type (uint32 or something)
    mapping: khash with type PyObject* -> uint32
Now, one problem with this is that you want the mapping + dtype to be invertible (otherwise you're left doing some type inference). The way that I implement the mapping is to restrict the labeling to be from 0 to N - 1 which makes things easier.
If we decide that having an explicit value mapping
The nice thing about this is that the same set of core algorithms can be used to fix numpy.unique. For example you would like to be able to do:
    enum_arr = np.enum(arr)
(this seems like a reasonable API to me) and that is a direct equivalent of R's factor function. You need to be able to pass an explicit ordering when calling the enum/factor function. If not specified, you should have an option to either sort or not -- for example suppose you convert an array of 1 million integers to enum but you don't particularly care about the uniques (which could be very large, up to the size of the array) being ordered (no need to pay N log N for large N).
One nice thing about khash is that it can be serialized fairly easily.
Have you looked much at how I use enum-like ideas in pandas? It would be great if I could offload some of this data algorithmic work to NumPy.
We will want the enum data type to integrate with text file readers -- if you "factorize as you go" you can drastically reduce the memory usage of a structured array (or pandas DataFrame) columns with long-ish strings and relatively few unique values.
- Wes
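As a rough illustration of the 0 to N - 1 labeling described above, the following sketch factorizes a sequence with an ordinary Python dict standing in for khash. The function name `factorize` and its `sort` flag are assumptions made for this example; it is not the pandas implementation and not a proposed NumPy API.

```python
import numpy as np

def factorize(values, sort=False):
    """Return (codes, levels) with codes in 0..N-1 (illustrative sketch)."""
    table = {}  # level -> code, in order of first appearance
    codes = np.empty(len(values), dtype=np.uint32)
    for i, v in enumerate(values):
        # New levels get the next unused code; known levels reuse theirs.
        codes[i] = table.setdefault(v, len(table))
    levels = list(table)  # dicts keep insertion order (Python 3.7+)
    if sort:
        # Optional O(N log N) step over the uniques only.
        sorted_levels = sorted(levels)
        remap = np.empty(len(levels), dtype=np.uint32)
        for new_code, level in enumerate(sorted_levels):
            remap[table[level]] = new_code
        codes = remap[codes]
        levels = sorted_levels
    return codes, levels

codes, levels = factorize(["b", "a", "b", "c"])
sorted_codes, sorted_levels = factorize(["b", "a", "b", "c"], sort=True)
# The mapping is invertible: indexing the levels by code recovers the input.
roundtrip = [sorted_levels[c] for c in sorted_codes]
```

Skipping the sort keeps the whole pass linear in the input size, which is the trade-off mentioned for large arrays whose uniques do not need to be ordered.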
Hi Wes, On Mon, Mar 12, 2012 at 9:33 AM, Wes McKinney <wesmckinn@gmail.com> wrote:
Now, one problem with this is that you want the mapping + dtype to be invertible (otherwise you're left doing some type inference). The way that I implement the mapping is to restrict the labeling to be from 0 to N - 1 which makes things easier.
If we decide that having an explicit value mapping (...?)
You might want to finish whatever thought that was :)
On Fri, Mar 9, 2012 at 8:55 AM, Bryan Van de Ven <bryanv@continuum.io> wrote:
Hi all,
I have started working on a NEP for adding an enumerated type to NumPy. It is on my GitHub:
https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum.rst
It is still very rough, and incomplete in places. But I would like to get feedback sooner rather than later in order to refine it. In particular there are a few questions inline in the document that I would like input on. Any comments, suggestions, questions, concerns, etc. are very welcome.
This looks like a great start to me.
I think the open/closed enum distinction will need to be explored a little bit more, because it interacts with dtype immutability/hashability. Do you know if there are any examples of Python objects in the wild that dynamically convert from not being hashable (i.e. raising an exception if used as a dict key) to become hashable?
It might be worth adding a section which briefly compares and contrasts the proposed functionality with enums in various programming languages. Here are two links I found to try and get an idea:
MS on C# enum usage: http://msdn.microsoft.com/en-us/library/cc138362.aspx
Wikipedia on C++ enum class: http://en.wikipedia.org/wiki/C%2B%2B11#Strongly_typed_enumerations
For example, the C# enum has a way to enable a "flags" mode, which will create successive powers of 2. This may not be a feature NumPy needs, but if people are finding it useful in C#, maybe it would be useful here too.
Cheers, Mark
Thanks,
Bryan
On 03/13/2012 06:44 PM, Mark Wiebe wrote:
On Fri, Mar 9, 2012 at 8:55 AM, Bryan Van de Ven <bryanv@continuum.io> wrote:
Hi all,
I have started working on a NEP for adding an enumerated type to NumPy. It is on my GitHub:
https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum.rst
It is still very rough, and incomplete in places. But I would like to get feedback sooner rather than later in order to refine it. In particular there are a few questions inline in the document that I would like input on. Any comments, suggestions, questions, concerns, etc. are very welcome.
This looks like a great start to me.
I think the open/closed enum distinction will need to be explored a little bit more, because it interacts with dtype immutability/hashability. Do you know if there are any examples of Python objects in the wild that dynamically convert from not being hashable (i.e. raising an exception if used as a dict key) to become hashable?
In Sage, the matrix objects are mutable when constructed, and you can set_immutable to make them immutable. The way I look at that though is that it is part of the construction phase of the object, you'd typically construct, fill it in, then set_immutable (to finish construction), then use it. set/frozenset is an example of the opposite, and a design I personally like better (i.e., "frozen_dtype" :-)). Dag
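The set/frozenset contrast can be shown with a few lines of standard Python. This is only a sketch of the hashability point; `is_hashable`, `levels`, and `registry` are illustrative names, and nothing here is a NumPy dtype.

```python
def is_hashable(obj):
    """Return True if obj can be used as a dict key."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

levels = {"a", "b"}                 # mutable while levels are being collected
frozen_levels = frozenset(levels)   # separate immutable type, fixed contents

open_ok = is_hashable(levels)
closed_ok = is_hashable(frozen_levels)

# Only the frozen variant can serve as a dict key.
registry = {frozen_levels: "levels a, b"}
```

With two distinct types, an object never changes from unhashable to hashable during its lifetime, which is the property being preferred here.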
It might be worth adding a section which briefly compares and contrasts the proposed functionality with enums in various programming languages. Here are two links I found to try and get an idea:
MS on C# enum usage: http://msdn.microsoft.com/en-us/library/cc138362.aspx Wikipedia on C++ enum class: http://en.wikipedia.org/wiki/C%2B%2B11#Strongly_typed_enumerations
For example, the C# enum has a way to enable a "flags" mode, which will create successive powers of 2. This may not be a feature NumPy needs, but if people are finding it useful in C#, maybe it would be useful here too.
Cheers, Mark
Thanks,
Bryan
On Wed, Mar 14, 2012 at 1:44 AM, Mark Wiebe <mwwiebe@gmail.com> wrote:
On Fri, Mar 9, 2012 at 8:55 AM, Bryan Van de Ven <bryanv@continuum.io> wrote:
Hi all,
I have started working on a NEP for adding an enumerated type to NumPy. It is on my GitHub:
https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum.rst
It is still very rough, and incomplete in places. But I would like to get feedback sooner rather than later in order to refine it. In particular there are a few questions inline in the document that I would like input on. Any comments, suggestions, questions, concerns, etc. are very welcome.
This looks like a great start to me.
I think the open/closed enum distinction will need to be explored a little bit more, because it interacts with dtype immutability/hashability. Do you know if there are any examples of Python objects in the wild that dynamically convert from not being hashable (i.e. raising an exception if used as a dict key) to become hashable?
I haven't run into any...
Thinking about it, I'm not sure I have any use case for this type being mutable. Maybe someone else can think of one? The first case that came to mind was in reading a large text file, where you want to (1) auto-create an enum, (2) use a pre-allocated array, and (3) don't know ahead of time what the levels are:
    a = np.empty(lines_in_file, dtype=np.dtype(Enum()))
    for i, line in enumerate(f):
        field = line.split()[0]
        a.dtype.add_level(field)
        a[i] = field
    a.dtype.seal()
But really this can be done just as easily and efficiently without a mutable dtype:
    a = np.empty(lines_in_file, dtype=np.int32)
    intern_table = {}
    next_level = 0
    for i, line in enumerate(f):
        field = line.split()[0]
        val = intern_table.setdefault(field, next_level)
        if val == next_level:
            next_level += 1
        a[i] = val
    a = a.view(dtype=np.dtype(Enum(map=intern_table)))
I notice that the HDF5 C library has a concept of open versus closed enums, but I can't tell from the documentation at hand why this is; it looks like it might just be a limitation of the implementation. (Like, a workaround for C's lack of a standard mapping type, which makes it inconvenient to pass all the mappings in to a single API call.)
It might be worth adding a section which briefly compares and contrasts the proposed functionality with enums in various programming languages. Here are two links I found to try and get an idea:
MS on C# enum usage: http://msdn.microsoft.com/en-us/library/cc138362.aspx Wikipedia on C++ enum class: http://en.wikipedia.org/wiki/C%2B%2B11#Strongly_typed_enumerations
For example, the C# enum has a way to enable a "flags" mode, which will create successive powers of 2. This may not be a feature NumPy needs, but if people are finding it useful in C#, maybe it would be useful here too.
There's also a long, ongoing debate about how to do enums in Python -- e.g.:
http://www.python.org/dev/peps/pep-0354/
http://pypi.python.org/pypi/enum/
http://pypi.python.org/pypi/enum_meta/
http://pypi.python.org/pypi/flufl.enum/
http://pypi.python.org/pypi/lazr.enum/
http://pypi.python.org/pypi/pyutilib.enum/
http://pypi.python.org/pypi/coding/
http://stackoverflow.com/questions/36932/whats-the-best-way-to-implement-an-...
I guess Guido likes flufl.enum: http://mail.python.org/pipermail/python-ideas/2011-July/010909.html
BUT, I'm not sure any of this is relevant at all. "Enums" are a programming language feature that are, first and foremost, about injecting names into your code's namespace. What I'm hoping to see is a dtype for holding categorical data, similar to an R "factor"
http://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html
https://svn.r-project.org/R/trunk/src/library/base/R/factor.R
(NB: This is GPL code if anyone is paranoid about contamination, but also the most complete API description available)
or an HDF5 "enum"
http://www.hdfgroup.org/HDF5/doc/H5.user/Datatypes.html#Datatypes_Enum
I believe pandas has some functionality along these lines too, though I can't find it in the online docs -- hopefully Wes will fill us in.
These are basically objects that act for most purposes like string arrays, but in which all strings are required to come from a finite, specified list. This list acts like some metadata attached to the array; its order may or may not be significant. And they're implemented internally as integer arrays.
I'm not sure what it would even mean to treat this kind of data as "flags", since you can't take the bitwise-or of two strings...
-- Nathaniel
On Thursday, March 15, 2012, Nathaniel Smith <njs@pobox.com> wrote:
On Wed, Mar 14, 2012 at 1:44 AM, Mark Wiebe <mwwiebe@gmail.com> wrote:
On Fri, Mar 9, 2012 at 8:55 AM, Bryan Van de Ven <bryanv@continuum.io> wrote:
Hi all,
I have started working on a NEP for adding an enumerated type to NumPy. It is on my GitHub:
https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum.rst
It is still very rough, and incomplete in places. But I would like to get feedback sooner rather than later in order to refine it. In particular there are a few questions inline in the document that I would like input on. Any comments, suggestions, questions, concerns, etc. are very welcome.
This looks like a great start to me.
I think the open/closed enum distinction will need to be explored a little bit more, because it interacts with dtype immutability/hashability. Do you know if there are any examples of Python objects in the wild that dynamically convert from not being hashable (i.e. raising an exception if used as a dict key) to become hashable?
I haven't run into any...
Thinking about it, I'm not sure I have any use case for this type being mutable. Maybe someone else can think of one? The first case that came to mind was in reading a large text file, where you want to (1) auto-create an enum, (2) use a pre-allocated array, and (3) don't know ahead of time what the levels are:
    a = np.empty(lines_in_file, dtype=np.dtype(Enum()))
    for i, line in enumerate(f):
        field = line.split()[0]
        a.dtype.add_level(field)
        a[i] = field
    a.dtype.seal()
But really this can be done just as easily and efficiently without a mutable dtype:
    a = np.empty(lines_in_file, dtype=np.int32)
    intern_table = {}
    next_level = 0
    for i, line in enumerate(f):
        field = line.split()[0]
        val = intern_table.setdefault(field, next_level)
        if val == next_level:
            next_level += 1
        a[i] = val
    a = a.view(dtype=np.dtype(Enum(map=intern_table)))
I notice that the HDF5 C library has a concept of open versus closed enums, but I can't tell from the documentation at hand why this is; it looks like it might just be a limitation of the implementation. (Like, a workaround for C's lack of a standard mapping type, which makes it inconvenient to pass in all the mappings in to a single API call.)
It might be worth adding a section which briefly compares and contrasts the proposed functionality with enums in various programming languages. Here are two links I found to try and get an idea:
MS on C# enum usage: http://msdn.microsoft.com/en-us/library/cc138362.aspx Wikipedia on C++ enum class: http://en.wikipedia.org/wiki/C%2B%2B11#Strongly_typed_enumerations
For example, the C# enum has a way to enable a "flags" mode, which will create successive powers of 2. This may not be a feature NumPy needs, but if people are finding it useful in C#, maybe it would be useful here too.
There's also a long, ongoing debate about how to do enums in Python -- e.g.: http://www.python.org/dev/peps/pep-0354/ http://pypi.python.org/pypi/enum/ http://pypi.python.org/pypi/enum_meta/ http://pypi.python.org/pypi/flufl.enum/ http://pypi.python.org/pypi/lazr.enum/ http://pypi.python.org/pypi/pyutilib.enum/ http://pypi.python.org/pypi/coding/
http://stackoverflow.com/questions/36932/whats-the-best-way-to-implement-an-...
I guess Guido likes flufl.enum: http://mail.python.org/pipermail/python-ideas/2011-July/010909.html
BUT, I'm not sure any of this is relevant at all. "Enums" are a programming language feature that are, first and foremost, about injecting names into your code's namespace. What I'm hoping to see is a dtype for holding categorical data, similar to an R "factor" http://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html https://svn.r-project.org/R/trunk/src/library/base/R/factor.R (NB: This is GPL code if anyone is paranoid about contamination, but also the most complete API description available) or an HDF5 "enum" http://www.hdfgroup.org/HDF5/doc/H5.user/Datatypes.html#Datatypes_Enum I believe pandas has some functionality along these lines too, though I can't find it in the online docs -- hopefully Wes will fill us in.
These are basically objects that act for most purposes like string arrays, but in which all strings are required to come from a finite, specified list. This list acts like some metadata attached to the array; its order may or may not be significant. And they're implemented internally as integer arrays.
I'm not sure what it would even mean to treat this kind of data as "flags", since you can't take the bitwise-or of two strings...
-- Nathaniel
I guess my problem is that this isn't _quite_ like an enum that I am familiar with (but not quite unlike it either). Should we call it "factor", to avoid confusion or are there going to be too many that won't know what that is, but would be drawn in by a name of "enum"? Just a thought. Ben Root
On Thu, Mar 15, 2012 at 4:02 PM, Nathaniel Smith <njs@pobox.com> wrote:
I'm not sure what it would even mean to treat this kind of data as "flags", since you can't take the bitwise-or of two strings...
This makes more sense outside of ndarrays, where you would do something like:
    enum FLAG0 = 1, FLAG1 = 2, FLAG2 = 4
    do_something(data, mode=FLAG0 | FLAG2)
The enum is therefore just a handle for its numerical value. While it may not be that useful in an array, I think Mark was just pointing out that there may be other similar use cases, such as enumerating from 0 to N-1, or in reverse from N-1 down to 0, or in steps of 2, or in powers of 2, etc.
Stéfan
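For readers who want to try the flag idea in plain Python, here is a small sketch using the standard library's `enum.IntFlag` (available in Python 3.6 and later, so newer than this thread). The class name `Mode` and the helper `active_flags` are hypothetical; this shows flags as handles for numeric values, not anything array-related.

```python
import enum

class Mode(enum.IntFlag):
    # Successive powers of two, so each flag occupies its own bit.
    FLAG0 = 1
    FLAG1 = 2
    FLAG2 = 4

def active_flags(mode):
    """List the names of the single-bit flags that are set in `mode`."""
    return [flag.name for flag in (Mode.FLAG0, Mode.FLAG1, Mode.FLAG2)
            if flag & mode]

# Flags are combined with bitwise OR; the result is still an integer.
combined = Mode.FLAG0 | Mode.FLAG2
names = active_flags(combined)
numeric_value = int(combined)
```

The powers-of-two numbering is what makes the OR combination unambiguous, which is why a "flags" mode is a numbering scheme rather than a separate kind of data.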
On Mar 16, 2012 1:02 AM, "Stéfan van der Walt" <stefan@sun.ac.za> wrote:
On Thu, Mar 15, 2012 at 4:02 PM, Nathaniel Smith <njs@pobox.com> wrote:
I'm not sure what it would even mean to treat this kind of data as "flags", since you can't take the bitwise-or of two strings...
This makes more sense outside of ndarrays, where you would do something like:
    enum FLAG0 = 1, FLAG1 = 2, FLAG2 = 4
    do_something(data, mode=FLAG0 | FLAG2)
The enum is therefore just a handle for its numerical value. While it may not be that useful in an array, I think Mark was just pointing out that there may be other similar use cases, such as enumerating from 0 to N-1, or in reverse from N-1 down to 0, or in steps of 2, or in powers of 2, etc.
Right, there may be. But are there? That's the question :-)
It looks like R doesn't support anything except 1, ..., N numbering. There's really no reason it would either, since in their design the underlying integer values are almost entirely hidden from users. You could get at them if you wanted, but I bet less than 1% of users are even aware that factors and integers have anything to do with each other. Factors are just documented to be a way to store an array of strings drawn from a limited ordered list. (The ordering is important for things like polynomial coding and treatment versus baseline coding.)
HDF5 supports arbitrary symbol<->integer mappings.
0, ..., N-1 coding makes the common problem of creating an indicator matrix very convenient:
    ind = np.zeros((len(enum_a), len(enum_a.dtype.levels)), dtype=bool)
    ind[np.arange(len(enum_a)), enum_a.view(dtype=np.int32)] = True
But we can't restrict ourselves to only this coding if we want compatibility with HDF5 or R (because R is 1-based). So I guess supporting arbitrary mappings is worth it -- though I doubt this flexibility will be used much. I'm curious if anyone can think of a reason they'd use it besides interoperability.
Cheers,
- Nathaniel
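The indicator-matrix construction can be tried today with plain integer code arrays. The sketch below assumes no enum dtype at all; `indicator_matrix` and the sample arrays are illustrative names. It also shows how 1-based codes and arbitrary codes can be brought back to 0, ..., N-1 first.

```python
import numpy as np

def indicator_matrix(codes, n_levels):
    """Build a boolean (n_obs, n_levels) matrix with one True per row.

    Assumes `codes` are already dense integers in 0..n_levels-1.
    """
    codes = np.asarray(codes)
    ind = np.zeros((len(codes), n_levels), dtype=bool)
    # Pair each row index with its code so exactly one cell per row is set.
    ind[np.arange(len(codes)), codes] = True
    return ind

# 0-based codes can be used directly.
zero_based = np.array([0, 2, 1, 2])
ind = indicator_matrix(zero_based, 3)

# 1-based codes (the R convention) only need an offset.
one_based = zero_based + 1
ind_r = indicator_matrix(one_based - 1, 3)

# Arbitrary integer codes need a lookup; np.unique supplies dense codes.
arbitrary = np.array([10, 128, 10, 7])
values, dense = np.unique(arbitrary, return_inverse=True)
ind_arbitrary = indicator_matrix(dense, len(values))
```

The extra `np.unique` pass is the cost of supporting arbitrary mappings: the convenient indexing trick only works once the codes are dense.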
Hi all,
I have spent some time thinking about things, and discussing them with folks nearby. I actually got to wondering whether we really need new dtypes for this. It seems like enumerated values or factor levels could be cast as an annotation or metadata that could be attached to any existing integral dtypes. It spells differently enough that I have put up an alternate version that reflects this notion. I'd like to see what folks think of this direction:
https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum_alt.rst
So this would require adding machinery to existing dtypes to behave properly when there is factor metadata present. Perhaps that is not an acceptable trade-off, but it seems worth discussing.
I think a very similar approach could be used to add categorical ranges to any numerical or string types (I think they are called "shingles" in R?)
Please let me know what you think.
Bryan
On Fri, Mar 16, 2012 at 4:26 PM, Bryan Van de Ven <bryanv@continuum.io> wrote:
Hi all,
I have spent some time thinking about things, and discussing them with folks nearby. I actually got to wondering whether we really need new dtypes for this. It seems like enumerated values or factor levels could be cast as an annotation or metadata that could be attached to any existing integral dtypes. It spells differently enough that I have put up an alternate version that reflects this notion. I'd like to see what folks think of this direction:
https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum_alt.rst
So this would require adding machinery to existing dtypes to behave properly when there is factor metadata present. Perhaps that is not an acceptable trade-off, but it seems worth discussing.
I took a look at this, but I think something was lost in the translation from your head to text :-). Your description here makes it sound like what's different about this proposal is that there's very different underlying mechanics, but the enum_alt file just seems to describe an alternative and more-or-less equivalent user-level API. Unless you told me, I would have assumed that it just created a new dtype, rather than modified existing ones. What mechanism are you thinking of? Or did I miss something?
I think a very similar approach could be used to add categorical ranges to any numerical or string types (I think they are called "shingles" in R?)
A 'shingle' is a way of mapping (floating point) numbers into categories. However, they generally allow a single number to fall into multiple categories. So for example, you might take these data points:
    1 2 3 4 5 6 7 8 9 10 11
And divide them into categories A, B, C like this:
    1 2 3 4 5 6 7 8 9 10 11
    AAAAAAAAAAAAA
        BBBBBBBBBBBBB
            CCCCCCCCCCCCCCC
Which is why they're called "shingles" :-)
http://www.floridadisaster.org/hrg/images/roofs/shingle_loose_tab_large.jpg
This can be a very convenient data structure for various sorts of visualizations, but I'm not sure how it would make sense to integrate it into basic numerical types.
R has a more basic function called 'cut' which takes a numerical array plus some specified breakpoints, and returns a factor array. But that's a simple utility function that doesn't need any special features in the underlying representation.
-- Nathaniel
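To contrast the two ideas, here is a hedged sketch in NumPy. The 'cut'-style part uses `np.digitize`, which assigns each number to exactly one bin; the shingle part uses overlapping intervals, so a number can belong to several categories. The breakpoints, level names, and interval endpoints are all made up for illustration, and this is not a port of R's `cut` or `shingle`.

```python
import numpy as np

x = np.arange(1, 12)  # the data points 1..11

# cut-style: disjoint bins, one code per value.
breaks = np.array([4, 8])                   # assumed breakpoints
levels = np.array(["low", "mid", "high"])   # assumed level names
codes = np.digitize(x, breaks)              # 0: x < 4, 1: 4 <= x < 8, 2: x >= 8
labels = levels[codes]

# shingle-style: overlapping inclusive intervals, so membership is a
# boolean mask per category rather than a single code per value.
intervals = {"A": (1, 7), "B": (3, 9), "C": (5, 11)}  # assumed endpoints
membership = {name: (x >= lo) & (x <= hi)
              for name, (lo, hi) in intervals.items()}
categories_of_5 = [name for name, mask in membership.items() if mask[x == 5][0]]
```

The cut result fits in a single integer array plus a level list, whereas the shingle needs one mask per category, which illustrates why it does not map onto a single code array the same way.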
participants (8)
- Benjamin Root
- Bryan Van de Ven
- Dag Sverre Seljebotn
- David Gowers (kampu)
- Mark Wiebe
- Nathaniel Smith
- Stéfan van der Walt
- Wes McKinney