
On Tue, Jun 12, 2012 at 10:27 PM, Bryan Van de Ven <bryanv@continuum.io> wrote:
Hi all,
It has been some time, but I do have an update regarding this proposed feature. I thought it would be helpful to flesh out some parts of a possible implementation to learn what can be spelled reasonably in NumPy. Mark Wiebe helped out greatly in navigating the NumPy codebase. Here is a link to my branch with this code:
https://github.com/bryevdv/numpy/tree/enum
and the updated NEP:
https://github.com/bryevdv/numpy/blob/enum/doc/neps/enum.rst
Not everything in the NEP is implemented (integral levels and natural naming in particular), and some parts definitely need more fleshing out. However, things currently work basically as described in the NEP, and there is also a small set of tests that demonstrate current usage. A few things will crash Python (astype especially). More tests are needed. I would appreciate as much feedback and discussion as you can provide!
Hi Bryan,

I skimmed over the diff: https://github.com/bryevdv/numpy/compare/master...enum

It was a bit hard to read, since it seems like about half the changes in that branch are datetime cleanups or something? I hope you'll separate those out -- it's much easier to review self-contained changes, and the more changes you roll together into a big lump, the more risk there is that they'll get lost altogether.
From the updated NEP I actually understand the use case for "open types" now, so that's good :-). But I don't think they're actually workable, so that's bad :-(.

The use case, as I understand it, is for when you want to extend the set of levels on the fly as you read through a file. The problem with this is that it produces a non-deterministic level ordering, where level 0 is whatever was seen first in the file, level 1 is whatever was seen second, etc. E.g., say I have a CSV file I read in:

    subject,initial_skill,skill_after_training
    1,LOW,HIGH
    2,LOW,LOW
    3,HIGH,HIGH
    ...

With the scheme described in the NEP, my initial_skill dtype will have levels ["LOW", "HIGH"], and my skill_after_training dtype will have levels ["HIGH", "LOW"], which means that their storage will be incompatible, comparisons won't work (or will have to go through some nasty convert-to-string-and-back path), etc. Another situation where this will occur is if you have multiple data files in the same format; whether or not you're able to compare the data from them will depend on the order the data happens to occur in each file.

The solution is that whenever we automagically create a set of levels from some data, and the user hasn't specified any order, we should pick an order deterministically by sorting the levels. (This is also what R does: levels(factor(c("a", "b"))) -> "a", "b"; levels(factor(c("b", "a"))) -> "a", "b".) I'm therefore inclined to say that we should just drop the "open type" idea, since it adds complexity but doesn't seem to actually solve the problem it's designed for.
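To make the order-dependence concrete, here is a toy sketch in plain Python (deliberately not using the dtype API from your branch, just standalone helper functions for illustration) showing first-seen level inference versus sorted level inference on the two columns above:

    def levels_first_seen(values):
        # Open-type style: codes are assigned in encounter order,
        # so the result depends on the order of the rows.
        levels = []
        for v in values:
            if v not in levels:
                levels.append(v)
        return levels

    def levels_sorted(values):
        # Deterministic: the same set of values always gives the
        # same level ordering, regardless of row order.
        return sorted(set(values))

    initial_skill = ["LOW", "LOW", "HIGH"]
    skill_after_training = ["HIGH", "LOW", "HIGH"]

    print(levels_first_seen(initial_skill))         # ['LOW', 'HIGH']
    print(levels_first_seen(skill_after_training))  # ['HIGH', 'LOW']  <- incompatible codes
    print(levels_sorted(initial_skill))             # ['HIGH', 'LOW']
    print(levels_sorted(skill_after_training))      # ['HIGH', 'LOW']  <- identical, comparable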
Can you explain why you're using khash instead of PyDict? It seems to add a *lot* of complexity -- it looks like you're using about as many lines of code just marshalling data into and out of the khash as I used for my old npenum.pyx prototype (not even counting all the extra work required to ...), and AFAICT my prototype has about the same amount of functionality as this. (Of course that's not entirely fair, because I was working in Cython... but why not work in Cython?) And you'll need to expose a Python dict interface sooner or later anyway, I'd think?

I can't tell if it's worth having categorical scalar types. What value do they provide over just using scalars of the level type?

Terminology: I'd like to suggest we prefer the term "categorical" for this data, rather than "factor" or "enum". Partly this is because it makes my life easier ;-):
https://groups.google.com/forum/#!msg/pystatsmodels/wLX1-a5Y9fg/04HFKEu45W4J
and partly because numpy has a very diverse set of users, and I suspect that "categorical" will just be a more transparent name to those who aren't already familiar with the particular statistical and programming traditions that "factor" and "enum" come from.

I'm disturbed to see you adding special cases to the core ufunc dispatch machinery for these things. I'm -1 on that. We should clean up the generic ufunc machinery so that it doesn't need special cases to handle adding a simple type like this.

I'm also worried that I still don't see any signs that you're working with the downstream libraries that this functionality is intended to be useful for, like the various HDF5 libraries and pandas. I really don't think this functionality can be merged into numpy until we have affirmative statements from those developers that they are excited about it and will use it, and since they're busy people, it's pretty much your job to track them down and make sure that your code will solve their problems.

Hope that helps -- it's exciting to see someone working on this, and you seem to be off to a good start!

-N