[Numpy-discussion] Enum/Factor NEP (now with code)

Bryan Van de Ven bryanv at continuum.io
Wed Jun 13 12:44:14 EDT 2012

On 6/13/12 8:33 AM, Nathaniel Smith wrote:
> Hi Bryan,
> I skimmed over the diff:
>     https://github.com/bryevdv/numpy/compare/master...enum
> It was a bit hard to read since it seems like about half the changes
> in that branch are datetime cleanups or something? I hope you'll
> separate those out -- it's much easier to review self-contained
> changes, and the more changes you roll together into a big lump, the
> more risk there is that they'll get lost all together.

I'm not quite sure what happened there; my git skills are not advanced 
by any measure. I think the datetime changes are a much smaller fraction 
than fifty percent, but I will see what I can do to separate them out in 
the near future.

>  From the updated NEP I actually understand the use case for "open
> types" now, so that's good :-). But I don't think they're actually
> workable, so that's bad :-(. The use case, as I understand it, is for
> when you want to extend the levels set on the fly as you read through
> a file. The problem with this is that it produces a non-deterministic
> level ordering, where level 0 is whatever was seen first in the file,
> level 1 is whatever was seen second, etc. E.g., say I have a CSV file
> I read in:
>      subject,initial_skill,skill_after_training
>      1,LOW,HIGH
>      2,LOW,LOW
>      3,HIGH,HIGH
>      ...
> With the scheme described in the NEP, my initial_skill dtype will have
> levels ["LOW", "HIGH"], and my skill_after_training dtype will have
> levels ["HIGH","LOW"], which means that their storage will be
> incompatible, comparisons won't work (or will have to go through some

I imagine users sharing the same open dtype object across both fields of 
the structured dtype used to read in the file, if both fields of the 
file contain the same categories. If they don't contain the same 
categories, they are incomparable in any case. I believe many users have 
the simpler use case where each field is a separate categorical 
variable, and they want to read them all in individually, on the fly. 
For these simple cases it would "just work". For your example there 
would definitely be a documentation, examples, tutorials, and education 
effort needed to help users avoid the "gotcha" you describe.
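The shared-mapping idea can be sketched in plain Python (this is not the NEP API, just the bookkeeping it implies; all names here are illustrative):

```python
# Hypothetical sketch: one shared, "open" label -> code mapping used for
# both columns, so the resulting integer codes are directly comparable.
rows = [("LOW", "HIGH"), ("LOW", "LOW"), ("HIGH", "HIGH")]

levels = {}  # shared mapping; grows as new categories appear

def encode(label):
    # Assign the next free code the first time a label is seen.
    if label not in levels:
        levels[label] = len(levels)
    return levels[label]

initial = [encode(a) for a, b in rows]
after = [encode(b) for a, b in rows]

# Because both columns use the same mapping, equality of codes means
# equality of categories:
same = [a == b for a, b in zip(initial, after)]
```

Note this still has the first-seen ordering problem Nathaniel describes: the code numbers depend on which label appears first in the data.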

> nasty convert-to-string-and-back path), etc. Another situation where
> this will occur is if you have multiple data files in the same format;
> whether or not you're able to compare the data from them will depend
> on the order the data happens to occur in each file. The solution
> is that whenever we automagically create a set of levels from some
> data, and the user hasn't specified any order, we should pick an order
> deterministically by sorting the levels. (This is also what R does.
> levels(factor(c("a", "b"))) ->  "a", "b". levels(factor(c("b", "a")))
> ->  "a", "b".)

A solution is to create the dtype object when reading in the first file, 
and to reuse that same dtype object when reading in subsequent files. 
Perhaps it's not ideal, but it does enable the work to be done.
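Nathaniel's deterministic-ordering fix is easy to sketch: derive the level set from the data, then sort it before assigning codes, so the codes no longer depend on the order the data arrived in (plain Python, mirroring R's factor() behavior; not the NEP API):

```python
def make_levels(values):
    # Sort the distinct values so code assignment is deterministic,
    # regardless of first-seen order in any particular file.
    return {v: i for i, v in enumerate(sorted(set(values)))}

# Two files with the same categories seen in different orders:
file_a = ["LOW", "HIGH", "LOW"]
file_b = ["HIGH", "HIGH", "LOW"]

# Both produce the same level -> code mapping:
levels_a = make_levels(file_a)
levels_b = make_levels(file_b)
```

With this, data from separately read files is comparable without having to thread one dtype object through every read.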

> Can you explain why you're using khash instead of PyDict? It seems to
> add a *lot* of complexity -- like it seems like you're using about as
> many lines of code just marshalling data into and out of the khash as
> I used for my old npenum.pyx prototype (not even counting all the
> extra work required to , and AFAICT my prototype has about the same
> amount of functionality as this. (Of course that's not entirely fair,
> because I was working in Cython... but why not work in Cython?) And
> you'll need to expose a Python dict interface sooner or later anyway,
> I'd think?

I suppose I agree with the sentiment that the core of NumPy really ought 
to be less dependent on the Python C API, not more. I also think the 
khash API is dead simple and straightforward, and the fact that it is 
contained in a single header is attractive. It's also quite performant 
in both time and space. But if others disagree strongly: all of its uses 
are hidden behind the interface in leveled_dtypes.c, so it could be 
replaced with some other mechanism easily enough.
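For what it's worth, the bookkeeping the khash code does in C amounts to a two-way mapping between labels and codes, which a Python model can show in a few lines (a rough illustration of what a level table needs to do, not the actual leveled_dtypes.c interface):

```python
class LevelTable:
    """Illustrative two-way mapping between level labels and codes."""

    def __init__(self):
        self._code = {}    # label -> code (the role khash plays in C)
        self._label = []   # code -> label (reverse lookup by index)

    def add(self, label):
        # Insert if unseen; either way, return the label's code.
        if label not in self._code:
            self._code[label] = len(self._label)
            self._label.append(label)
        return self._code[label]

    def label_of(self, code):
        return self._label[code]

t = LevelTable()
codes = [t.add(x) for x in ["b", "a", "b"]]
```

In C the same structure needs explicit marshalling in and out of the hash, which is where the extra lines of code come from.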

> I can't tell if it's worth having categorical scalar types. What value
> do they provide over just using scalars of the level type?

I'm not certain they are worthwhile either, which is why I have not 
spent any time on them yet. Wes has expressed a desire for very broad 
categorical types (more than just scalar categories); hopefully he can 
chime in with his motivations.

> Terminology: I'd like to suggest we prefer the term "categorical" for
> this data, rather than "factor" or "enum". Partly this is because it
> makes my life easier ;-):
>    https://groups.google.com/forum/#!msg/pystatsmodels/wLX1-a5Y9fg/04HFKEu45W4J
> and partly because numpy has a very diverse set of users and I suspect
> that "categorical" will just be a more transparent name to those who
> aren't already familiar with the particular statistical and
> programming traditions that "factor" and "enum" come from.

I think I like "categorical" over "factor", but I am not sure we should 
ditch "enum". There are two different use cases here: I have a pile of 
strings (or scalars) that I want to treat as discrete things 
(categories); and I have a pile of numbers that I want to give 
convenient or meaningful names to (enums). The latter case was the 
motivation for possibly adding "Natural Naming".
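The two use cases can be illustrated in Python (names here are illustrative, and Python's stdlib enum stands in for whatever spelling the NEP would use):

```python
from enum import IntEnum

# Case 1: categorical -- a pile of strings treated as discrete
# categories, stored as integer codes into a sorted level set.
observations = ["LOW", "HIGH", "LOW"]
categories = sorted(set(observations))           # ['HIGH', 'LOW']
codes = [categories.index(x) for x in observations]

# Case 2: enum -- existing numbers given meaningful names
# ("natural naming": status flags, error codes, and the like).
class Skill(IntEnum):
    LOW = 1
    HIGH = 2
```

In the first case the integers are an implementation detail; in the second, the specific integer values are part of the data and the names are the convenience layer.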

> I'm disturbed to see you adding special cases to the core ufunc
> dispatch machinery for these things. I'm -1 on that. We should clean
> up the generic ufunc machinery so that it doesn't need special cases
> to handle adding a simple type like this.

This could certainly be improved, I agree.

> I'm also worried that I still don't see any signs that you're working
> with the downstream libraries that this functionality is intended to
> be useful for, like the various HDF5 libraries and pandas. I really
> don't think this functionality can be merged to numpy until we have
> affirmative statements from those developers that they are excited
> about it and will use it, and since they're busy people, it's pretty
> much your job to track them down and make sure that your code will
> solve their problems.

Francesc is certainly aware of this work, and I emailed Wes earlier this 
week; I probably should have mentioned that, though. Hopefully they will 
have time to contribute their thoughts. I also imagine Travis can speak 
on behalf of the users who have requested this feature over the last 
several years but don't happen to follow the mailing lists.
