[Numpy-discussion] Enum/Factor NEP (now with code)

Nathaniel Smith njs at pobox.com
Wed Jun 13 14:12:05 EDT 2012


On Wed, Jun 13, 2012 at 5:44 PM, Bryan Van de Ven <bryanv at continuum.io> wrote:
> On 6/13/12 8:33 AM, Nathaniel Smith wrote:
>> Hi Bryan,
>>
>> I skimmed over the diff:
>>     https://github.com/bryevdv/numpy/compare/master...enum
>> It was a bit hard to read since it seems like about half the changes
>> in that branch are datetime cleanups or something? I hope you'll
>> separate those out -- it's much easier to review self-contained
>> changes, and the more changes you roll together into a big lump, the
>> more risk there is that they'll get lost altogether.
>
> I'm not quite sure what happened there, my git skills are not advanced
> by any measure. I think the datetime changes are a much smaller fraction
> than fifty percent, but I will see what I can do to separate them out in
> the near future.

Looking again, it looks like a lot of it is actually because when I
asked github to show me the diff between your branch and master, it
showed me the diff between your branch and *your repository's* version
of master. And your branch is actually based off a newer version of
'master' than you have in your repository. So, as far as git and
github are concerned, all those changes that are included in
your-branch's-base-master but not in your-repo's-master are new stuff
that you did on your branch. Solution is just to do
  git push <your github remote name> master

>> From the updated NEP I actually understand the use case for "open
>> types" now, so that's good :-). But I don't think they're actually
>> workable, so that's bad :-(. The use case, as I understand it, is for
>> when you want to extend the levels set on the fly as you read through
>> a file. The problem with this is that it produces a non-deterministic
>> level ordering, where level 0 is whatever was seen first in the file,
>> level 1 is whatever was seen second, etc. E.g., say I have a CSV file
>> I read in:
>>
>>      subject,initial_skill,skill_after_training
>>      1,LOW,HIGH
>>      2,LOW,LOW
>>      3,HIGH,HIGH
>>      ...
>>
>> With the scheme described in the NEP, my initial_skill dtype will have
>> levels ["LOW", "HIGH"], and by skill_after_training dtype will have
>> levels ["HIGH","LOW"], which means that their storage will be
>> incompatible, comparisons won't work (or will have to go through some
>
> I imagine users using the same open dtype object in both fields of the
> structure dtype used to read in the file, if both fields of the file
> contain the same categories. If they don't contain the same categories,
> they are incomparable in any case. I believe many users have this
> simpler use case where each field is a separate category, and they want
> to read them all individually, on the fly. For these simple
> cases, it would "just work". For your example there would
> definitely be a documentation, examples, tutorials, and education issue,
> to avoid the "gotcha" you describe.

Yes, of course we *could* write the code to implement these "open"
dtypes, and then write the documentation, examples, tutorials, etc. to
help people work around their limitations. Or, we could just implement
np.fromfile properly, which would require no workarounds and take less
code to boot.
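To make the ordering problem concrete, here's a plain-Python sketch (just
lists and .index(), no NEP machinery) of what "assign codes in order of
first appearance" does to the two columns of the CSV above:

  # Stand-in for an "open" level set: codes are handed out in the order
  # each label is first seen while reading the column.
  def encode_first_seen(values):
      levels, codes = [], []
      for v in values:
          if v not in levels:
              levels.append(v)            # new label gets the next code
          codes.append(levels.index(v))
      return levels, codes

  initial_skill        = ["LOW", "LOW", "HIGH"]   # column 2 above
  skill_after_training = ["HIGH", "LOW", "HIGH"]  # column 3 above

  print(encode_first_seen(initial_skill))         # (['LOW', 'HIGH'], [0, 0, 1])
  print(encode_first_seen(skill_after_training))  # (['HIGH', 'LOW'], [0, 1, 0])
  # Code 0 means "LOW" in one column and "HIGH" in the other, so the
  # stored integers can't be compared directly.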

>> nasty convert-to-string-and-back path), etc. Another situation where
>> this will occur is if you have multiple data files in the same format;
>> whether or not you're able to compare the data from them will depend
>> on the order in which the data happens to occur in each file. The solution
>> is that whenever we automagically create a set of levels from some
>> data, and the user hasn't specified any order, we should pick an order
>> deterministically by sorting the levels. (This is also what R does.
>> levels(factor(c("a", "b"))) ->  "a", "b". levels(factor(c("b", "a")))
>> ->  "a", "b".)
>
> A solution is to create the dtype object when reading in the first file,
> and to reuse that same dtype object when reading in subsequent files.
> Perhaps it's not ideal, but it does enable the work to be done.

So would a proper implementation of np.fromfile that normalized the
level ordering.
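For comparison, here's a sketch of what that normalization would look
like -- sort the observed labels before assigning codes, which is
effectively what R's levels(factor(...)) does -- so two columns (or two
files) with the same label set always get the same encoding:

  def encode_sorted(values):
      levels = sorted(set(values))                 # deterministic level order
      table = {name: code for code, name in enumerate(levels)}
      return levels, [table[v] for v in values]

  print(encode_sorted(["LOW", "LOW", "HIGH"]))     # (['HIGH', 'LOW'], [1, 1, 0])
  print(encode_sorted(["HIGH", "LOW", "HIGH"]))    # (['HIGH', 'LOW'], [0, 1, 0])
  # Both columns now agree that code 0 == "HIGH", so comparison and
  # concatenation work without a string round-trip.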

>> Can you explain why you're using khash instead of PyDict? It seems to
>> add a *lot* of complexity -- like it seems like you're using about as
>> many lines of code just marshalling data into and out of the khash as
>> I used for my old npenum.pyx prototype (not even counting all the
>> extra work required), and AFAICT my prototype has about the same
>> amount of functionality as this. (Of course that's not entirely fair,
>> because I was working in Cython... but why not work in Cython?) And
>> you'll need to expose a Python dict interface sooner or later anyway,
>> I'd think?
>
> I suppose I agree with the sentiment that the core of NumPy really ought
> to be less dependent on the Python C API, not more. I also think the
> khash API is pretty dead simple and straightforward, and the fact that
> it is contained in a single header is attractive. It's also quite
> performant in time and space. But if others disagree strongly, all of
> its uses are hidden behind the interface in leveled_dtypes.c, so it could
> be replaced with some other mechanism easily enough.

I'm not at all convinced by the argument that throwing random
redundant data structures into NumPy will somehow reduce our dependence on
the Python C API. If you have a plan to replace *all* use of dicts in
numpy with khash, then we can talk about that, I guess. But that would
be a separate patch, and I don't think using PyDict in this patch
would really have any effect on how difficult that separate patch was
to do.

PyDict also has a very simple API -- and in fact, the comparison is
between the PyDict API+the khash API, versus just the PyDict API
alone, since everyone working with the Python C API already has to
know how that works. It's also contained in effectively zero header
files, which is even more attractive than one header file. And that
interface in leveled_dtypes.c is the one that I was talking about
being larger than my entire categorical dtype implementation.

None of this means that using it is a bad idea, of course! Maybe it
has some key advantage over PyDict in terms of memory use or
something, for those people who have hundreds of thousands of distinct
categories in their data, I don't know. But all your arguments here
seem to be of the form "hey, it's not *that* bad", and it seems like
there must be some actual affirmative advantages it has over PyDict if
it's going to be worth using.
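For reference, the mapping such a dtype has to maintain is pretty small;
here's a hypothetical Python-level sketch of the interface that would
eventually need to be exposed anyway (the names are made up for
illustration, and in C it would be backed by either khash or PyDict):

  class LevelTable:
      """Hypothetical level <-> code mapping for a categorical dtype."""
      def __init__(self, levels):
          self._levels = list(levels)          # code -> name
          self._codes = {name: i for i, name in enumerate(self._levels)}  # name -> code

      def code_of(self, name):
          return self._codes[name]

      def name_of(self, code):
          return self._levels[code]

  t = LevelTable(["HIGH", "LOW"])
  print(t.code_of("LOW"), t.name_of(0))        # 1 HIGH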

>> I can't tell if it's worth having categorical scalar types. What value
>> do they provide over just using scalars of the level type?
>
> I'm not certain they are worthwhile either, which is why I did not spend
> any time on them yet. Wes has expressed a desire for very broad
> categorical types (even more than just scalar categories), hopefully he
> can chime in with his motivations.
>
>> Terminology: I'd like to suggest we prefer the term "categorical" for
>> this data, rather than "factor" or "enum". Partly this is because it
>> makes my life easier ;-):
>>    https://groups.google.com/forum/#!msg/pystatsmodels/wLX1-a5Y9fg/04HFKEu45W4J
>> and partly because numpy has a very diverse set of users and I suspect
>> that "categorical" will just be a more transparent name to those who
>> aren't already familiar with the particular statistical and
>> programming traditions that "factor" and "enum" come from.
>
> I think I like "categorical" over "factor" but I am not sure we should
> ditch "enum". There are two different use cases here: I have a pile of
> strings (or scalars) that I want to treat as discrete things
> (categories), and: I have a pile of numbers that I want to give
> convenient or meaningful names to (enums). This latter case was the
> motivation for possibly adding "Natural Naming".

So mention the word "enum" in the documentation, so people looking for
that will find the categorical data support? :-)
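For what it's worth, the distinction is easy to show in plain Python (the
names here are just illustrative): "categorical" means labels stored as
integer codes plus a level table, while "enum" means integers given
convenient names:

  # "Categorical": a pile of strings treated as discrete levels.
  levels = ["HIGH", "LOW"]              # level table
  codes = [1, 1, 0]                     # compact integer storage
  print([levels[c] for c in codes])     # ['LOW', 'LOW', 'HIGH']

  # "Enum": a pile of numbers given convenient names ("natural naming"),
  # so code can say Skill.HIGH instead of a bare 1.
  class Skill:
      LOW = 0
      HIGH = 1

  print(Skill.HIGH)                     # 1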

>> I'm disturbed to see you adding special cases to the core ufunc
>> dispatch machinery for these things. I'm -1 on that. We should clean
>> up the generic ufunc machinery so that it doesn't need special cases
>> to handle adding a simple type like this.
>
> This could certainly be improved, I agree.

I don't want to be Mr. Grumpypants here, but I do want to make sure
we're speaking the same language: what "-1" means is "I consider this
a show-stopper and will oppose merging any code that does not improve
on this". (Of course you also always have the option of trying to
change my mind. Even Mr. Grumpypants can be swayed by logic!)

>> I'm also worried that I still don't see any signs that you're working
>> with the downstream libraries that this functionality is intended to
>> be useful for, like the various HDF5 libraries and pandas. I really
>> don't think this functionality can be merged to numpy until we have
>> affirmative statements from those developers that they are excited
>> about it and will use it, and since they're busy people, it's pretty
>> much your job to track them down and make sure that your code will
>> solve their problems.
>
> Francesc is certainly aware of this work, and I emailed Wes earlier this
> week, I probably should have mentioned that, though. Hopefully they will
> have time to contribute their thoughts. I also imagine Travis can speak
> on behalf of the users he has interacted with over the last several
> years who have requested this feature but don't happen to follow
> mailing lists.

I'm glad Francesc and Wes are aware of the work, but my point was that
that isn't enough. So if I were in your position and hoping to get
this code merged, I'd be trying to figure out how to get them more
actively on board?

-N


