[Numpy-discussion] Custom dtypes without C -- or, a standard ndarray-like type
jeffreback at gmail.com
Mon Sep 22 10:34:55 EDT 2014
Hopefully this is not TL;DR!
There are 3 'dtype'-likes that exist in pandas that could in theory mostly
be migrated back to numpy. These currently exist as the .values, in other words
the object to which pandas defers data storage and computation for
some/most operations.
1) SparseArray: This is the basis for SparseSeries. It is ndarray-like (it's
actually an ndarray subclass) and optimized for the 1-d case. My guess is
that @wesm <https://github.com/wesm> created this because it a) didn't
exist in numpy, and b) he didn't want scipy as an explicit dependency (at the
time), late 2011.
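To make the idea concrete, here is a toy sketch of what a 1-d sparse array boils down to (this is illustrative only; the class name and layout are mine, not pandas' actual SparseArray implementation):

```python
import numpy as np

class SimpleSparseArray:
    """Toy 1-d sparse array: store only the non-fill values
    plus their integer positions, instead of the full dense array."""

    def __init__(self, dense, fill_value=0):
        dense = np.asarray(dense)
        mask = dense != fill_value
        self.sp_index = np.flatnonzero(mask)   # positions of stored values
        self.sp_values = dense[mask]           # the stored values themselves
        self.fill_value = fill_value
        self.length = len(dense)

    def to_dense(self):
        """Reconstruct the full dense ndarray."""
        out = np.full(self.length, self.fill_value,
                      dtype=self.sp_values.dtype)
        out[self.sp_index] = self.sp_values
        return out

arr = SimpleSparseArray([0, 0, 3, 0, 5])
# only the values 3 and 5 (and their positions 2 and 4) are stored
```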
2) datetime support: This is not a target dtype per se, but really a
reimplementation on top of datetime64[ns], with the associated scalar
Timestamp, which is a proper subclass of datetime.datetime. I believe @wesm
<https://github.com/wesm> created this because numpy datetime support was
(and still is to some extent) just completely broken (though better in
1.7+). It doesn't support proper timezones, the display is always in the
local timezone, and the scalar type (np.datetime64) is not extensible at
all (e.g. it's not easy to have custom printing or parsing). These are
all well known by the numpy community and have seen some recent proposals.
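The scalar-type complaint can be seen directly with just numpy and the stdlib: an np.datetime64 scalar is not a datetime.datetime, so it can't be passed to APIs expecting stdlib datetimes, which is exactly why a proper subclass like Timestamp is useful:

```python
import datetime
import numpy as np

ts64 = np.datetime64('2014-09-22T10:34:55')

# np.datetime64 scalars do not subclass datetime.datetime, so they
# can't be used where a stdlib datetime is expected
print(isinstance(ts64, datetime.datetime))   # False

# you have to explicitly convert to get back into stdlib-land
py_dt = ts64.item()
print(type(py_dt))                           # datetime.datetime
```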
3) pd.Categorical: This was another class wesm wrote several years ago. It
actually *could* be a numpy subclass, though it's a bit awkward because it's
really a numpy-like class that contains 2 ndarray-like arrays, and is
more appropriately implemented as a container of multiple ndarrays.
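The "container of 2 arrays" structure is easy to sketch (again a toy illustration, not pandas' actual implementation): integer codes plus the array of category labels:

```python
import numpy as np

class SimpleCategorical:
    """Toy categorical as a *container* of two arrays rather than an
    ndarray subclass: integer codes plus the unique category labels."""

    def __init__(self, values):
        # categories: sorted unique labels; codes: index into categories
        self.categories, self.codes = np.unique(values, return_inverse=True)

    def to_dense(self):
        """Expand codes back into the original label array."""
        return self.categories[self.codes]

cat = SimpleCategorical(['a', 'b', 'a', 'c'])
# stored as categories ['a', 'b', 'c'] and codes [0, 1, 0, 2]
```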
So when we added support for Categoricals recently, why didn't we, say, try
to push a categorical dtype? I think there are several reasons, in no
particular order: pd.Categorical is really a container of multiple ndarrays,
and is only ndarray-like. Further, its API is somewhat constrained. It was
simpler to make a Python container class rather than try to subclass ndarray
and basically override / throw out many methods (as a lot of computation
methods simply don't make sense between 2 categoricals). You can make a
case that this *should not* be in numpy for this reason.
The changes in pandas for the 3 cases outlined above were mostly about how
to integrate these with the top-level containers (Series/DataFrame), rather
than actually writing / rewriting a new dtype for an ndarray class. We
always try to reuse, so we just try to extend the ndarray-like rather than
create a new one from scratch.
Getting, for example, a categorical dtype into numpy would probably take a
pretty long cycle time. I think you need a champion for new features to
really push them. It hasn't happened with datetime and that's been a while
(of course it's possible that pandas diverted some of this need).
API design: I think this is a big issue actually. When I added
Categorical container support, I didn't want to change the API of
Categorical much (and it pretty much worked out that way, mainly adding
to it). So, say we took the path of assuming that numpy would have a nice
categorical dtype. We would almost certainly have to wrap it in
something to provide needed functionality that would necessarily be
missing in an initial version (of course eventually that may not be the case).
So the 'nobody wants to write in C' argument is true for datetimes, but
not for SparseArray/Categorical. In fact much of that code is just
calling out to numpy (though there is some cython code too).
From a performance perspective, numpy needs a really good hashtable in
order to support proper factorizing, which @wesm
<https://github.com/wesm> co-opted klib to do (see this thread for
a discussion on this).
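A pure-Python sketch of what hashtable-based factorization does (pandas' real version is C via klib; this toy version just uses a dict as the hashtable):

```python
import numpy as np

def factorize(values):
    """Toy factorize: map each value to an integer code in
    first-seen order, using a dict as the hashtable."""
    table = {}                                  # value -> code
    codes = np.empty(len(values), dtype=np.intp)
    for i, v in enumerate(values):
        # setdefault inserts the next code only if v is unseen
        codes[i] = table.setdefault(v, len(table))
    uniques = np.array(list(table))             # first-seen order
    return codes, uniques

codes, uniques = factorize(['b', 'a', 'b', 'c'])
# codes -> [0, 1, 0, 2], uniques -> ['b', 'a', 'c']
```

The whole point of doing this in C with a fast hashtable is that the Python loop above is the bottleneck for large arrays.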
So I know I am repeating myself, but it comes down to this: the
API/interface of the delegated methods needs to be defined. For ndarrays it
is long established and well known, so it's easy to gear pandas to that.
However, with a *newer* type that is not the case, so pandas can easily
decide, hey, this is the most correct behavior, let's do it this way;
nothing to break, no back compat needed.
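To illustrate what "the delegated interface" means in practice, here is a minimal ndarray-like object a container could defer to. The method set shown is my own illustration of the kind of contract involved, not pandas' actual requirements:

```python
import numpy as np

class MinimalArrayLike:
    """Sketch of a tiny ndarray-like interface that a container
    (Series/DataFrame-style) could delegate storage to."""

    def __init__(self, data):
        self._data = np.asarray(data)

    @property
    def dtype(self):
        return self._data.dtype

    def __len__(self):
        return len(self._data)

    def __getitem__(self, key):
        return self._data[key]

    def __array__(self, dtype=None):
        # lets np.asarray() (and thus most numpy functions) accept
        # this object transparently
        return np.asarray(self._data, dtype=dtype)

arr = MinimalArrayLike([1, 2, 3])
```

For ndarray itself this contract is long settled; the point above is that for a brand-new type, whoever writes it first gets to define it.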
On Sun, Sep 21, 2014 at 11:31 PM, Nathaniel Smith <njs at pobox.com> wrote:
> On Sun, Sep 21, 2014 at 7:50 PM, Stephan Hoyer <shoyer at gmail.com> wrote:
> > pandas has some hacks to support custom types of data which numpy doesn't
> > handle well enough or at all. Examples include datetime and Categorical,
> > and others like GeoArray that haven't made it into pandas yet.
> > Most of these look like numpy arrays but with custom dtypes and type
> > specific methods/properties. But clearly nobody is particularly excited
> > about writing the C necessary to implement custom dtypes. Nor is it clear
> > we need the ndarray ABI.
> > In many cases, writing C may not actually even be necessary for performance
> > reasons, e.g., categorical can be fast enough just by wrapping an integer
> > ndarray for the internal storage and using vectorized operations. And
> > if it is necessary, I think we'd all rather write Cython than C.
> > It's great for pandas to write its own ndarray-like wrappers (*not*
> > subclasses) that work with pandas, but it's a shame that there isn't a
> > standard interface like the ndarray to make these arrays usable for the
> > rest of the scientific Python ecosystem. For example, pandas has loads of
> > fixes for np.datetime64, but nobody seems to be up for porting them to
> > numpy (I doubt it would be easy).
> Writing them in the first place probably wasn't easy either :-). I
> don't really know why pandas spends so much effort on reimplementing
> stuff and papering over numpy limitations instead of fixing things
> upstream so that everyone can benefit. I assume they have reasons, and
> I could make some general guesses at what some of them might be, but
> if you want to know what they are -- which is presumably the first
> step in changing the situation -- you'll have to ask them, not us :-).
> > I know these sorts of concerns are not new, but I wish I had a sense of
> > what the solution looks like. Is anyone actively working on these issues?
> > Does the fix belong in numpy, pandas, blaze or a new project? I'd love to
> > get a sense of where things stand and how I could help -- without writing
> > any C :).
> I think there are three parts:
> For stuff that's literally just fixing bugs in stuff that numpy
> already has, then we'd certainly be happy to accept those bug fixes.
> Probably there are things we can do to make this easier, I dunno. I'd
> love to see some of numpy's internals moving into Cython to make them
> easier to hack on, but this won't be simple because right now using
> Cython to implement a module is really an all-or-nothing affair;
> making it possible to mix Cython with numpy's existing C code will
> require upstream changes in Cython.
> For cases where people genuinely want to implement a new array-like
> type (e.g. DataFrame or scipy.sparse), numpy provides a fair
> amount of support for this already (e.g., the various hooks that allow
> things like np.asarray(mydf) or np.sin(mydf) to work), and we're
> working on adding more over time (e.g., __numpy_ufunc__).
> My feeling though is that in most of the cases you mention,
> implementing a new array-like type is huge overkill. ndarray's
> interface is vast and reimplementing even 90% of it is a huge effort.
> For most of the cases that people seem to run into in practice, the
> solution is to enhance numpy's dtype interface so that it's possible
> for mere mortals to implement new dtypes, e.g. by just subclassing
> np.dtype. This is totally doable and would enable a ton of
> awesomeness, but it requires someone with the time to sit down and
> work on it, and no-one has volunteered yet. Unfortunately it does
> require hacking on C code though.
> Nathaniel J. Smith
> Postdoctoral researcher - Informatics - University of Edinburgh
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org