[Numpy-discussion] Custom dtypes without C -- or, a standard ndarray-like type

Nathaniel Smith njs at pobox.com
Sun Sep 21 23:31:30 EDT 2014

On Sun, Sep 21, 2014 at 7:50 PM, Stephan Hoyer <shoyer at gmail.com> wrote:
> pandas has some hacks to support custom types of data that numpy can't
> handle well enough or at all. Examples include datetime and Categorical [1],
> and others like GeoArray [2] that haven't made it into pandas yet.
> Most of these look like numpy arrays but with custom dtypes and
> type-specific methods/properties. But clearly nobody is particularly excited
> about writing the C necessary to implement custom dtypes [3]. Nor do
> we need the ndarray ABI.
> In many cases, writing C may not actually even be necessary for performance
> reasons, e.g., categorical can be fast enough just by wrapping an integer
> ndarray for the internal storage and using vectorized operations. And even
> if it is necessary, I think we'd all rather write Cython than C.
> It's great for pandas to write its own ndarray-like wrappers (*not*
> subclasses) that work with pandas, but it's a shame that there isn't a
> standard interface like the ndarray to make these arrays usable for the
> rest of the scientific Python ecosystem. For example, pandas has loads of
> fixes for np.datetime64, but nobody seems to be up for porting them to numpy
> (I doubt it would be easy).

Writing them in the first place probably wasn't easy either :-). I
don't really know why pandas spends so much effort on reimplementing
stuff and papering over numpy limitations instead of fixing things
upstream so that everyone can benefit. I assume they have reasons, and
I could make some general guesses at what some of them might be, but
if you want to know what they are -- which is presumably the first
step in changing the situation -- you'll have to ask them, not us :-).

> I know these sorts of concerns are not new, but I wish I had a sense of what
> the solution looks like. Is anyone actively working on these issues? Does
> the fix belong in numpy, pandas, blaze or a new project? I'd love to get a
> sense of where things stand and how I could help -- without writing any C
> :).

I think there are three parts:

For stuff that's literally just fixing bugs in functionality numpy
already has, we'd certainly be happy to accept those bug fixes.
Probably there are things we can do to make this easier, I dunno. I'd
love to see some of numpy's internals moving into Cython to make them
easier to hack on, but this won't be simple because right now using
Cython to implement a module is really an all-or-nothing affair;
making it possible to mix Cython with numpy's existing C code will
require upstream changes in Cython.

For cases where people genuinely want to implement a new array-like
type (e.g. DataFrame or scipy.sparse), numpy provides a fair
amount of support for this already (e.g., the various hooks that allow
things like np.asarray(mydf) or np.sin(mydf) to work), and we're
working on adding more over time (e.g., __numpy_ufunc__).
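To illustrate the kind of hook being described: a minimal sketch, not from the original thread, of how the existing `__array__` protocol lets `np.asarray()` and ufuncs like `np.sin()` accept a non-subclass wrapper (the class name `MyWrapper` is hypothetical):

```python
import numpy as np

class MyWrapper:
    """Hypothetical ndarray-like wrapper -- *not* an ndarray subclass."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def __array__(self, dtype=None):
        # numpy calls this hook from np.asarray(), and ufuncs fall back
        # to it when coercing unknown objects to arrays.
        if dtype is not None:
            return self._data.astype(dtype)
        return self._data

w = MyWrapper([0.0, np.pi / 2])
print(np.asarray(w))  # converted via the __array__ hook
print(np.sin(w))      # ufuncs coerce through the same hook
```

Note this gives coercion only; the wrapper still has to implement its own methods, which is exactly why reimplementing most of the ndarray interface is such a large job.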

My feeling though is that in most of the cases you mention,
implementing a new array-like type is huge overkill. ndarray's
interface is vast and reimplementing even 90% of it is a huge effort.
For most of the cases that people seem to run into in practice, the
solution is to enhance numpy's dtype interface so that it's possible
for mere mortals to implement new dtypes, e.g. by just subclassing
np.dtype. This is totally doable and would enable a ton of
awesomeness, but it requires someone with the time to sit down and
work on it, and no-one has volunteered yet. Unfortunately it does
require hacking on C code though.
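The "wrap an integer ndarray" approach mentioned earlier in the thread can be sketched in pure Python; this is a hypothetical toy `Categorical`, not pandas's actual implementation, but it shows why vectorized operations on the integer codes can be fast enough without any C:

```python
import numpy as np

class Categorical:
    """Toy categorical: values stored as integer codes into a
    sorted array of unique categories."""

    def __init__(self, values):
        # np.unique gives the sorted categories plus, with
        # return_inverse=True, an integer code for each element.
        self.categories, self.codes = np.unique(values, return_inverse=True)

    def __eq__(self, other):
        # Comparison against a single category: look up its code once,
        # then do one vectorized integer comparison on the codes.
        idx = np.searchsorted(self.categories, other)
        if idx < len(self.categories) and self.categories[idx] == other:
            return self.codes == idx
        return np.zeros(len(self.codes), dtype=bool)

    def __len__(self):
        return len(self.codes)

    def values(self):
        # Reconstruct the original values by fancy-indexing with the codes.
        return self.categories[self.codes]

c = Categorical(["a", "b", "a", "c"])
print(c.codes)     # compact integer storage
print(c == "a")    # vectorized boolean mask
```

Everything here is numpy operating on plain integer arrays, which is the sense in which a dtype-level hook, rather than a full array reimplementation, would be enough for cases like this.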

Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
