[Numpy-discussion] Custom dtypes without C -- or, a standard ndarray-like type

David Cournapeau cournape at gmail.com
Tue Sep 23 03:19:12 EDT 2014


On Mon, Sep 22, 2014 at 4:31 AM, Nathaniel Smith <njs at pobox.com> wrote:

> On Sun, Sep 21, 2014 at 7:50 PM, Stephan Hoyer <shoyer at gmail.com> wrote:
> > pandas has some hacks to support custom types of data that numpy can't
> > handle well enough or at all. Examples include datetime and Categorical [1],
> > and others like GeoArray [2] that haven't made it into pandas yet.
> >
> > Most of these look like numpy arrays but with custom dtypes and
> > type-specific methods/properties. But clearly nobody is particularly excited
> > about writing the C necessary to implement custom dtypes [3]. Nor do
> > we need the ndarray ABI.
> >
> > In many cases, writing C may not actually even be necessary for performance
> > reasons, e.g., categorical can be fast enough just by wrapping an integer
> > ndarray for the internal storage and using vectorized operations. And even
> > if it is necessary, I think we'd all rather write Cython than C.
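
A minimal sketch of the "wrap an integer ndarray" idea, in pure Python. The class name SimpleCategorical and its methods are hypothetical, chosen only for illustration; this is not how pandas implements Categorical. It assumes the labels are sortable so that np.unique can build the category table.

```python
import numpy as np


class SimpleCategorical:
    """Hypothetical categorical container: integer codes plus a category table."""

    def __init__(self, values):
        # np.unique returns the sorted distinct values and, with
        # return_inverse=True, the integer code of each element.
        self.categories, codes = np.unique(np.asarray(values), return_inverse=True)
        self.codes = codes.astype(np.int64).ravel()

    def __len__(self):
        return len(self.codes)

    def mask(self, label):
        # Select one label with a single vectorized integer comparison.
        matches = np.flatnonzero(self.categories == label)
        if matches.size == 0:
            return np.zeros(len(self), dtype=bool)
        return self.codes == matches[0]

    def value_counts(self):
        # np.bincount counts occurrences of each integer code in one pass.
        counts = np.bincount(self.codes, minlength=len(self.categories))
        return dict(zip(self.categories.tolist(), counts.tolist()))

    def to_numpy(self):
        # Decode by fancy-indexing the category table with the codes.
        return self.categories[self.codes]


cat = SimpleCategorical(["a", "b", "a", "c", "a"])
print(cat.codes)
print(cat.value_counts())
```

All the per-element work happens inside numpy calls on the integer codes, which is why no C or Cython is needed for this much.
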
> >
> > It's great for pandas to write its own ndarray-like wrappers (*not*
> > subclasses) that work with pandas, but it's a shame that there isn't a
> > standard interface like the ndarray to make these arrays useable for the
> > rest of the scientific Python ecosystem. For example, pandas has loads of
> > fixes for np.datetime64, but nobody seems to be up for porting them to numpy
> > (I doubt it would be easy).
>
> Writing them in the first place probably wasn't easy either :-). I
> don't really know why pandas spends so much effort on reimplementing
> stuff and papering over numpy limitations instead of fixing things
> upstream so that everyone can benefit. I assume they have reasons, and
> I could make some general guesses at what some of them might be, but
> if you want to know what they are -- which is presumably the first
> step in changing the situation -- you'll have to ask them, not us :-).
>
> > I know these sorts of concerns are not new, but I wish I had a sense of what
> > the solution looks like. Is anyone actively working on these issues? Does
> > the fix belong in numpy, pandas, blaze or a new project? I'd love to get a
> > sense of where things stand and how I could help -- without writing any C
> > :).
>
> I think there are three parts:
>
> For stuff that's literally just fixing bugs in stuff that numpy
> already has, then we'd certainly be happy to accept those bug fixes.
> Probably there are things we can do to make this easier, I dunno. I'd
> love to see some of numpy's internals moving into Cython to make them
> easier to hack on, but this won't be simple because right now using
> Cython to implement a module is really an all-or-nothing affair;
> making it possible to mix Cython with numpy's existing C code will
> require upstream changes in Cython.


> For cases where people genuinely want to implement new array-like
> types (e.g. DataFrame or scipy.sparse) then numpy provides a fair
> amount of support for this already (e.g., the various hooks that allow
> things like np.asarray(mydf) or np.sin(mydf) to work), and we're
> working on adding more over time (e.g., __numpy_ufunc__).
>
> My feeling though is that in most of the cases you mention,
> implementing a new array-like type is huge overkill. ndarray's
> interface is vast and reimplementing even 90% of it is a huge effort.
> For most of the cases that people seem to run into in practice, the
> solution is to enhance numpy's dtype interface so that it's possible
> for mere mortals to implement new dtypes, e.g. by just subclassing
> np.dtype. This is totally doable and would enable a ton of
> awesomeness, but it requires someone with the time to sit down and
> work on it, and no-one has volunteered yet. Unfortunately it does
> require hacking on C code though.
>
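
To make the hooks mentioned above concrete (np.asarray(mydf), np.sin(mydf)), here is a rough sketch of an array-like that wraps an ndarray without subclassing it. The class name Wrapped is hypothetical. Note one assumption about naming: the __numpy_ufunc__ hook discussed above is spelled __array_ufunc__ in NumPy 1.13 and later, and that is the spelling used here.

```python
import numpy as np


class Wrapped:
    """Hypothetical array-like that wraps an ndarray without subclassing it."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array__(self, dtype=None, copy=None):
        # Called by np.asarray(obj) to obtain a plain ndarray.
        arr = self.data if dtype is None else self.data.astype(dtype)
        return arr.copy() if copy else arr

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Called by ufuncs such as np.sin(obj). This sketch only handles
        # plain calls of single-output ufuncs without out=; everything
        # else is declined by returning NotImplemented.
        if method != "__call__" or "out" in kwargs or ufunc.nout != 1:
            return NotImplemented
        unwrapped = [x.data if isinstance(x, Wrapped) else x for x in inputs]
        return Wrapped(ufunc(*unwrapped, **kwargs))


w = Wrapped([0.0, 1.0])
print(type(np.asarray(w)))  # a plain ndarray
print(type(np.sin(w)))      # still a Wrapped instance
```

This covers two entry points only. The rest of ndarray's interface (indexing, reductions, methods) would still have to be written by hand, which is the "huge effort" referred to above.
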

While preparing my tutorial on NumPy C internals a year ago, I tried to get
a "basic" dtype implemented in Cython, and there were various issues even
if you wanted to do all of it in Cython (I can't remember the details now).
Solving this would be a good first step.

There were (and perhaps still are) also some issues regarding precedence in
ufuncs depending on the new dtype: numpy hardcodes that long double is the
highest-precision floating point type, for example, and there were similar
issues regarding datetime handling. This does not matter for completely new
types that don't require interactions with others (categorical?).
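
The built-in precedence can at least be observed from Python. The short sketch below only queries numpy's existing promotion rules; it does not show where they are hardcoded, and it says nothing about how a user-defined dtype would plug into them.

```python
import numpy as np

# Promotion among the built-in float types is decided inside numpy:
# long double comes out on top of any pairing with a narrower float.
widest = np.promote_types(np.float64, np.longdouble)
print(widest == np.dtype(np.longdouble))

# The same rules pick the result dtype of a ufunc call on mixed inputs.
a = np.ones(2, dtype=np.float32)
b = np.ones(2, dtype=np.longdouble)
result = np.add(a, b)
print(result.dtype == np.dtype(np.longdouble))
```

A new floating-point dtype wider than long double has no way to place itself above it in this ordering from the outside, which is the kind of issue described above.
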

Would it help to prepare a set of "implement your own dtype" notebooks? I
have a starting point from last year's tutorial (the corresponding slides
were never shown for lack of time).

David



>
> --
> Nathaniel J. Smith
> Postdoctoral researcher - Informatics - University of Edinburgh
> http://vorpus.org