[Numpy-discussion] Does a `mergesorted` function make sense?

Nathaniel Smith njs at pobox.com
Mon Sep 1 08:18:15 EDT 2014

On Mon, Sep 1, 2014 at 8:49 AM, Eelco Hoogendoorn
<hoogendoorn.eelco at gmail.com> wrote:
> Sure, I'd like to hash things out, but I would also like some
> preliminary feedback as to whether this is going in a direction anyone else
> sees the point of, if it conflicts with other plans, and indeed if we can
> agree that numpy is the right place for it; a point which I would very much
> like to defend. If there is some obvious no-go that I'm missing, I can do
> without the drudgery of writing proper documentation ;).
> As for whether this belongs in numpy: yes, I would say so. There is the
> extension of functionality to functions already in numpy, which is a
> no-brainer (it need not cost anything performance-wise, and I've needed
> unique graph edges many, many times), and there is the grouping
> functionality, which is the main novelty.
> However, note that the grouping functionality itself is a very small
> addition, just a few hundred lines of pure Python, given that the indexing logic
> has been factored out of the classic arraysetops. At least from a developer's
> perspective, it very much feels like a logical extension of the same
> 'thing'.
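
For concreteness, here is a rough sketch of the kind of sort-based indexing
being discussed. It is not the code from the proposal: `group_sum` and
`unique_edges` are hypothetical names chosen for illustration, and both
assume non-empty input. The point is only that grouping and unique-rows
share the same "sort, then find run boundaries" core.

```python
import numpy as np


def group_sum(keys, values):
    """Sum `values` within each group of equal `keys` (hypothetical helper)."""
    keys = np.asarray(keys)
    values = np.asarray(values)
    # Stable sort so that equal keys end up in contiguous runs.
    order = np.argsort(keys, kind="mergesort")
    sorted_keys = keys[order]
    sorted_values = values[order]
    # A run starts wherever a sorted key differs from its predecessor.
    is_start = np.concatenate(([True], sorted_keys[1:] != sorted_keys[:-1]))
    starts = np.flatnonzero(is_start)
    # np.add.reduceat sums each slice [starts[i], starts[i + 1]).
    return sorted_keys[starts], np.add.reduceat(sorted_values, starts)


def unique_edges(edges):
    """Unique undirected edges from an (n, 2) integer array (hypothetical helper)."""
    # Order each pair so that (u, v) and (v, u) compare equal.
    edges = np.sort(np.asarray(edges), axis=1)
    # Sort rows lexicographically: primary key column 0, secondary column 1.
    order = np.lexsort((edges[:, 1], edges[:, 0]))
    sorted_edges = edges[order]
    # Keep the first row of every run of identical rows.
    keep = np.concatenate(
        ([True], np.any(sorted_edges[1:] != sorted_edges[:-1], axis=1))
    )
    return sorted_edges[keep]


group_keys, group_totals = group_sum(["a", "b", "a", "c", "b"], [1, 2, 3, 4, 5])
# group_keys -> ['a', 'b', 'c'], group_totals -> [4, 7, 4]

edge_list = unique_edges([[0, 1], [1, 0], [2, 1], [1, 2], [0, 2]])
# edge_list -> [[0, 1], [0, 2], [1, 2]]
```

Even in a sketch this small there are API decisions hiding (sorted versus
first-seen group order, how to handle empty input, which reductions to
expose), which is the sort of thing that is hard to settle without real use.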

My 2 cents: I definitely agree that this is very useful fundamental
functionality, and it would be great if numpy had a solution for it
out of the box. My main concern is that this is a fairly complicated
set of functionality and there are a lot of small decisions to be made
in setting up the API for it. IME it's very hard to just read through
an API like this and reason out the best way to do it by pure logic;
usually it needs to get banged on for a bit in real uses before it
becomes clear what the right set of trade-offs is. And numpy itself is
not a great environment for these kinds of iterations. So, IMO the main
challenge is: how do we get the functionality into a state where we
can convince ourselves that it'll be supportable in numpy
indefinitely, and not need to be replaced in a year or two?

Some things that might help with this convincing:
- releasing it as a small standalone package on PyPI and getting some
real users to bang on it
- any real code written against the APIs
- feedback from the pandas community since they've spent a lot of time
working on these issues
- ...?


Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
