
On Wed, 2015-08-26 at 00:05 -0700, Nathaniel Smith wrote:
On Tue, Aug 25, 2015 at 5:53 PM, David Cournapeau <cournape@gmail.com> wrote:
Thanks for the good summary, Nathaniel.
Regarding the dtype machinery, I agree casting is the hardest part. Unless the code has changed dramatically, this was the main reason why you could not make most of the dtypes separate from the numpy codebase (I tried to move the datetime dtype out of multiarray into a separate C extension some years ago). Being able to separate the dtypes from the multiarray module would be an obvious way to drive the internal API change.
For practical reasons I don't imagine we'll ever want to actually move the core dtypes out of multiarray -- if nothing else, they will always remain a little bit special, e.g. np.array([1.0, 2.0]) will just "know" that it should use the float64 dtype. But yeah, in general a good heuristic would be that -- aside from a few limited cases like that -- we want to make built-in dtypes and user-defined dtypes use the same APIs.
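To make that asymmetry concrete, a minimal sketch (standard numpy behaviour; "my_dtype" in the last comment is just a hypothetical placeholder for a dtype registered through the user-dtype C API, not an existing name):

import numpy as np

# Built-in dtypes are special-cased by type inference: a plain list of
# Python floats silently becomes float64.
a = np.array([1.0, 2.0])
print(a.dtype)          # float64

# Python ints are likewise mapped to a default integer dtype
# (which one is platform-dependent).
b = np.array([1, 2, 3])
print(b.dtype)          # e.g. int64 on most 64-bit Linux/macOS builds

# A user-defined dtype is never picked up by this inference; it always has
# to be spelled out explicitly, e.g. np.array(data, dtype=my_dtype).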
Well, casting is conceptually the hardest part. Marrying it to the rest of numpy is probably just as hard ;).

At the risk of not having thought this through enough, maybe some points about the general discussion. I think I would like some more clarity about what we want and especially *need* [1]. From SciPy, there were two things I particularly remember:

1. the dtype/scalar issue
2. making an interface to make interaction with array-likes more sane (this I think can go quite far, and we are already going part of the way)

The dtypes/scalars seem a particularly dark corner of numpy, and if it is feasible for us to replace it with something new, then I would be willing to accept some breaks for it (admittedly, given protest, I would back down from that and another solution would be needed). The point for me is: I currently think a dtype/scalar cleanup could get numpy a long way, especially from the point of view of downstream packages. Of course it would be harder to do in numpy than in something new, but it should also be of much more immediate use. (A small illustration of that corner follows after the footnotes.)

Maybe I am going a bit too far with this right now, but I could imagine that if we cannot clean up the dtypes/scalars, numpy may indeed be doomed, or at least a brick slowing down a lot of other people. And if it is not possible to do this without a numpy 2, then likely that is the way to go. But I am not convinced we should aim to fix all the other stuff at the same time; I am afraid it would just accumulate and grow over everyone's heads.

In other words, if we can muster the resources, I would like to see this problem attacked within numpy. If this proves impossible, a new dtype abstraction may well be reason for a numpy 2, or be used by DyND or similar? But I do believe we should not give up on numpy here from the start; at least I do not see a compelling reason to do so. Giving up on numpy seems like the last way out, and much of the difference in opinion seems to me to be about whether we think this will clearly happen, or has already happened (or maybe whether it is too costly to do in numpy).

Cleaning it up would open doors to many things. Note that I think it would make the numpy source much less scary, because I think it is the one big piece of code that is maybe not clearly a separate chunk [2]. After making it sane, I would argue that numpy becomes much more maintainable and extensible -- from my current view, probably enough so for a long time. Also, I think it would give us an abstraction to make different/new projects work together better, and if done well enough, some grand new project set to replace numpy could reuse it.

Of course it is entirely possible that more things need to be changed in numpy and that some of them would be just as hard or even harder to do. But if we can identify this as the "one big thing that gets us 90%", then I refuse to give up hope of doing it in numpy just yet.

- Sebastian

[1] Travis has said quite a lot about it, but it is not yet clear to me what is a priority/real pain point. Take "datashape" for example. By now I think that the datashape is likely a good idea to make structured arrays nicer, since it moves the "structured" part into the array object and not the dtype, which makes sense to me. However, I am not convinced that the datashape is something that would make numpy a compelling amount better. In fact, I could imagine that for many things it would make it unnecessarily more complicated for users.
[2] Take indexing: I like to think I did not break that much when redoing it (except on purpose, which I hope did not create much trouble). In some sense, indexing was simple to redo because it does not overlap directly with anything else. If we get dtypes/scalars more separated, I think we would be at a point where such a redo is possible for pretty much any part of numpy.
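As a minimal illustration of the dtype/scalar corner mentioned above (nothing here is new behaviour, just standard numpy):

import numpy as np

# A 0-d array and an array scalar of the same dtype are different kinds of
# objects that mostly, but not always, behave the same.
a = np.array(1.0)           # 0-d ndarray, dtype float64
s = np.float64(1.0)         # array scalar, instance of np.float64

print(type(a))              # <class 'numpy.ndarray'>
print(type(s))              # <class 'numpy.float64'>

# The scalar *type* and the dtype are related but distinct objects.
print(np.dtype(np.float64) == a.dtype)   # True
print(isinstance(s, np.generic))         # True: scalars have their own hierarchy

# Indexing an array hands back a scalar, not a 0-d array.
print(type(np.arange(3.0)[0]))           # <class 'numpy.float64'>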
Regarding the use of cython in numpy, was there any discussion about the compilation/size cost of using cython, and about talking to the cython team to improve this? Or was that considered acceptable with current cython for numpy? I am convinced that cleanly separating the low-level parts from the Python C API plumbing would be the single most important thing one could do to make the codebase more amenable.
It's still more of a blue-sky idea than that... the discussion was more at the level of "is this something that is even worth trying to make work and seeing where the problems are?"
The big immediate problem, before we even get to code-size issues, would be that we would need to be able to compile a mix of .pyx files and .c files into a single .so, while Cython-generated code currently makes some strong assumptions about how each .pyx file will live in its own .so. From playing around with it, I suspect the first version of making this work will be klugey indeed. But yeah, the thing to do would be for someone to dig in, make the kluges, and then decide how to clean them up once you know where they are.
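A rough sketch of what "mix .pyx and .c into one .so" would look like at the build level -- the file names here are made up, and the real obstacle is exactly the module-init assumption described above:

# Hypothetical build sketch: compile one Cython layer plus existing C
# sources into a single extension module. File names are invented.
from setuptools import setup, Extension
from Cython.Build import cythonize

ext = Extension(
    "numpy.core.multiarray",                      # the single target .so
    sources=[
        "numpy/core/src/multiarray/hilevel.pyx",  # hypothetical Cython layer
        "numpy/core/src/multiarray/lowlevel.c",   # existing C internals
    ],
)

# cythonize() translates the .pyx and passes the .c files through, but the
# generated code assumes the .pyx owns the module init (PyInit_*), so the
# existing C-side module setup would have to be folded into it -- which is
# where the kluges start.
setup(ext_modules=cythonize([ext]))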
-n