numpy gsoc ideas (was: numpy gsoc topic idea: configurable algorithm precision and vector math library integration)

On Mon, Mar 3, 2014 at 7:20 PM, Julian Taylor <jtaylor.debian@googlemail.com> wrote:
hi,
as the numpy gsoc topic page is a little short on options, I was thinking about adding two topics for interested students. But as I have no experience with gsoc or mentoring, and the ideas are not very fleshed out yet, I'd like to ask if they might make sense at all:
1. configurable algorithm precision [...]

    with np.precmode(default="fast"):
        np.abs(complex_array)

or fast everything except sum and hypot:

    with np.precmode(default="fast", sum="kahan", hypot="standard"):
        np.sum(d)

[...]
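For readers skimming the proposal: np.precmode does not exist in NumPy; the snippet above is a proposed API. A minimal pure-Python sketch of what such a context manager might look like (all names and semantics here are hypothetical, purely for illustration):

```python
import contextlib

# Hypothetical global precision settings; "standard" mimics NumPy's
# default behaviour, "fast" would select lower-precision vectorized code.
_precision_modes = {"default": "standard"}

@contextlib.contextmanager
def precmode(**modes):
    """Temporarily override precision modes, e.g.
    precmode(default="fast", sum="kahan").
    Restores the previous settings on exit."""
    saved = dict(_precision_modes)
    _precision_modes.update(modes)
    try:
        yield
    finally:
        _precision_modes.clear()
        _precision_modes.update(saved)

def current_mode(op):
    # An operation falls back to the "default" entry when not overridden.
    return _precision_modes.get(op, _precision_modes["default"])
```

Under this sketch, inside `precmode(default="fast", sum="kahan")` a lookup for "sum" reports "kahan" while any other operation falls back to "fast"; on exit the previous settings are restored.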
Not a big fan of this one -- it seems like the bulk of the effort would be in figuring out a non-horrible API for exposing these things and getting consensus around it, which is not a good fit for the SoC structure.

I'm pretty nervous about the datetime proposal that's currently on the wiki, for similar reasons -- I'm not sure it's actually doable in the SoC context.
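The "kahan" mode mentioned above refers to Kahan (compensated) summation, a standard trick that carries a running error term so round-off does not accumulate with the number of addends; a minimal sketch:

```python
import math

def kahan_sum(values):
    """Compensated summation: track the low-order bits lost at each
    step in `c` and feed them back into the next addition."""
    total = 0.0
    c = 0.0  # running compensation for lost low-order bits
    for x in values:
        y = x - c
        t = total + y
        c = (t - total) - y  # algebraically zero; captures the round-off
        total = t
    return total

# Terms far below one ulp of the running sum: naive summation loses
# every one of them, compensated summation accumulates them.
vals = [1.0] + [1e-16] * 10
```

With these inputs, naive `sum(vals)` returns exactly 1.0 because each 1e-16 falls below half an ulp of 1.0, while `kahan_sum(vals)` stays within a couple of ulps of the correctly rounded result (`math.fsum`).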
2. vector math library integration
This is a great suggestion -- clear scope, clear benefit.

Two more ideas:

3. Using Cython in the numpy core

The numpy core contains tons of complicated C code implementing elaborate operations like indexing, casting, ufunc dispatch, etc. It would be really nice if we could use Cython to write some of these things. However, there is a practical problem: Cython assumes that each .pyx file generates a single compiled module with its own Cython-defined API. Numpy, however, contains a large number of .c files which are all compiled together into a single module, with its own home-brewed system for defining the public API. And we can't rewrite the whole thing. So for this to be viable, we would need some way to compile a bunch of .c *and .pyx* files together into a single module, and allow the .c and .pyx files to call each other. This might involve changes to Cython, some sort of clever post-processing or glue code to get existing Cython-generated source code to play nicely with the rest of numpy, or something else.

So this project would have the following goals, depending on how practical this turns out to be: (1) produce a hacky proof-of-concept system for doing the above, (2) turn the hacky proof-of-concept into something actually viable for use in real life (possibly this would require getting changes upstream into Cython, etc.), (3) use this system to actually port some interesting numpy code into Cython.

4. Pythonic dtypes

The current dtype system is klugey. It basically defines its own class system, in parallel to Python's, and unsurprisingly, this new class system is not as good. In particular, it has limitations around the storage of instance-specific data which rule out a large variety of interesting user-defined dtypes, and it forces us into some truly nasty hacks to support the built-in dtypes we do have. And it makes defining a new dtype much more complicated than defining a new Python class.

This project would be to implement a new dtype system for numpy, in which np.dtype becomes a near-empty base class, different dtypes (e.g., float64, float32) are simply different subclasses of np.dtype, and dtype objects are simply instances of these classes. Further enhancements would be to make it possible to define new dtypes in pure Python by subclassing np.dtype and implementing special methods for the various dtype operations, and to make it possible for ufunc loops to see the dtype objects.

This project would provide the key enabling piece for a wide variety of interesting new features: missing value support, better handling of strings and categorical data, unit handling, automatic differentiation, and probably a bunch more I'm forgetting right now.

If we get someone who's up to handling the dtype thing then I can mentor or co-mentor.

What do y'all think?

(I don't think I have access to update that wiki page -- or maybe I'm just not clever enough to figure out how -- so it would be helpful if someone who can, could?)

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
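To make the dtype proposal concrete, here is a pure-Python sketch of the class-based design being described: a near-empty base class, concrete dtypes as subclasses, and dtype objects as ordinary instances. Every name here is invented for illustration -- this is not NumPy's actual API:

```python
class DType:
    """Near-empty base class; concrete dtypes subclass it."""
    itemsize = None
    name = None

    def __repr__(self):
        return f"dtype({self.name})"

class Float64(DType):
    # A fixed-layout built-in dtype is just a plain subclass.
    itemsize = 8
    name = "float64"

class Categorical(DType):
    """A user-defined dtype carrying instance-specific data (the list
    of categories) -- exactly what the parallel class system in the
    current dtype machinery makes hard."""
    name = "categorical"

    def __init__(self, categories):
        self.categories = list(categories)
        self.itemsize = 8  # e.g. one integer code stored per element
```

Under such a design, per-instance state like `categories` comes for free from ordinary Python object semantics, and `isinstance(dt, DType)` replaces ad-hoc type-number checks.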

Nathaniel Smith <njs@pobox.com> wrote:
3. Using Cython in the numpy core
The numpy core contains tons of complicated C code implementing elaborate operations like indexing, casting, ufunc dispatch, etc. It would be really nice if we could use Cython to write some of these things.
So the idea of having NumPy as a pure C library in the core is abandoned?
However, there is a practical problem: Cython assumes that each .pyx file generates a single compiled module with its own Cython-defined API. Numpy, however, contains a large number of .c files which are all compiled together into a single module, with its own home-brewed system for defining the public API. And we can't rewrite the whole thing. So for this to be viable, we would need some way to compile a bunch of .c *and .pyx* files together into a single module, and allow the .c and .pyx files to call each other.
Cython takes care of that already.

http://docs.cython.org/src/userguide/sharing_declarations.html#cimport
http://docs.cython.org/src/userguide/external_C_code.html#using-cython-decla...

Sturla

On Thu, Mar 6, 2014 at 5:17 AM, Sturla Molden <sturla.molden@gmail.com> wrote:
Nathaniel Smith <njs@pobox.com> wrote:
3. Using Cython in the numpy core
The numpy core contains tons of complicated C code implementing elaborate operations like indexing, casting, ufunc dispatch, etc. It would be really nice if we could use Cython to write some of these things.
So the idea of having NumPy as a pure C library in the core is abandoned?
This question doesn't make sense to me, so I think I must be missing some context. Nothing is abandoned: this is one email by one person on one mailing list suggesting a project to explore the feasibility of something. And anyway, Cython is just a C code generator, similar in principle to (though vastly more sophisticated than) the ones we already use. It's not like we've ever promised our users that we'll keep the set of code generators we use internally stable.
However, there is a practical problem: Cython assumes that each .pyx file generates a single compiled module with its own Cython-defined API. Numpy, however, contains a large number of .c files which are all compiled together into a single module, with its own home-brewed system for defining the public API. And we can't rewrite the whole thing. So for this to be viable, we would need some way to compile a bunch of .c *and .pyx* files together into a single module, and allow the .c and .pyx files to call each other.
Cython takes care of that already.
http://docs.cython.org/src/userguide/sharing_declarations.html#cimport
http://docs.cython.org/src/userguide/external_C_code.html#using-cython-decla...
Linking multiple .c and .pyx files together into a single .so/.dll is much more complicated than just using 'cimport'. Try it if you don't believe me :-).

-n

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

On Wed, Mar 5, 2014 at 9:17 PM, Sturla Molden <sturla.molden@gmail.com>wrote:
we could use Cython to write some of these things.
So the idea of having NumPy as a pure C library in the core is abandoned?
And at some point there was the idea of a numpy_core library that could be used entirely independently of CPython. I think Enthought did some work on this for MS, to create a .NET numpy, maybe? I do still like that idea.... But there could be a "core" numpy and an "other stuff that is CPython-specific" layer that Cython would be great for.

-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R
7600 Sand Point Way NE
Seattle, WA 98115
(206) 526-6959 voice
(206) 526-6329 fax
(206) 526-6317 main reception
Chris.Barker@noaa.gov

On Wed, Mar 5, 2014 at 9:11 PM, Nathaniel Smith <njs@pobox.com> wrote:
[...]
3. Using Cython in the numpy core
The numpy core contains tons of complicated C code implementing elaborate operations like indexing, casting, ufunc dispatch, etc. It would be really nice if we could use Cython to write some of these things. However, there is a practical problem: Cython assumes that each .pyx file generates a single compiled module with its own Cython-defined API. Numpy, however, contains a large number of .c files which are all compiled together into a single module, with its own home-brewed system for defining the public API. And we can't rewrite the whole thing. So for this to be viable, we would need some way to compile a bunch of .c *and .pyx* files together into a single module, and allow the .c and .pyx files to call each other. This might involve changes to Cython, some sort of clever post-processing or glue code to get existing cython-generated source code to play nicely with the rest of numpy, or something else.
So this project would have the following goals, depending on how practical this turns out to be: (1) produce a hacky proof-of-concept system for doing the above, (2) turn the hacky proof-of-concept into something actually viable for use in real life (possibly this would require getting changes upstream into Cython, etc.), (3) use this system to actually port some interesting numpy code into cython.
Having to synchronise two projects may be hard for a GSoC, no? Otherwise, I am a bit worried about Cython being used on the current C code as is, because the core and the Python C API are so intertwined (especially in multiarray). Maybe one could use Cython on the non-core numpy parts that are still in C? It is not as sexy a project, though.

On Thu, Mar 6, 2014 at 9:11 AM, David Cournapeau <cournape@gmail.com> wrote:
On Wed, Mar 5, 2014 at 9:11 PM, Nathaniel Smith <njs@pobox.com> wrote:
So this project would have the following goals, depending on how practical this turns out to be: (1) produce a hacky proof-of-concept system for doing the above, (2) turn the hacky proof-of-concept into something actually viable for use in real life (possibly this would require getting changes upstream into Cython, etc.), (3) use this system to actually port some interesting numpy code into cython.
Having to synchronise two projects may be hard for a GSoC, no ?
Yeah, if someone is interested in this it would be nice to get someone from Cython involved too. But that's why the primary goal is to produce a proof-of-concept -- even if all that comes out is that we learn that this cannot be done in an acceptable manner, that's still a successful (albeit disappointing) result.
Otherwise, I am a bit worried about Cython being used on the current C code as is, because the core and the Python C API are so intertwined (especially in multiarray).
I don't understand this objection. The whole advantage of Cython is that it makes it much, much easier to write code that involves intertwining complex algorithms and heavy use of the Python C API :-). There's tons of bug-prone spaghetti in numpy for doing boring things like refcounting, exception passing, and argument parsing.

-n

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

On Thu, Mar 6, 2014 at 1:59 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Thu, Mar 6, 2014 at 9:11 AM, David Cournapeau <cournape@gmail.com> wrote:
On Wed, Mar 5, 2014 at 9:11 PM, Nathaniel Smith <njs@pobox.com> wrote:
So this project would have the following goals, depending on how practical this turns out to be: (1) produce a hacky proof-of-concept system for doing the above, (2) turn the hacky proof-of-concept into something actually viable for use in real life (possibly this would require getting changes upstream into Cython, etc.), (3) use this system to actually port some interesting numpy code into cython.
Having to synchronise two projects may be hard for a GSoC, no ?
Yeah, if someone is interested in this it would be nice to get someone from Cython involved too. But that's why the primary goal is to produce a proof-of-concept -- even if all that comes out is that we learn that this cannot be done in an acceptable manner, that's still a successful (albeit disappointing) result.
Otherwise, I am a bit worried about Cython being used on the current C code as is, because the core and the Python C API are so intertwined (especially in multiarray).
I don't understand this objection. The whole advantage of Cython is that it makes it much, much easier to write code that involves intertwining complex algorithms and heavy use of the Python C API :-).
There's tons of bug-prone spaghetti in numpy for doing boring things
like refcounting, exception passing, and argument parsing.
No argument there -- doing refcounting etc. manually is a waste of time. Ideally, Cython would be used for the boring stuff and we would keep C for the low-level machinery, but the current code doesn't cleanly separate those two layers (there is no simple C API for indexing, ufuncs, etc.). I am concerned about Cython making that distinction even blurrier.

David

On Wed, Mar 5, 2014 at 2:11 PM, Nathaniel Smith <njs@pobox.com> wrote:
[...]
3. Using Cython in the numpy core
The numpy core contains tons of complicated C code implementing elaborate operations like indexing, casting, ufunc dispatch, etc. It would be really nice if we could use Cython to write some of these things. However, there is a practical problem: Cython assumes that each .pyx file generates a single compiled module with its own Cython-defined API. Numpy, however, contains a large number of .c files which are all compiled together into a single module, with its own home-brewed system for defining the public API. And we can't rewrite the whole thing. So for this to be viable, we would need some way to compile a bunch of .c *and .pyx* files together into a single module, and allow the .c and .pyx files to call each other. This might involve changes to Cython, some sort of clever post-processing or glue code to get existing cython-generated source code to play nicely with the rest of numpy, or something else.
So this project would have the following goals, depending on how practical this turns out to be: (1) produce a hacky proof-of-concept system for doing the above, (2) turn the hacky proof-of-concept into something actually viable for use in real life (possibly this would require getting changes upstream into Cython, etc.), (3) use this system to actually port some interesting numpy code into cython.
If I were to rewrite some Numpy C code in Cython, I'd try _compiled_base.c first.
[...]
Another possibility would be plugin random number generators for numpy.random. That would require a student with a good deal of expertise, though; design takes more experience than coding.

Chuck
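To sketch what "plugin" generators might mean: a minimal core-generator interface that numpy.random front-end code could dispatch to, with a toy xorshift32 as one plugin. The interface and all names here are hypothetical, invented for illustration (and the toy generator is not of library quality):

```python
class CoreGenerator:
    """Hypothetical plugin interface: a source of raw 32-bit integers."""
    def next_uint32(self):
        raise NotImplementedError

class Xorshift32(CoreGenerator):
    """Toy xorshift plugin -- fast, deterministic, tiny state."""
    def __init__(self, seed=1):
        self.state = (seed & 0xFFFFFFFF) or 1  # state must be nonzero

    def next_uint32(self):
        x = self.state
        x ^= (x << 13) & 0xFFFFFFFF
        x ^= x >> 17
        x ^= (x << 5) & 0xFFFFFFFF
        self.state = x
        return x

def random_floats(gen, n):
    """Distribution code is written against the interface, not the
    plugin, so generators can be swapped without touching it."""
    return [gen.next_uint32() / 2**32 for _ in range(n)]
```

The design point is the split: swapping in a different CoreGenerator subclass changes the underlying stream without any change to the distribution-level code.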
participants (5)
- Charles R Harris
- Chris Barker
- David Cournapeau
- Nathaniel Smith
- Sturla Molden