Is numpy planning to participate in GSOC this year, either on its own or as part of another group? If so, should we start trying to get some project suggestions together?
On Tue, Feb 26, 2013 at 11:17 AM, Todd <toddrjen@gmail.com> wrote:
> Is numpy planning to participate in GSOC this year, either on its own or as part of another group?
If we participate, it should be under the PSF organization. I suspect participation for NumPy (and SciPy) largely depends on mentors being available.
> If so, should we start trying to get some project suggestions together?
That can't hurt - good project descriptions will be useful not just for GSOC but also for people new to the project looking for ways to contribute. I suggest using the wiki on GitHub for that. Ralf
On Mon, Mar 4, 2013 at 9:41 PM, Ralf Gommers <ralf.gommers@gmail.com> wrote:
I have some ideas, but they may not be suitable for GSOC or may just be terrible ideas, so feel free to reject them:

1. A polar dtype. It would be similar to the complex dtype in that it would have two components, but instead of real and imaginary parts they would be amplitude and angle. Besides the dtype, there should be either functions or methods to convert between complex and polar dtypes, and existing functions should be prepared to handle the new dtype. If it could be made to handle an arbitrary number of dimensions that would be better yet, but I don't know if this is possible, not to mention practical. There is a lot of mathematics, including both signal processing and vector analysis, that is often convenient to work with in this format.

2. We discussed this before, but right now subclasses of ndarray don't have any way to preserve their class attributes when using functions that work on multiple ndarrays, such as concatenate. The current __array_finalize__ method only takes a single array. This project would be to work out a method to handle this sort of situation, perhaps requiring a new method, and to make sure numpy methods and functions properly invoke it.

3. Structured arrays are accessed in a manner similar to python dictionaries, using a key. However, they don't support the normal python dictionary methods like keys, values, items, iterkeys, itervalues, iteritems, etc. This project would be to implement as much of the dictionary (and OrderedDict) API as possible in structured arrays (making sure that the resulting API presented to the user takes into account whether python 2 or python 3 is being used).

4. The numpy ndarray class stores data in a regular manner in memory. This makes many linear algebra operations easier, but makes changing the number of elements in an array nearly impossible in practice unless you are very careful. There are other data structures that make adding and removing elements easier but are not as efficient at linear algebra operations. The purpose of this project would be to create such a class in numpy, one that is duck-type compatible with ndarray but makes resizing feasible. This would obviously come at a performance penalty for linear algebra related functions. Unlike python lists, the elements would still have a consistent dtype and could not be nested. This could either be based on a new C-based type or be a subclass of list under the hood.

5. Currently dtypes are limited to a set of fixed types, or combinations of these types. You can't have, say, a 48-bit float or a 1-bit bool. This project would be to allow users to create entirely new, non-standard dtypes based on simple rules, such as specifying the length of the sign, the length of the exponent, and the length of the mantissa for a custom floating-point number. Hopefully this would mostly be used for reading in non-standard data and not used that often, but for some situations it could be useful for storing data too (such as large amounts of boolean data, or genetic code, which can be stored in 2 bits and is often very large).
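For idea 1, a minimal sketch of the conversions such a dtype would bundle, using only existing NumPy calls (the function names here are hypothetical, not a proposed API):

```python
import numpy as np

def complex_to_polar(z):
    # Split a complex array into the (amplitude, angle) pair that a
    # polar dtype would store per element.
    return np.abs(z), np.angle(z)

def polar_to_complex(r, theta):
    # Inverse conversion: rebuild complex values from polar form.
    return r * np.exp(1j * theta)

z = np.array([1 + 1j, -2j])
r, theta = complex_to_polar(z)
assert np.allclose(polar_to_complex(r, theta), z)
```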
On Mon, Mar 4, 2013 at 2:29 PM, Todd <toddrjen@gmail.com> wrote:
I second this general idea. Simply having a pair of packbits/unpackbits functions that could work with 2- and 4-bit uints would make my life easier. If it were possible to have an array of dtype 'uint4' that used half the space of a 'uint8', but could have ufuncs and the like run on it, it would be pure bliss. Not that I'm complaining, but a man can dream... Jaime -- (\__/) ( O.o) ( > <) This is Rabbit. Copy Rabbit into your signature and help him with his plans for world domination.
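A rough sketch of what such pack/unpack helpers might look like for 4-bit uints (hypothetical names, even-length input assumed):

```python
import numpy as np

def pack_uint4(a):
    # Pack pairs of values in [0, 15] into single bytes, high nibble
    # first. Assumes a.size is even.
    a = np.asarray(a, dtype=np.uint8)
    return (a[0::2] << 4) | a[1::2]

def unpack_uint4(packed):
    # Recover the two 4-bit values stored in each byte.
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed >> 4
    out[1::2] = packed & 0x0F
    return out

a = np.array([1, 15, 7, 2], dtype=np.uint8)
assert np.array_equal(unpack_uint4(pack_uint4(a)), a)
```

This halves the storage, but every operation still pays an unpack/repack cost, which is exactly what a native 'uint4' dtype would avoid.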
I also think this would make a great addition to NumPy. People may even be able to save some work by leveraging the HDF5 code base; the HDF5 guys have piles and piles of carefully tested C code for exactly this purpose: converting between the common IEEE float sizes and those with user-specified mantissas/exponents, 1-, 2-, and 3-bit integers, and the like. It's all under a BSD-compatible license. You'd have to replace the bits which talk to the HDF5 type description system, but it might be a good place to start. Andrew
On 4 Mar 2013 23:21, "Jaime Fernández del Río" <jaime.frio@gmail.com> wrote:
> On Mon, Mar 4, 2013 at 2:29 PM, Todd <toddrjen@gmail.com> wrote:
>> 5. Currently dtypes are limited to a set of fixed types, or combinations of these types. You can't have, say, a 48 bit float or a 1-bit bool. This project would be to allow users to create entirely new, non-standard dtypes based on simple rules, such as specifying the length of the sign, length of the exponent, and length of the mantissa for a custom floating-point number. Hopefully this would mostly be used for reading in non-standard data and not used that often, but for some situations it could be useful for storing data too (such as large amounts of boolean data, or genetic code which can be stored in 2 bits and is often very large).
>
> I second this general idea. Simply having a pair of packbits/unpackbits functions that could work with 2 and 4 bit uints would make my life easier. If it were possible to have an array of dtype 'uint4' that used half the space of a 'uint8', but could have ufuncs and the like run on it, it would be pure bliss. Not that I'm complaining, but a man can dream...

This would be quite difficult, since it would require reworking the guts of the ndarray data structure to store strides and buffer offsets in bits rather than bytes, and probably with endianness handling too. Indexing is all done at the ndarray buffer-of-bytes layer, without any involvement of the dtype. Consider:

    a = zeros(10, dtype=uint4)
    b = a[1::3]

Now b is a view onto a discontiguous set of half-bytes within a... You could have a dtype that represented several uint4s that together added up to an integral number of bytes, sort of like a structured dtype. Or packbits()/unpackbits(), like you say. -n
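To make the obstacle concrete, here is the arithmetic for that example (illustration only; there is no uint4 in NumPy):

```python
# For a hypothetical 4-bit dtype, the view a[1::3] would need:
#   offset = 1 element  =  4 bits = 0.5 bytes
#   stride = 3 elements = 12 bits = 1.5 bytes
# ndarray stores offsets and strides as whole byte counts, so neither
# quantity is representable without reworking the core data structure.
itemsize_bits = 4
offset_bytes = 1 * itemsize_bits / 8  # 0.5 -- not a whole byte
stride_bytes = 3 * itemsize_bits / 8  # 1.5 -- not a whole byte
print(offset_bytes, stride_bytes)
```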
On Mar 5, 2013 7:53 PM, "Nathaniel Smith" <njs@pobox.com> wrote:
> You could have a dtype that represented several uint4s that together added up to an integral number of bytes, sort of like a structured dtype. Or packbits()/unpackbits(), like you say.
>
> -n
Then perhaps such a project could be a four-stage thing.

1. Allow for the creation of int, uint, float, bool, and complex dtypes with an arbitrary number of bytes.
2. Allow for the creation of dtypes which are integer fractions of a byte (1, 2, or 4 bits) and must be padded to a whole byte.
3. Have an optional internal value in an array that tells it to exclude the last n bits of the last byte. This would be used to hide the padding from step 2. This should be abstracted into a general-purpose method for excluding bits from the byte-to-dtype conversion so it can be used in step 4.
4. Allow for the creation of dtypes that are non-integer fractions of a byte or non-integer multiples of a byte (3, 5, 6, 7, 9, 10, 11, 12, etc. bits). Each element in the array would be stored as a certain number of bytes, with the method from step 3 used to cut it down to the right number of bits. So a 3-bit dtype would have two elements per byte with 2 bits excluded. A 5-bit dtype would have one element per byte with 3 bits excluded. A 12-bit dtype would have one element in two bytes with 4 bits excluded from the second byte.

This approach would allow for arbitrary numbers of bits without breaking the internal representation, would have each stage building off the previous stage, and we would still have something useful even if not all the stages are completed.
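The storage arithmetic behind stage 4, as a quick sketch (a hypothetical layout helper, not an actual NumPy interface):

```python
import math

def bit_dtype_layout(nbits):
    # Return (elements per byte, bytes per element, excluded pad bits)
    # for an nbits-wide element under the scheme described above.
    if nbits <= 8:
        elems_per_byte = 8 // nbits
        excluded = 8 - elems_per_byte * nbits
        return elems_per_byte, 1, excluded
    nbytes = math.ceil(nbits / 8)
    excluded = nbytes * 8 - nbits
    return 1, nbytes, excluded

print(bit_dtype_layout(3))   # (2, 1, 2): two elements per byte, 2 bits excluded
print(bit_dtype_layout(5))   # (1, 1, 3): one element per byte, 3 bits excluded
print(bit_dtype_layout(12))  # (1, 2, 4): one element per two bytes, 4 bits excluded
```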
Todd <toddrjen <at> gmail.com> writes:
> I have some ideas, but they may not be suitable for GSOC or may just be terrible ideas, so feel free to reject them:
I also have a possible (terrible?) idea in my mind: including faster transcendental functions in numpy, perhaps optional in the way BLAS is. Something like https://github.com/herumi/fmath or using the MKL. I think numpy just uses the standard library functions, which are not optimized for speed. greetings Till
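A minimal way to measure the baseline this idea targets (results will vary with the machine and the libm NumPy was built against):

```python
import timeit
import numpy as np

x = np.random.rand(1_000_000)

# NumPy's transcendental ufuncs typically call the system libm per
# element; a vectorized replacement (fmath, MKL's VML, ...) would be
# benchmarked the same way for comparison.
for fn in (np.exp, np.sin, np.log):
    t = timeit.timeit(lambda: fn(x), number=20) / 20
    print(f"{fn.__name__}: {t * 1e3:.2f} ms per call")
```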
On Mon, Mar 4, 2013 at 4:29 PM, Todd <toddrjen@gmail.com> wrote:
Along these lines: what about implementing the new "memory friendly" dictionary [0] with a NumPy structured array backend for the dense array portion, and allowing any specified column of the array to be the dictionary keys? This would merge the strengths of NumPy structured arrays with Python dictionaries. Some thought would have to be given to mutability / immutability issues, but these are surmountable. Further enhancements would be to allow for multiple key columns -- analogous to multiple indexes into a database. [0] http://mail.python.org/pipermail/python-dev/2012-December/123028.html
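A minimal, read-only sketch of that idea, with one column of a structured array serving as the key set (the class name and layout are hypothetical; no resizing or open-addressing layout here):

```python
import numpy as np

class StructDict:
    # Expose one field of a structured array as dictionary keys,
    # mapping each key to its full record.
    def __init__(self, data, key_field):
        self._data = data
        self._index = {k: i for i, k in enumerate(data[key_field])}

    def __getitem__(self, key):
        return self._data[self._index[key]]

    def keys(self):
        return self._index.keys()

people = np.array([(1, 182.5), (2, 175.0)],
                  dtype=[('id', 'i4'), ('height', 'f8')])
d = StructDict(people, 'id')
print(d[2]['height'])  # 175.0
```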
This made me think of a serious performance limitation of structured dtypes: a structured dtype is always "packed", which may lead to terrible byte alignment for common types. For instance, `dtype([('a', 'u1'), ('b', 'u8')]).itemsize == 9`, meaning that the 8-byte integer is not aligned as an equivalent C struct's would be, leading to all sorts of horrors at the cache and register level. Python's ctypes does the right thing here, and can be mined for ideas. For instance, the equivalent ctypes Structure adds pad bytes so the 8-byte integer is on the correct boundary:

    class Aligned(ctypes.Structure):
        _fields_ = [('a', ctypes.c_uint8),
                    ('b', ctypes.c_uint64)]

    print ctypes.sizeof(Aligned())  # --> 16

I'd be surprised if someone hasn't already proposed fixing this, although perhaps this would be outside the scope of a GSOC project. I'm willing to wager that the performance improvements would be easily measurable. Just some more thoughts. Kurt
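For what it's worth, NumPy's dtype constructor already accepts an `align=True` flag (the replies below refer to it as `aligned=True`) that inserts C-struct-style padding; it just isn't the default:

```python
import numpy as np

packed = np.dtype([('a', 'u1'), ('b', 'u8')])
aligned = np.dtype([('a', 'u1'), ('b', 'u8')], align=True)

print(packed.itemsize)   # 9  -- no padding
print(aligned.itemsize)  # 16 -- 7 pad bytes so 'b' sits on an 8-byte boundary
```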
I've been confronted with this very problem and ended up coding a "group class", which is a "split" structured array (each field is stored as a single array) offering the same interface as a regular structured array. http://www.loria.fr/~rougier/coding/software/numpy_group.py Nicolas
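A minimal sketch of that struct-of-arrays idea (the real numpy_group.py linked above is far more complete):

```python
import numpy as np

class SplitArray:
    # Store each field as its own contiguous array, while presenting
    # structured-array-style field access.
    def __init__(self, n, fields):
        self._fields = {name: np.zeros(n, dtype=dt) for name, dt in fields}

    def __getitem__(self, name):
        return self._fields[name]

g = SplitArray(1000, [('a', 'u1'), ('b', 'u8')])
g['b'][:] = 42         # each field is naturally aligned and contiguous
print(g['b'].mean())   # 42.0
```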
On 3/5/13 7:14 PM, Kurt Smith wrote:
I would not rush too much. The example above takes 9 bytes to host the structure, while an `aligned=True` layout will take 16 bytes. I'd rather leave the default as it is, and in case performance is critical, you can always copy the unaligned field to a new (homogeneous) array. -- Francesc Alted
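A quick illustration of the workaround Francesc describes: copying a field out of a packed structured array lands it in a fresh, contiguous, naturally aligned buffer:

```python
import numpy as np

arr = np.zeros(1000, dtype=[('a', 'u1'), ('b', 'u8')])  # packed, itemsize 9

b = arr['b'].copy()  # contiguous uint64 array, 8-byte aligned
print(b.dtype, b.flags['C_CONTIGUOUS'])  # uint64 True
```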
On Wed, Mar 6, 2013 at 4:29 AM, Francesc Alted <francesc@continuum.io> wrote:
Yes, I can absolutely see the case you're making here, and I made my "vote" with the understanding that `aligned=False` will almost certainly stay the default. Adding `aligned=True` is simple for me to do, so no harm done. My case is based on what's the least surprising behavior: C structs / all C compilers, the built-in `struct` module, and ctypes `Structure` subclasses all use padding to ensure aligned fields by default. You can turn this off to get packed structures, but the default behavior in these other places is alignment, which is why I was surprised when I first saw that NumPy structured dtypes are packed by default.
participants (10)

- Andrew Collette
- Eric Firing
- Francesc Alted
- Jaime Fernández del Río
- Kurt Smith
- Nathaniel Smith
- Nicolas Rougier
- Ralf Gommers
- Till Stensitzki
- Todd