[Numpy-discussion] GSOC 2013

Mon Mar 4 23:49:30 EST 2013

On Mon, Mar 4, 2013 at 4:29 PM, Todd <toddrjen at gmail.com> wrote:
>
> 3. Structured arrays are accessed in a manner similar to python dictionaries,
> using a key.  However, they don't support the normal python dictionary
> methods like keys, values, items, iterkeys, itervalues, iteritems, etc.  This
> project would be to implement as much of the dictionary (and ordereddict) API
> as possible in structured arrays (making sure that the resulting API
> presented to the user takes into account whether python 2 or python 3 is
> being used).

Along these lines: what about implementing the new "memory friendly" dictionary
[0] with a NumPy structured array backend for the dense array portion, and
allowing any specified column of the array to be the dictionary keys?  This
would merge the strengths of NumPy structured arrays with Python dictionaries.
Some thought would have to be given to mutability / immutability issues, but
these are surmountable.  Further enhancements would be to allow for multiple
key columns -- analogous to multiple indexes into a database.

[0] http://mail.python.org/pipermail/python-dev/2012-December/123028.html

>
> 4. The numpy ndarray class stores data in a regular manner in memory.  This
> makes many linear algebra operations easier, but makes changing the number
> of elements in an array nearly impossible in practice unless you are very
> careful.  There are other data structures that make adding and removing
> elements easier, but are not as efficient at linear algebra operations.  The
> purpose of this project would be to create such a class in numpy, one that
> is duck type compatible with ndarray but makes resizing feasible.  This
> would obviously come at a performance penalty for linear algebra related
> functions.  They would still have consistent dtypes and could not be nested,
> unlike python lists.  This could either be based on a new c-based type or be
> a subclass of list under the hood.

This made me think of a serious performance limitation of structured dtypes: a
structured dtype is always "packed", which may lead to terrible byte alignment
for common types.  For instance, `dtype([('a', 'u1'), ('b',
'u8')]).itemsize == 9`,
meaning that the 8-byte integer is not aligned as an equivalent C-struct's
would be, leading to all sorts of horrors at the cache and register level.
Python's ctypes does the right thing here, and can be mined for ideas.   For
instance, the equivalent ctypes Structure adds pad bytes so the 8-byte integer
is on the correct boundary:

    class Aligned(ctypes.Structure):
        _fields_ = [('a', ctypes.c_uint8),
                    ('b', ctypes.c_uint64)]

    print ctypes.sizeof(Aligned()) # --> 16

I'd be surprised if someone hasn't already proposed fixing this, although
perhaps this would be outside the scope of a GSOC project.  I'm willing to
wager that the performance improvements would be easily measureable.

Just some more thoughts.

Kurt