[Numpy-discussion] named ndarray axes

Wed Jul 13 19:15:06 EDT 2011

Yup exactly.  To enable this sort of tracking I needed to explicitly reverse-engineer the effects of indexing on axes.  I figure overriding indexing catches most cases that modify axes, but other holes need to be plugged as well... ie: tranpose, swapaxes.  This probably means most C functions that change array axes (np.mean(axis=), etc.) need to be covered as well.... that sucks.  

BTW, it sounds like you're trying to track very similar data.  I am trying to load structural biology data formats, and I try to preserve as much of the metadata as possible, ie: I am encoding unit cell length/angle information as vectors, etc.

Here is my implementation:

    def __getitem__(self,idxs):
        idxs,trans,keep = idx_axes(self,idxs)
        array = self.view(np.ndarray).transpose(trans)
        array = array[idxs]
        if isinstance(array,ndarray):
            array = array.view(self.__class__)
            array.axes = self.axes.transpose(keep)
        return array

def idx_axes(array,idxs):

    # explicitly expand ellipsis
    expanded_idxs = idx_expanded(array.ndim,idxs)

    # determine how the axes will be rearranged as a result of axes-based indexing
    # and the creation of newaxes
    remapped_axes = idx_axes_remapped(array.ndim,array.axes,expanded_idxs)

    # determine numpy compatible transpose, before newaxes are created
    transposed_axes = idx_axes_transposed(remapped_axes)

    # determine numpy compatible indexes with axes-based indexing removed
    filtered_idxs = idx_filtered(expanded_idxs)

    # determine which axes will be kept after numpy indexing
    kept_axes = idx_axes_kept(remapped_axes,filtered_idxs)

    return filtered_idxs,transposed_axes,kept_axes

def idx_expanded(ndim,idxs):
    '''
    explicitly expands ellipsis taking into account newaxes
    '''

    if not isinstance(idxs,tuple):
        return idx_expanded(ndim,(idxs,))

    # how many dimensions we will end up having
    ndim = ndim + idxs.count(newaxis)

    filler = slice(None)

    def skip_ellipsis(idxs):
        return tuple([filler if isinstance(x,type(Ellipsis)) else x for x in idxs])

    def fill_ellipsis(ndim,l,r):
        return (filler,)*(ndim-len(l)-len(r))

    # expand first ellipsis, treat all other ellipsis as slices
    if Ellipsis in idxs:
        idx = idxs.index(Ellipsis)
        llist = idxs[:idx]
        rlist = skip_ellipsis(idxs[idx+1:])
        cfill = fill_ellipsis(ndim,llist,rlist)
        idxs = llist + cfill + rlist

    return idxs

def idx_filtered(idxs):
    '''
    replace indexes on axes with normal numpy indexes
    '''
    def axisAsIdx(idx):
        if isinstance(idx,str):
            return slice(None)
        elif isinstance(idx,slice):
            if isinstance(idx.start,str):
                return idx.stop
        return idx

    return tuple([axisAsIdx(x) for x in idxs])

def idx_axes_remapped(ndim,axes,idxs):
    '''
    if given a set of array indexes that contain labeled axes,
    return a tuple that maps axes in the source array to the axes
    as they will end up in the destination array.  Must take into
    account the spaces created by newaxis indexes.
    '''

    # how many dims are we expecting?
    ndim = ndim + idxs.count(newaxis)

    # new unique object for marking unassigned locations in mapping
    unassigned = object()

    # by default all locations are unsassigned
    mapping = [unassigned] * ndim

    # find labels in indexes and set the dims for those locations
    for dim,label in enumerate(idxs):
        if label == newaxis:
            mapping[dim] = label
        elif isinstance(label,str):
            mapping[dim] = axes.dimForLabel(label)
        elif isinstance(label,slice):
            if isinstance(label.start,str):
                mapping[dim] = axes.dimForLabel(label.start)

    # find unassigned dims, in order
    unmapped = [d for d in range(ndim) if d not in set(mapping)]

    # fill in remaining unassigned locations with dims
    for dst,src in enumerate(mapping):
        if src == unassigned:
            mapping[dst] = unmapped.pop(0)

    return tuple(mapping)

def idx_axes_transposed(mapping):
    '''
    stripping out newaxes in mapping yields a tuple compatible with transpose
    '''
    return tuple([x for x in mapping if x != newaxis])

def idx_axes_kept(mapping,idxs):
    '''
    remove axes from mapping that will not survive the indexing (ie: ints)
    '''
    kept = []
    first_list = True
    for dst,src in enumerate(mapping):
        if dst < len(idxs):
            idx = idxs[dst]
            if isinstance(idx,int):
                continue
            elif isinstance(idx,list):
                if not first_list:
                    continue
                first_list = False
        kept += [src]
    return tuple(kept)

On Jul 13, 2011, at 3:13 PM, Sam Quinan wrote:

> I'm currently working on interfacing ndarrays with a custom C-representation
> for n-dimensional arrays. My custom C code provides additional per-axis
> information (labeling, spacing between samples / range of sample positions
> along the axis, axis direction, cell vs.node centering, etc.) Subclassing
> ndarray to hold onto this info is fairly simple, but getting numpy's methods
> to intelligently modify that information when the array is sliced is
> something that I'm still trying to figure out.
> 
> A robust way to attach per-axis info to a given ndarray, whether it just be
> a label or some more complex structure, would definitely be something I (and
> likely others) would find useful...
> 
> That said, I'd love to know more about how the idx_axes() structure in your
> workaround works...
> 
> - Sam
> 
> 
> 
> On 7/13/11 12:00 PM, "numpy-discussion-request at scipy.org"
> <numpy-discussion-request at scipy.org> wrote:
> 
>> Date: Tue, 12 Jul 2011 16:39:47 -0700
>> From: Craig Yoshioka <craigyk at me.com>
>> Subject: [Numpy-discussion] named ndarray axes
>> To: NumPy-Discussion at scipy.org
>> Message-ID: <0FC8B43E-26CD-40ED-A6FA-59DD8D641998 at me.com>
>> Content-Type: text/plain; CHARSET=US-ASCII
>> 
>> I brought up a while ago about how it would be nice if numpy arrays could have
>> their axes 'labeled'.    = I got an implementation that works pretty well for
>> me and in the process learned quite a few things, and was hoping to foster
>> some more discussion on this topic, as I think I have found a simple/flexible
>> solution to support this at the numpy level.
>> 
>> Here are *some* examples code snippets from my unit tests on 'Array':
>> 
>>    a = Array((4,5,6))
>> 
>>    # you can assign data to all axes by attribute:
>>    a.Axes.Label = (0:'z',1:'y',2:'x'}
>> 
>>    # or add metadata to each individually:
>>    a.Axes[1].Vector = [0,1,0]
>>    a.Axes[2].Vector = [0,0,1]
>> 
>>    # simple case int indexing
>>    b = a[0]
>>    assert b.shape == (5,6)
>>    assert b.Axes.Label == {0:'y',1:'x'}
>>    assert b.Axes.Vector == {0:[0,1,0],1:[0,0,1]}
>> 
>>    # indexing with slices
>>    b = a[:,0,:]
>>    assert b.shape == (4,6)
>>    assert b.Axes.Label == {0:'z',1:'x'}
>>    assert b.Axes.Vector == {1:[0,0,1]}
>> 
>>    # indexing with ellipsis
>>    b = a[...,0]
>>    assert b.shape == (4,5)
>>    assert b.Axes.Label == {0:'z',1:'y'}
>> 
>>    # indexing with ellipsis, newaxis, etc.
>>    b = a[newaxis,...,2,newaxis]
>>    assert b.shape == (1,4,5,1)
>>    assert b.Axes.Label == {1:'z',2:'y'}
>> 
>>    # indexing with lists
>>    b = a[[1,2],:,[1,2]]
>>    assert b.shape == (2,5)
>>    assert b.Axes.Label == {0:'z',1:'y'}
>> 
>>    # most interesting examples, indexing with axes labels----------------
>>    # I was a bit confused about how to handle indexing with mixed
>> axes/non-axes indexes
>>    # IE: what does a['x',2:4]  mean?  on what axis is the 2:4 slice being
>> applied, the first? the first after 'x'?
>>    #       One option is to disallow mixing (simpler to implement,
>> understand?)
>>    #       Instead I chose to treat the axis indexing as a forced assignment
>> of an axis to a position.
>> 
>>    # axis indexing that transposes the first two dimensions, but otherwise
>> does nothing
>>    b = a['y','z']
>>    assert b.shape == (5,4,6)
>>    assert b.Axes.Label == {0:'y',1:'z',2:'x'}
>> 
>>    # abusing slices to allow specifying indexes for axes
>>    b = a['y':0,'z']
>>    assert b.shape == (4,6)
>>    assert b.Axes.Label == {0:'z',1:'x'}
>> 
>>    # unfortunately that means a slice index on an axis must be written like
>> so:
>>    b = a['y':slice(0,2),'x','z']
>>    assert b.shape == (2,6,4)
>>    assert b.Axes.Label == {0:'y',1:'x',2:'z'}
>> 
>>    b = a['y':[1,2,3],'x','z':slice(0:1)]
>>    # or due to the forced transposition, this is the same as:
>>    c = a['y','x','z'][[1,2,3],:,0:1]
>> 
>>    assert b.shape == (3,6,1)
>>    assert b.Axes.Label == {0:'y',1:'x',2:'z'}
>>    assert b.shape == c.shape
>>    assert b.Axes == c.Axes
>> 
>> 
>> #-----------------------------------------------------------------------------
>> -----------
>> 
>> 
>> To do all this I essentially had to recreate the effects of numpy indexing on
>> axes....  This is not ideal, but so far I seem to have addressed most of the
>> indexing I've used, at least. Here is what __getitem__ looks like:
>> 
>>    def __getitem__(self,idxs):
>>        filtered_idxs,transposed_axes,kept_axes = self.idx_axes(idxs)
>>        array = self.view(ndarray).transpose(transposed_axes)
>>        array = array[filtered_idxs]
>>        if isinstance(array,ndarray):
>>            array = array.view(Array)
>>            array.Axes = self.Axes.keep(kept_axes)
>>        return array
>> 
>> As you can see idx_axes() essentially recreates a lot of ndarray indexing
>> behavior, so that its effects can be explicitly handled.
>> 
>> Having done all this, I think the best way for numpy to support 'labeled' axes
>> in the future is by having numpy itself keep track of a very simple tuple
>> attribute, like shape, and leave more complex axis naming/labeling to
>> subclasses on the python side.  As an example, upon creating a new dimension
>> in an array, numpy assigns that dimension a semi-unique id, and this tuple
>> could be used in __array_finalize__.
>> 
>> For example my __array_finalize__ could look like:
>> 
>> def __array_finalize__(self,obj):
>>    if hasattr(obj,'axesdata'):
>>         for axesid in self.axes:
>>              if axesid in obj.axes:
>>                   self.axesdata[axesid] = obj.axesdata[axesid]
>> 
>> 
>> This would cover a lot more situations and lead to much simpler code since the
>> work required on the C side would be minimal, but still allow robust and
>> custom tracking and propagation of axes information.
>> Subclasses that tap into this data would react to the result of numpy
>> operations vs. having to predict/anticipate.
>> 
>> For example, my __getitem__, relying on the __array_finalize__ above, could
>> look like:
>> 
>>    def __getitem__(self,idxs):
>>        filtered_idxs,transposed_axes= self.idx_axes(idxs)
>>        array = self.transpose(transposed_axes)
>>        return array[filtered_idxs]
>> 
>> Not shown is how much simpler and robust the code for idx_axes would then be.
>> I estimate it would go from 130 loc to < 20 loc.
>> 
>> Sorry for the extra long e-mail,
>> -Craig
> 
> 
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion