New subject: kdf file format I/O routine

Jan. 25, 2002

      Having written all this down, I'm half-expecting somebody is going to
point out some simple way of doing this with existing builtin functions,
but here goes...

In the process of 'porting' Konrad Hinsen's Histogram class (from the
Scientific package) to N dimensions, I ran across the problem of how to
apply a function to an array for a specified set of indices.  In this
case, the problem is to increment an N-dimensional array for each of a set
of indices.  In general, it occurred to me that it would be useful to have
a function (named 'amap' here) to loop over a specified set of indices,
applying a function to a set of arguments taken from a set of argument
arrays (similar to the map buitin function in this respect), and filling
in the corresponding elements in a target array with the results.  For
example, here is a bit of the histogram problem:

[The original 1D Histogram class uses a clever array-broadcasting trick
which creates intermediate rank-2 arrays, feeding in the data in chunks to
avoid quadratic behaviour.  This is fine with 1D arrays, but in poor taste
for the N dimensional case, if it works at all.]

#!/usr/bin/env python
import Numeric as N

def flatten_indices(shape, indices):
    dims = indices.shape[0]
    facs = pow(shape, N.arange(dims))[::-1,N.NewAxis]
    return N.add.reduce(facs*indices)

def amap(target, function, indices, *arg_arrays):
    indices = N.asarray(indices)
    arg_arrays = map(N.asarray, arg_arrays)
    if not isinstance(target, N.ArrayType):
        raise TypeError, "target must be an array"
    if not target.iscontiguous():
        raise ValueError, "target must be contiguous"
    if not N.alltrue([aa.iscontiguous() for aa in arg_arrays]):
        raise ValueError, "all arg_arrays must be contiguous"
    if not len(indices.shape) == 1:
        raise ValueError, "indices must be rank-1"
    tf = target.flat
    for index in indices:
        args = [arg_array.flat[index] for arg_array in arg_arrays]
        tf[index] = function(*args)

# construct a 2 dimensional histogram given indices, ind, of data points
nbins = (5, 4)  # 5 bins along one axis, 4 along the other
histo = N.zeros(nbins)  # empty histogram
# these are the indices into the histogram at which data points lie:
# histo[1,0] should be incremented twice, histo[2,2] should be incremented
# once, etc.
ind = N.array([[1, 1, 2, 4],
               [0, 0, 2, 3]])
flat_ind = flatten_indices(histo.shape, ind)
assert flat_ind == N.array([4, 4, 10, 19])
# Now we want to increment histo for each index pair ind[:,n] (in this
# 2D case)
def increment(x): return x+1
amap(histo, increment, flat_ind, histo)
print histo

Actually, I just realised that calling it something related to 'map' is
rather misleading, because the builtin Python map generates a list of the
same length as the sequence passed in; this function doesn't: it
'generates' (fills in, in fact) an array of the same size as the target
passed in.  Can't think of a better name, though.

So: is there a way of doing this (implementing amap) efficiently already?
I don't think there is, but I'd be interested to be proved wrong.

I admit I don't like the relatively large number of arguments, but I still
think it is worthwhile.  You could make the *arg_arrays a single argument,
and make target an optional argument, at the end of the arguments:

def amap(function, indices, arg_arrays, target=None):

so that by default, the target array is first created as zeros, then
filled in and returned, or perhaps even better:

def amap(function, arg_arrays, indices=None, target=None):

though the special case where you don't give the indices argument is
already achievable without this function, so it could be argued that
hiding away the indices argument as a mere optional argument obscures the
real point of the function.  The last form also has the (debatable)
benefit of being similar to the built-in Python map:

def map(function, *args):

The example call above then looks like this:

histo = amap(increment, [histo], indices=flat_ind)

Another couple of tweaks might be to allow scalar values to replace the
arg_arrays argument, and to allow normal indices rather than flattened
ones (the only reason I started out with flat indices was that put() did
the same):

def amap(function, args, indices=None, target=None):

So that the example call looks like:

histo = amap(increment, [histo], indices=ind)

or

histo = amap(Numeric.add, [histo, 1], ind)

I like those last two.  Even if there is already an efficient way of doing
this with existing Numeric functions, I think this is a natural way of
thinking about some operations on arrays -- this one, at least -- and
would be useful to have in Numeric.

For real efficiency, I suppose the right thing to do would be to have a
map method on ufuncs, as is the case for reduce:

histo = Numeric.add.map([histo, 1], ind)

but it would be more work to implement than just a standard function, and
less general.

I've made a first attempt at the function, in C (will upload diff against
20.2.0 if anyone so requests).  It only accepts one arg array, probably
doesn't really do exactly the same as the Python function above (for
better or worse), and it's a mess of a function, since I rarely write
anything in C.  It is also incorrect in at least some ways (most
obviously, it only deals with Python floats / C doubles / Numpy
PyArray_DOUBLEs), though there are some tests (which pass).

I'm wasn't sure how fast it would be, since after all it's only using
ordinary Python functions, not ufuncs, but my crude tests appeared to show
an improvement of about a factor of two over an optimised version of the
Python function above.  The slow bit seems to be constructing the argument
list / passing arg list to function, so I removed that to make the
comparison, along with an unecessary attribute access in the loop.
Presumably the general (> 1 arg_array) version would show a bigger
improvement over the corresponding Python version.

By the way, does anyone understand what the reduceat ufunc method is
supposed to do?  Seems to be the Beale cipher of Numeric -- having studied
its output to see if it was the function I was looking for, I can find no
sense in it and kind of suspect that it was constructed only to keep us
puzzling. ;) Is Jim Hugunin not available for comment on the matter, or is
this somebody else's dirty work?

John

amap: new function / ufunc method?

John J. Lee

cavallo＠kip.uni-heidelberg.de

Magnus Lie Hetland

Travis Oliphant

John J. Lee

cavallo＠kip.uni-heidelberg.de

Magnus Lie Hetland

Travis Oliphant

John J. Lee

tags

participants (4)