[Numpy-discussion] Proxy array class and units

Wed Feb 13 21:12:58 EST 2008

Hi all,

I'm looking for some advice on how to implement a 'unit checking proxy array'. 

Background: I'm writing a package to run simulations which make extensive use of
linear algebra, for which I'm using numpy. However - it is important to my
package that quantities can have dimesions, so I've written a class Quantity
subclassed from float which includes the physical dimensions and raises
exceptions etc. if you try to add volts to kilograms or whatever. This seems to
work fine, but I also wanted arrays containing physical dimensions. I wrote a
class qarray subclassed from numpy.ndarray which mostly just copies the
functionality of an array with dtype=object. Here is the issue. The code for my
package internally doesn't use these qarray objects, it uses numpy arrays
because it would obviously be terribly slow to be repeatedly checking if the
dimensions were consistent all the time. However, I want to present a (somewhat)
consistent strict unit-checking front to the user of this package. As one part
of this, I would like a class that appears to act exactly like a qarray (which
itself is supposed to act very much like a numpy array) but maintains a
connection with its numpy array counterpart in my package.

Suppose I have an array X in my package, then I want an object Y which the user
is free to manipulate like an array, but if they do something like Y[0]=2*volt
then the object Y does (1) check that Y[0] has volt units, (2) if so, update
Y[0] and also X[0]. My question is about what is the best way to implement
something like this.

So far, what I've done is written a proxy array class (call it pqarray say)
derived from qarray (which is pretty much just array with dtype=object but not
exactly as it does some other things too). When my package creates a pqarray for
the user, it does the following: take the underlying numpy array X; create Y a
qarray based on X with the correct units (this necessarily copies the data in X
because the dtype is different); create Z from X, Y, a pqarray that knows that Y
is based on X. The class pqarray simply overrides the __setitem__ method to
update itself and X if the units are correct (or raise an exception). Now, this
does actually work, but it's not terribly elegant and there is a problem.

The standard way that this proxy array will be accessed is through another
object, let's just call it P. P has a property val (it has to be a property in
general, because the internal numpy array could change at any moment). If you
write P.val it returns a pqarray corresponding to the internal representation of
the data. Let's suppose P.val has length 1000. Suppose the user writes:

for i in range(1000):
  P.val[i] = i*volt

Now what happens is that this creates a new pqarray 1000 times, and this
involves copying 1000 elements each time for something of the order of 1M
operations total when it should really only be of the order of 1000 operations
total. Now, obviously given the way it's written, the user 'ought' to write:

val = P.val
for i in range(1000):
  val[i] = i*volt

or even better (and what you would do in practice):

P.val = qarray(range(1000))*volt

and then it will be O(1000) operations, but (a) the user shouldn't need to know
this, and (b) my pqarray class seems horrifically ugly and I want something
better.

It seems to me that my options are:

(1) Use my current solution, and instruct the user about the potential problem
(which probably won't come up too often anyway).

(2) Write pqarray to be a subclass of numpy.ndarray so that no data is copied in
creating the pqarray, and give pqarray __getitem__ and __setitem__ methods that
return a Quantity with the right units and do unit checking, respectively. I
have actually implemented this, and it sort of works but I have some worries
about it. For example, if Y is a pqarray of the numpy array X then it seems
print Y will give the same result as print X (i.e. it forgets the units). I
could presumably look this up and override this behaviour but that's a game
which has no end (or at least, a very distant end that might end up with me
having reimplemented the whole of numpy.ndarray).

(3) Have some sort of caching mechanism, so that P.val returns the cached value
unless the underlying numpy array is changed. The problem with this is that you
have to keep track of whether or not the underlying numpy array has changed
which could be fiddly and rather unsafe.

(4) Give up hoping that P.val can be a subclass of an array, make it a new class
that keeps a copy of the necessary units and the underlying array, and can do
some basic array operations. Since it's not an array type object itself, the
user can be told not to expect it to behave like one. This would be quite a sad
solution though, because I want the user to be able to write things like
mean(P.val) and have them just work as they'd expect (which works at the moment).

(5) Something I haven't thought of yet...?

Any advice of any sort would be very much appreciated. I did search for relevant
information but nothing much came up. Plenty about how to do subclassing of
numpy array's, but that's not really my problem here - it's more specific than
that.

Many thanks in advance,

--
Dan Goodman
http://thesamovar.net/contact