[Numpy-discussion] RFC: Detecting array changes (NumPy 2.0?)
Dag Sverre Seljebotn
d.s.seljebotn at astro.uio.no
Fri Mar 11 13:41:39 EST 2011
There are a few libraries out there that need to know whether or not an
array changed since the last time it was used: joblib and pymc come to
mind. I believe joblib computes a SHA1 or md5 hash of array contents,
while pymc simply assumes you never change an array and uses the id().
The pymc approach is fragile, while in my case the joblib approach is
too expensive since I'll call the function again many times in a row
with the same large array (yes, I can code around it, but the code gets
So, would it be possible to very quickly detect whether a NumPy array is
guaranteed to not have changed? Here's a revision counter approach:
1) Introduce a new 64-bit int field "modification_count" in the array
2) modification_count is incremented any time it is possible that an
array changes. In particular, PyArray_DATA would increment the counter.
3) A new PyArray_READONLYDATA is introduced that does not increment
the counter, which can be used in strategic spots. However, the point is
simply to rule out *most* sources of having to recompute a checksum for
the array -- a non-matching modification_count is not a guarantee that the
array has changed, but a matching modification_count is a guarantee of
an unchanged array.
4) The counter can be ignored for readonly (base) arrays.
5a) A method is introduced Python-side,
arr.checksum(algorithm="md5"|"sha1"), that uses this machinery to cache
checksum computation and that can be plugged into joblib.
5b) Alternatively, the modification count is exposed directly on the
Python side, and it is up to users to store the modification count (e.g.
in a WeakKeyDictionary indexed by the array's base array).
Another solution to the problem would be to allow registering event
handlers. Main reason I'm not proposing that is because I don't want to
spend the time to implement it (sounds a lot more difficult), it appears
to be considerably less backwards-compatible, and so on.
Why not a simple dirty flag? Because you'd need one for every possible
application of this (e.g., md5 and sha1 would need separate dirty flags,
and other uses than hashing would need yet more flags, and so on).
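
A small sketch of that difference, again with purely hypothetical classes
(Flagged, Counted, Consumer) rather than anything in NumPy: with a single
shared flag the first consumer to clear it hides the change from the second,
whereas with a counter each consumer just remembers the last value it saw.

```python
class Flagged:
    """Hypothetical object with one shared dirty flag."""

    def __init__(self):
        self.dirty = True

    def write(self):
        self.dirty = True


def flag_needs_refresh(obj):
    # Whoever checks the flag must also clear it, otherwise it is
    # useless; but clearing it affects every other consumer too.
    stale = obj.dirty
    obj.dirty = False
    return stale


class Counted:
    """Hypothetical object with a modification counter."""

    def __init__(self):
        self.modification_count = 0

    def write(self):
        self.modification_count += 1


class Consumer:
    """Each consumer (md5 cache, sha1 cache, ...) keeps its own last-seen count."""

    def __init__(self):
        self.seen = None

    def needs_refresh(self, obj):
        stale = self.seen != obj.modification_count
        self.seen = obj.modification_count
        return stale


# Shared flag: the second consumer misses the write.
f = Flagged()
f.write()
flag_results = (flag_needs_refresh(f), flag_needs_refresh(f))

# Counter: both consumers notice the write independently.
c = Counted()
md5_user, sha1_user = Consumer(), Consumer()
c.write()
counter_results = (md5_user.needs_refresh(c), sha1_user.needs_refresh(c))
```

Here flag_results comes out as (True, False) and counter_results as
(True, True), so the object itself needs no per-application state.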