[Numpy-discussion] RFC: Detecting array changes (NumPy 2.0?)

Fri Mar 11 13:47:58 EST 2011

On Fri, Mar 11, 2011 at 11:41 AM, Dag Sverre Seljebotn <
d.s.seljebotn at astro.uio.no> wrote:

> There are a few libraries out there that need to know whether or not an
> array has changed since the last time it was used: joblib and pymc come to
> mind. I believe joblib computes a SHA1 or MD5 hash of the array contents,
> while pymc simply assumes you never change an array and uses its id().
>
> The pymc approach is fragile, while in my case the joblib approach is
> too expensive since I'll call the function again many times in a row
> with the same large array (yes, I can code around it, but the code gets
> less streamlined).
>
> So, would it be possible to very quickly detect whether a NumPy array is
> guaranteed to not have changed? Here's a revision counter approach:
>
>  1) Introduce a new 64-bit int field "modification_count" in the array
> object struct.
>
>  2) modification_count is incremented any time it is possible that an
> array changes. In particular, PyArray_DATA would increment the counter.
>
>  3) A new PyArray_READONLYDATA is introduced that does not increment
> the counter, which can be used in strategic spots. However, the point is
> simply to rule out *most* sources of having to recompute a checksum for
> the array -- a non-matching modification_count is not a guarantee the
> array has changed, but a matching modification_count is a guarantee of
> an unchanged array.
>
>  4) The counter can be ignored for readonly (base) arrays.
>
>  5a) A method is introduced Python-side,
> arr.checksum(algorithm="md5"|"sha1"), that uses this machinery to cache
> checksum computation and that can be plugged into joblib.
>
>  5b) Alternatively, the modification count is exposed directly on the
> Python side, and it is up to users to store the modification count (e.g.
> in a WeakKeyDictionary indexed by the array's base array).
>
> Another solution to the problem would be to allow registering event
> handlers. Main reason I'm not proposing that is because I don't want to
> spend the time to implement it (sounds a lot more difficult), it appears
> to be considerably less backwards-compatible, and so on.
>
> Why not a simple dirty flag? Because you'd need one for every possible
> application of this (e.g., md5 and sha1 would need separate dirty flags,
> and uses other than hashing would need yet more flags, and so on).
>
>
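
To make the counter semantics in points 1) to 5a) concrete, here is a toy
sketch in pure Python. It is not NumPy code: CountedBuffer, data(),
readonly_data() and hash_runs are hypothetical names invented for the
illustration, with data() standing in for PyArray_DATA and readonly_data()
for the proposed PyArray_READONLYDATA. It assumes every writer goes through
data().

```python
import hashlib


class CountedBuffer:
    """Toy stand-in for an array object carrying a modification counter."""

    def __init__(self, data):
        self._data = bytearray(data)
        self.modification_count = 0
        self._cache = {}    # algorithm -> (count when hashed, hex digest)
        self.hash_runs = 0  # bookkeeping for the demonstration only

    def data(self):
        """Writable access: assume a write may follow, so bump the counter."""
        self.modification_count += 1
        return self._data

    def readonly_data(self):
        """Read-only access: hands out a copy and leaves the counter alone."""
        return bytes(self._data)

    def checksum(self, algorithm="sha1"):
        cached = self._cache.get(algorithm)
        if cached is not None and cached[0] == self.modification_count:
            return cached[1]  # matching count: contents cannot have changed
        self.hash_runs += 1
        digest = hashlib.new(algorithm, self._data).hexdigest()
        self._cache[algorithm] = (self.modification_count, digest)
        return digest


buf = CountedBuffer(b"abcd")
first = buf.checksum()        # hashes the contents
second = buf.checksum()       # count unchanged, served from the cache
buf.data()[0] = ord("z")      # writable access bumps the counter
third = buf.checksum()        # count differs, so the hash is recomputed
buf.data()                    # writable access without an actual write
fourth = buf.checksum()       # recomputed again, but the digest is the same
```

The last two lines show the asymmetry from point 3): a differing count only
means the contents *may* have changed, while a matching count lets the
cached digest be reused. Keeping one cache entry per algorithm is also why a
single counter can serve both md5 and sha1, where a dirty flag could not.
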
What about views? Wouldn't it be easier to write another object wrapping an
ndarray?
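
A rough sketch of what such a wrapper might look like, assuming all writes
go through the wrapper. TrackedArray and its methods are hypothetical names,
not an existing NumPy API. Views created by basic slicing share one counter
cell with their parent, so a write through a view is also seen by the base:

```python
import numpy as np


class TrackedArray:
    """Hypothetical wrapper that counts writes made through it."""

    def __init__(self, data, _counter=None):
        if _counter is None:
            # New base object: copy so no outside alias can bypass the counter.
            self._data = np.array(data)
            self._counter = [0]  # one-element list used as a shared cell
        else:
            # View of an existing wrapper: same memory, same counter cell.
            self._data = data
            self._counter = _counter

    @property
    def modification_count(self):
        return self._counter[0]

    def __getitem__(self, key):
        result = self._data[key]
        if isinstance(result, np.ndarray) and np.shares_memory(result, self._data):
            # Basic slicing gives a view; wrap it so writes stay counted.
            return TrackedArray(result, _counter=self._counter)
        return result  # scalars and fancy-indexing copies are independent

    def __setitem__(self, key, value):
        self._data[key] = value
        self._counter[0] += 1

    def readonly(self):
        """Return a non-writeable ndarray view for read-only consumers."""
        view = self._data.view()
        view.flags.writeable = False
        return view


a = TrackedArray([1, 2, 3, 4])
v = a[1:3]                    # wrapped view sharing a's counter
before = a.modification_count
v[0] = 99                     # write through the view
after = a.modification_count  # the base sees the change
```

The obvious cost is that the wrapper is not an ndarray, so code that expects
one has to be handed readonly() or taught about the wrapper.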

Chuck