On Mon, Oct 1, 2012 at 6:20 AM, Nathaniel Smith <njs@pobox.com> wrote:
On Sun, Sep 30, 2012 at 8:59 PM, Travis Oliphant <travis@continuum.io> wrote:
Hey all,
In a github-discussion with Gael and Nathaniel, we came up with a proposal for .base that we should put before this list. Traditionally, .base has always pointed to None for arrays that owned their own memory and to the "most immediate" array object parent for arrays that did not own their own memory. There was a long-standing issue related to running out of stack space that this behavior created.
To be *completely* accurate, I'd say that they've always pointed to some object that owned the underlying memory. Usually that's an ndarray, but sometimes that's a thing exposing the buffer interface, sometimes it's a thing exposing __array_interface__, sometimes it's an mmap object, sometimes it's some random ad hoc C-level wrapper object[1], etc.
[1] e.g. https://github.com/njsmith/scikits-sparse/blob/master/scikits/sparse/cholmod...
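A small sketch of the point above, assuming a reasonably recent NumPy: an array that owns its memory has .base set to None, a view of it has an ndarray as .base, and an array built over a plain buffer has some non-ndarray object as .base. The variable names are illustrative only, and the exact type of the buffer-backed .base is deliberately not asserted because it has varied between NumPy versions.

```python
import numpy as np

a = np.arange(10)                        # owns its own memory
v = a[2:8]                               # view onto a's memory
buf = bytearray(16)                      # plain buffer-interface object
f = np.frombuffer(buf, dtype=np.uint8)   # ndarray over memory it does not own

print(a.base is None)                    # owner: .base is None
print(v.base is a)                       # view: .base is the owning ndarray
print(type(f.base))                      # some non-ndarray object; type depends on NumPy version
```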
Recently this behavior was altered so that .base always points to "the original" object holding the memory (something exposing the buffer interface). This created some problems for users who relied on the fact that most of the time .base pointed to an instance of an array object.
The proposal here is to change the behavior of .base for arrays that don't own their own memory so that the .base attribute of an array points to "the most original object" that is still an instance of the type of the array. This would go into the 1.7.0 release so as to correct the issues reported.
What are reactions to this proposal?
As a band-aid to avoid breaking some code in 1.7, it seems reasonable to me. I was actually considering proposing basically the same idea. But it's only a band-aid; the larger problem is that we don't *know* what semantics people are relying on for "base" (and probably aren't implementing the ones people think we are, either before or after this change).
As an example of how messy this is: do you know whether Gael's code will still work, after we make this fix, if someone uses as_strided() on a (view of a) memmap array?
Answer: as_strided() creates an ndarray view on an ad-hoc object with __array_interface__ attribute, and this dummy object ends up as the returned ndarray's .base. According to the proposed rule, the .base chain collapsing will stop at this point. So it isn't true that an array that is ultimately backed by mmap will have a numpy.memmap array as its .base. However, if you read stride_tricks.py, it turns out the dummy object as_strided makes does happen to use the name ".base" for its attribute holding the original array, so Gael's code will work correctly in this case iff he keeps the .base walking code in place (which would otherwise serve no purpose after Travis' change).
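The ".base walking" described above can be sketched roughly as follows. This is not Gael's actual code: base_chain is a hypothetical helper, and a plain ndarray stands in for the memmap array to keep the example self-contained. It makes no claim about what the intermediate object is, only that following .base attributes eventually reaches the original array.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided


def base_chain(obj):
    """Follow .base attributes until one is missing or None.

    Hypothetical helper: returns the list of objects visited, nearest first.
    """
    chain = []
    base = getattr(obj, "base", None)
    while base is not None:
        chain.append(base)
        base = getattr(base, "base", None)
    return chain


a = np.arange(6)                                   # stand-in for the memmap array
s = as_strided(a, shape=(3,), strides=(2 * a.strides[0],))  # every other element

chain = base_chain(s)
# Identity test on purpose: "a in chain" would trigger elementwise comparison.
reaches_original = any(obj is a for obj in chain)
print([type(obj).__name__ for obj in chain], reaches_original)
```

Checking only s.base, without the walk, is exactly the shortcut that depends on which object ends up as the immediate .base.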
Anyway, my point is: If we have to carefully analyze interactions between code in numpy.lib.stride_tricks, numpy.core.memmap, and a third-party library, just to figure out which sorts of reference-counting changes are correct in the core ndarray object, then we have a problem. This is horrible cross-coupling, the sort of thing that, if allowed to proliferate, makes it impossible to ever know whether code is correct or not.
So even if we put in a band-aid for 1.7, we really don't want to be guaranteeing this kind of stuff forever, and should aggressively encourage people to stop using .base in these ways. The mmap thing should really switch to something more reliable and less tightly coupled to the rest of the code all over numpy, like I described here:
http://mail.scipy.org/pipermail/numpy-discussion/2012-September/064003.html
How can we discourage people from doing this in the future? Can we make .base write-only from the Python level (with suitable deprecation period)? Rename it to ._base (likewise) so that it's still possible to peek under the covers but we remind people that it's really an implementation detail with poorly defined semantics that might change?
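One way the rename-with-deprecation idea might look, sketched on a toy Python class rather than on ndarray itself (the real attribute lives at the C level, so this is only an illustration of the mechanism). ArrayLike and the warning text are made up for the example.

```python
import warnings


class ArrayLike:
    """Toy stand-in for an array object; not NumPy code."""

    def __init__(self, base=None):
        self._base = base          # the "under the covers" attribute

    @property
    def base(self):
        # Old public name still works, but reminds callers it is not a stable API.
        warnings.warn(
            ".base is an implementation detail; its semantics may change",
            DeprecationWarning,
            stacklevel=2,
        )
        return self._base


owner = ArrayLike()
view = ArrayLike(base=owner)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")    # make sure the warning is not filtered out
    result = view.base                 # deprecated access path

print(result is owner, [w.category.__name__ for w in caught])
```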
Well said. This reminds me of the fellow who used genetic programming to design an algorithm for a signal processing chip and discovered that the result was making use of some stray capacitance present on the chip. Here users such as Gael are the genetic programmers and .base is the stray capacitance. I tend to the ._base idea, but I think this needs to be addressed in detail.

Chuck