[Numpy-discussion] Behavior of .base

Mon Oct 1 08:20:32 EDT 2012

On Sun, Sep 30, 2012 at 8:59 PM, Travis Oliphant <travis at continuum.io> wrote:
> Hey all,
>
> In a GitHub discussion with Gael and Nathaniel, we came up with a proposal for .base that we should put before this list. Traditionally, .base has always pointed to None for arrays that owned their own memory and to the "most immediate" array object parent for arrays that did not own their own memory. There was a long-standing issue related to running out of stack space that this behavior created.

To be *completely* accurate, I'd say that they've always pointed to
some object that owned the underlying memory. Usually that's an
ndarray, but sometimes that's a thing exposing the buffer interface,
sometimes it's a thing exposing __array_interface__, sometimes it's a
mmap object, sometimes it's some random ad hoc C-level wrapper
object[1], etc.

[1] e.g. https://github.com/njsmith/scikits-sparse/blob/master/scikits/sparse/cholmod.pyx#L225
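
A minimal sketch of the ndarray and buffer cases, assuming a reasonably
recent NumPy. The variable names are only illustrative, and the exact
object stored in .base for the buffer case may differ between versions,
so the code prints its type rather than assuming one.

```python
import numpy as np

owner = np.arange(10)      # allocates and owns its memory
view = owner[2:8]          # borrows owner's memory

# An array wrapping a foreign buffer: the memory belongs to the bytearray,
# so the array must keep some object alive that holds on to it.
raw = bytearray(16)
wrapped = np.frombuffer(raw, dtype=np.uint8)

print(owner.base)          # None: owner holds its own data
print(view.base is owner)  # True: the view's base is the owning ndarray
print(type(wrapped.base))  # a non-ndarray object that keeps `raw` alive
```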

> Recently this behavior was altered so that .base always points to "the original" object holding the memory (something exposing the buffer interface). This created some problems for users who relied on the fact that most of the time .base pointed to an instance of an array object.
>
> The proposal here is to change the behavior of .base for arrays that don't own their own memory so that the .base attribute of an array points to "the most original object" that is still an instance of the type of the array. This would go into the 1.7.0 release so as to correct the issues reported.
>
> What are reactions to this proposal?

As a band-aid to avoid breaking some code in 1.7, it seems reasonable
to me. I was actually considering proposing basically the same idea.
But it's only a band-aid; the larger problem is that we don't *know*
what semantics people are relying on for "base" (and probably aren't
implementing the ones people think we are, either before or after this
change).

As an example of how messy this is: do you know whether Gael's code
will still work, after we make this fix, if someone uses as_strided()
on a (view of a) memmap array?

Answer: as_strided() creates an ndarray view on an ad-hoc object with
an __array_interface__ attribute, and this dummy object ends up as the
returned ndarray's .base. According to the proposed rule, the .base
chain collapsing will stop at this point. So it isn't true that an
array that is ultimately backed by mmap will have a numpy.memmap array as
its .base. However, if you read stride_tricks.py, it turns out the
dummy object as_strided makes does happen to use the name ".base" for
its attribute holding the original array, so Gael's code will work
correctly in this case iff he keeps the .base walking code in place
(which would otherwise serve no purpose after Travis' change).
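
To make the as_strided() case concrete, here is a rough sketch of the
kind of .base walking described above. It uses a plain ndarray instead
of a memmap to stay self-contained, base_chain() is a hypothetical
helper (not a NumPy function), and what sits in the middle of the chain
is an implementation detail of stride_tricks.py that may change.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def base_chain(obj):
    """Return [obj, obj.base, obj.base.base, ...], stopping when an
    object has no .base attribute or its .base is None."""
    chain = [obj]
    while getattr(chain[-1], "base", None) is not None:
        chain.append(chain[-1].base)
    return chain

original = np.arange(12, dtype=np.int64)
# Reinterpret the 12 int64 values as a 3x4 array (row stride 32 bytes).
strided = as_strided(original, shape=(3, 4), strides=(32, 8))

chain = base_chain(strided)
for link in chain:
    print(type(link).__name__)

# Following .base through whatever helper object as_strided used
# eventually reaches the array we started from.
reaches_original = any(link is original for link in chain)
print(reaches_original)
```

Note that the walk only works because the helper object happens to call
its attribute "base"; nothing in the ndarray API promises that.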

Anyway, my point is: If we have to carefully analyze interactions
between code in numpy.lib.stride_tricks, numpy.core.memmap, and a
third-party library, just to figure out which sorts of
reference-counting changes are correct in the core ndarray object,
then we have a problem. This is horrible cross-coupling, the sort of
thing that, if allowed to proliferate, makes it impossible to ever
know whether code is correct or not.

So even if we put in a band-aid for 1.7, we really don't want to be
guaranteeing this kind of stuff forever, and should aggressively
encourage people to stop using .base in these ways. The mmap thing
should really switch to something more reliable and less tightly
coupled to the rest of the code all over numpy, like I described here:
  http://mail.scipy.org/pipermail/numpy-discussion/2012-September/064003.html

How can we discourage people from doing this in the future? Can we
make .base write-only from the Python level (with suitable deprecation
period)? Rename it to ._base (likewise) so that it's still possible to
peek under the covers but we remind people that it's really an
implementation detail with poorly defined semantics that might change?
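
For illustration only, the renaming option could look something like
the following pure-Python sketch. The Array class is a hypothetical
stand-in, not ndarray, and the warning text is made up; the real change
would have to happen at the C level.

```python
import warnings

class Array:
    """Hypothetical stand-in for ndarray, used only for this sketch."""

    def __init__(self, base=None):
        self._base = base  # private name: an implementation detail

    @property
    def base(self):
        # The old public name keeps working during the deprecation
        # period, but reminds callers that its semantics may change.
        warnings.warn(
            ".base is an implementation detail and may change",
            DeprecationWarning,
            stacklevel=2,
        )
        return self._base

owner = Array()
view = Array(base=owner)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    via_public = view.base  # triggers the DeprecationWarning

print(via_public is view._base)
print(caught[0].category.__name__)
```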

-n