[Cython] Be more forgiving about memoryview strides

Fri Mar 1 22:14:53 CET 2013

On Fri, 2013-03-01 at 12:17 -0800, Robert Bradshaw wrote:
> On Fri, Mar 1, 2013 at 7:56 AM, Sebastian Berg
> <sebastian at sipsolutions.net> wrote:
> > On Thu, 2013-02-28 at 23:25 -0800, Robert Bradshaw wrote:
> >> On Thu, Feb 28, 2013 at 11:12 AM, Nathaniel Smith <njs at pobox.com> wrote:
> >> > On Thu, Feb 28, 2013 at 5:50 PM, Robert Bradshaw <robertwb at gmail.com> wrote:
> >> >> On Thu, Feb 28, 2013 at 7:13 AM, Sebastian Berg
> >> >> <sebastian at sipsolutions.net> wrote:
> >> >>> Hey,
> >> >>>
> >> >>> Maybe someone here already saw it (I don't have a track account, or I
> >> >>> would just create a ticket), but it would be nice if Cython was more
> >> >>> forgiving about contiguous requirements on strides. In the future this
> >> >>> would make it easier for numpy to go forward with changing the
> >> >>> contiguous flags to be more reasonable for its purpose, and second also
> >> >>> to allow old (and maybe for the moment remaining) corner cases in numpy
> >> >>> to slip past (as well as possibly the same for other programs...). An
> >> >>> example is (see also https://github.com/numpy/numpy/issues/2956 and the
> >> >>> PR linked there for more details):
> >> >>>
> >> >>> def add_one(array):
> >> >>>     cdef double[::1] a = array
> >> >>>     a[0] += 1.
> >> >>>     return array
> >> >>>
> >> >>> giving:
> >> >>>
> >> >>>>>> add_one(np.ascontiguousarray(np.arange(10.)[::100]))
> >> >>> ValueError: Buffer and memoryview are not contiguous in the same
> >> >>> dimension.
> >> >>>
> >> >>> This could easily be changed if MemoryViews check the strides as "can be
> >> >>> interpreted as contiguous". That means that if shape[i] == 1, then
> >> >>> strides[i] are arbitrary (you can just change them if you like). This is
> >> >>> also the case for 0-sized arrays, which are arguably always contiguous,
> >> >>> no matter their strides are!
> >> >>
> >> >> I was under the impression that the primary value for contiguous is
> >> >> that it a foo[::1] can be interpreted as a foo*. Letting strides be
> >> >> arbitrary completely breaks this, right?
> >> >
> >> > Nope. The natural definition of "C contiguous" is "the array entries
> >> > are arranged in memory in the same way they would be if they were a
> >> > multidimensional C array" (i.e., what you said.) But it turns out that
> >> > this is *not* the definition that numpy and cython use!
> >> >
> >> > The issue is that the above definition is a constraint on the actual
> >> > locations of items in memory, i.e., given a shape, it tells you that
> >> > for every index,
> >> >  (a)  sum(index * strides) == sum(index * cumprod(shape[::-1])[::-1] * itemsize)
> >> > Obviously this equality holds if
> >> >  (b)  strides == cumprod(shape[::-1])[::-1] * itemsize
> >> > (Or for F-contiguity, we have
> >> >  (b')  strides == cumprod(shape) * itemsize
> >> > )
> >> >
> >> > (a) is the natural definition of "C contiguous". (b) is the definition
> >> > of "C contiguous" used by numpy and cython. (b) implies (a). But (a)
> >> > does not imply (b), i.e., there are arrays that are C-contiguous which
> >> > numpy and cython think are discontiguous. (Also in numpy there are
> >> > some weird cases where numpy accidentally uses the correct definition,
> >> > I think, which is the point of Sebastian's example.)
> >> >
> >> > In particular, if shape[i] == 1, then the value of stride[i] really
> >> > should be irrelevant to judging contiguity, because the only thing you
> >> > can do with strides[i] is multiply it by index[i], and if shape[i] ==
> >> > 1 then index[i] is always 0. So an array of int8's with shape = (10,
> >> > 1), strides = (1, 73) is contiguous according to (a), but not
> >> > according to (b). Also if shape[i] is 0 for any i, then the entire
> >> > contents of the strides array becomes irrelevant to judging
> >> > contiguity; all zero-sized arrays are contiguous according to (a), but
> >> > not (b).
> >>
> >> Thanks for clarifying.
> >>
> >> Yes, I think it makes a lot of sense to loosen our definition for
> >> Cython. Internally, I think the only way we use this assumption is in
> >> not requiring that the first/final index be multiplied by the stride,
> >> which should be totally fine. But this merits closer inspection as
> >> there may be something else.
> >
> > The only problem I saw was code that used strides[-1] instead of the
> > itemsize (e.g. using strides[i]/strides[-1] to then index the typed
> > buffer instead of using strides[i]/itemsize). But that should be easy to
> > check, numpy had two or so cases of that itself...
> 
> I'd be surprised if we do that, but the only way to be sure would be
> to look at the code.
> 
> >> > (This is really annoying for numpy because given, say, a column vector
> >> > with shape (n, 1), it is impossible to be both C- and F-contiguous
> >> > according to the (b)-style definition. But people expect expect
> >> > various operations to preserve C versus F contiguity, so there are
> >> > heuristics in numpy that try to guess whether various result arrays
> >> > should pretend to be C- or F-contiguous, and we don't even have a
> >> > consistent idea of what it would mean for this code to be working
> >> > correctly, never mind test it and keep it working. OTOH if we just fix
> >> > numpy to use the (a) definition, then it turns out a bunch of
> >> > third-party code breaks, like, for example, cython.)
> >>
> >> Can you give some examples?
> >>
> >
> > Not sure for what :).
> 
> I meant examples of possible breakage.
> 
> > Maybe this is an example:
> >
> > In [1]: a = np.asmatrix(np.arange(9).reshape(3,3).T)
> >
> > In [2]: a.flags.f_contiguous
> > Out[2]: True
> >
> > In [3]: a[:,0].flags
> > Out[3]:
> >   C_CONTIGUOUS : True
> >   F_CONTIGUOUS : False
> >   ...
> >
> > Where that view could just as well be F-contiguous, and the fact that
> > numpy, when in doubt, prefers C-contiguous might be surprising. And
> > since it would be less strict to begin with, numpy may safe a copy here
> > or there (without adding weird stride fixing code).
> >
> > Examples for code breakage would be this check as well as scikit-learn
> > and scipy in 3 or 4 cases making the assumption above of itemsize ==
> > strides[-1] for c-contiguous arrays.
> 
> Ah.
> 
> So, assuming Cython itself isn't making such assumptions, what support
> do you want from Cython? I can see (1) accepting as c/f contiguous
> arrays that meet this looser definition and (2) setting these flags in
> memoryviews we produce under this looser definition. Is there anything
> else?
> 

Just accepting it would be cool. I am not aware that (2) would matter
for numpy, so just do whatever you feel best. I doubt numpy will change
them in a release version any time soon, but will be nice to know that
it can without breaking cython based code!

I am wondering if there is a way to work around/warn users doing this
(this is what sk-learn had):

    cdef np.ndarray[ndim=2, mode='c'] a = array
    step = a.strides[0]/a.strides[1]
    # Then using a.data[step]

but I am not sure. I first thought that if it is easy, you could point
a.strides to the buffers strides, allowing numpy to fix those. But just
realized that it would be weird since ndarray.strides is an attribute
that can be set.
And since as I understand this is discouraged already, it is probably
not worth it to think about it much.

- Sebastian

> - Robert
> _______________________________________________
> cython-devel mailing list
> cython-devel at python.org
> http://mail.python.org/mailman/listinfo/cython-devel
>