[Cython] Be more forgiving about memoryview strides

Nathaniel Smith njs at pobox.com
Thu Feb 28 20:12:09 CET 2013


On Thu, Feb 28, 2013 at 5:50 PM, Robert Bradshaw <robertwb at gmail.com> wrote:
> On Thu, Feb 28, 2013 at 7:13 AM, Sebastian Berg
> <sebastian at sipsolutions.net> wrote:
>> Hey,
>>
>> Maybe someone here already saw it (I don't have a trac account, or I
>> would just create a ticket), but it would be nice if Cython were more
>> forgiving about contiguity requirements on strides. In the future this
>> would make it easier for numpy to go forward with changing the
>> contiguous flags to be more reasonable for its purpose, and it would
>> also allow old (and maybe, for the moment, remaining) corner cases in
>> numpy to slip past (as well as possibly the same for other
>> programs...). An example is (see also
>> https://github.com/numpy/numpy/issues/2956 and the PR linked there
>> for more details):
>>
>> def add_one(array):
>>     cdef double[::1] a = array
>>     a[0] += 1.
>>     return array
>>
>> giving:
>>
>>>>> add_one(np.ascontiguousarray(np.arange(10.)[::100]))
>> ValueError: Buffer and memoryview are not contiguous in the same
>> dimension.
>>
>> This could easily be changed if MemoryViews checked the strides as
>> "can be interpreted as contiguous". That means that if shape[i] == 1,
>> then strides[i] is arbitrary (you can just change it if you like).
>> The same is true for 0-sized arrays, which are arguably always
>> contiguous, no matter what their strides are!
>
> I was under the impression that the primary value of contiguous is
> that a foo[::1] can be interpreted as a foo*. Letting strides be
> arbitrary completely breaks this, right?

Nope. The natural definition of "C contiguous" is "the array entries
are arranged in memory in the same way they would be if they were a
multidimensional C array" (i.e., what you said). But it turns out that
this is *not* the definition that numpy and cython use!

The issue is that the above definition is a constraint on the actual
locations of items in memory, i.e., given a shape, it tells you that
for every valid index vector (summing over the axes i),
 (a)  sum(index[i] * strides[i]) == sum(index[i] * prod(shape[i+1:]) * itemsize)
Obviously this equality holds if
 (b)  strides[i] == prod(shape[i+1:]) * itemsize    for every i
(Or for F-contiguity, we have
 (b')  strides[i] == prod(shape[:i]) * itemsize    for every i
)
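
To make the difference concrete, here's a toy Python transcription of
the two definitions (the names strides_b and is_c_contiguous_a are
mine, just for illustration; this is not how numpy or cython actually
implement the check):

from itertools import product
from math import prod

def strides_b(shape, itemsize):
    # Definition (b): the canonical C-contiguous strides,
    # strides[i] == prod(shape[i+1:]) * itemsize.
    return tuple(prod(shape[i+1:]) * itemsize for i in range(len(shape)))

def is_c_contiguous_a(shape, strides, itemsize):
    # Definition (a): every valid index must land on the same memory
    # offset it would have under the canonical strides from (b).
    canonical = strides_b(shape, itemsize)
    return all(
        sum(i * s for i, s in zip(idx, strides)) ==
        sum(i * s for i, s in zip(idx, canonical))
        for idx in product(*(range(n) for n in shape))
    )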

(a) is the natural definition of "C contiguous". (b) is the definition
of "C contiguous" used by numpy and cython. (b) implies (a). But (a)
does not imply (b), i.e., there are arrays that are C-contiguous which
numpy and cython think are discontiguous. (Also, there are some weird
cases where numpy accidentally uses the correct definition, I think,
which is the point of Sebastian's example.)

In particular, if shape[i] == 1, then the value of strides[i] really
should be irrelevant to judging contiguity, because the only thing you
can do with strides[i] is multiply it by index[i], and if shape[i] ==
1 then index[i] is always 0. So an array of int8's with shape = (10,
1), strides = (1, 73) is contiguous according to (a), but not
according to (b). Also, if shape[i] == 0 for any i, then the entire
contents of the strides array become irrelevant to judging contiguity;
all zero-sized arrays are contiguous according to (a), but not
according to (b).
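
Running these cases through the toy checker above:

>>> is_c_contiguous_a((10, 1), (1, 73), itemsize=1)
True
>>> strides_b((10, 1), itemsize=1)   # the strides that (b) insists on
(1, 1)
>>> is_c_contiguous_a((0, 5), (999, -7), itemsize=8)   # 0-sized, junk strides
True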

(This is really annoying for numpy because given, say, a column vector
with shape (n, 1), it is impossible for it to be both C- and
F-contiguous according to the (b)-style definition, though the sketch
below shows it is under (a). But people expect various operations to
preserve C versus F contiguity, so there are heuristics in numpy that
try to guess whether various result arrays should pretend to be C- or
F-contiguous, and we don't even have a consistent idea of what it
would mean for this code to be working correctly, never mind test it
and keep it working. OTOH if we just fix numpy to use the (a)
definition, then it turns out a bunch of third-party code breaks,
like, for example, cython.)
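
To make that concrete (again just a toy sketch, reusing the helpers
from the earlier snippet): an analogous F-order checker shows the same
strides passing both tests under (a), which is exactly what the
(b)/(b') definitions can't express:

def is_f_contiguous_a(shape, strides, itemsize):
    # Like is_c_contiguous_a above, but checked against the canonical
    # F-order strides from (b'): strides[i] == prod(shape[:i]) * itemsize.
    canonical = tuple(prod(shape[:i]) * itemsize for i in range(len(shape)))
    return all(
        sum(i * s for i, s in zip(idx, strides)) ==
        sum(i * s for i, s in zip(idx, canonical))
        for idx in product(*(range(n) for n in shape))
    )

# A (5, 1) float64 column vector with canonical C strides (8, 8)
# is contiguous both ways under definition (a):
assert is_c_contiguous_a((5, 1), (8, 8), itemsize=8)
assert is_f_contiguous_a((5, 1), (8, 8), itemsize=8)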

-n

