[Numpy-discussion] Re: Conversion from C-layout to Fortran-layout in Cython

Nov. 11, 2021


      Ah I see. Thank you Sebastian, I was hoping to avoid all that blocking
(since HW dependency leaves some performance at many tables) or recursive
zooming stuff with some off-the-shelf tool but apparently I'm walking in
the dusty corners again collecting spider webs :) As you said, there are
quite a lot of low hanging fruits we might collect regarding such data
manipulations which will boost basically everything since these ops are
ubiquitous.

In case any one is wondering the context; this is for the scipy.linalg.expm
overhaul mainly kept updated at https://github.com/scipy/scipy/issues/12838


On Thu, Nov 11, 2021 at 2:40 AM Sebastian Berg <sebastian@sipsolutions.net>
wrote:
...
On Thu, 2021-11-11 at 01:04 +0100, Ilhan Polat wrote:
...
Hmm not sure I understand the question but this is what I mean by naive
looping, suppose I allocate a scratch register work3, then
for i in range(n): for j in range(n): work3[j*n+i] = work2[i*n+j]
NumPy does not end up doing anything special.  Special would be to use
a blocked iteration and NumPy doesn't have it unfortunately.
The only thing it does is use pointers to cut some overheads, something
(very rough) like:
ptr1 = arr1.data
    ptr2_col = arr2.data
strides2_col = arr.strides[0]
    strides2_row = arr2.strides[1]
for i in range(n):
        ptr2 = ptr2_col
        for j in range(n):
             *ptr2 = *ptr1
             ptr1++
             ptr2 += strides2_row
ptr2_col += strides2_col
And if you write that in cython, you are likely faster since you can
cut quite a few corners (all is aligned, contiguous, etc.).
(with potentially, loop unrolling/compiler optimization fluctuations,
numpy probably tells GCC to unroll and optimize the innermost loop
there)
I would not be surprised if you can find a lightweight fast copy-
transpose out there, or if some tools like MKL/Cuda just include it. It
is too bad NumPy is missing it.
Cheers,
Sebastian
...
This basically doing the row to column based indexing and obviously we
create a lot of cache misses since work3 entries are accessed in the
shuffled fashion. The idea of all this Cython attempt is to avoid such
access hence if the original some_C_layout_func takes 10 units of time,
6
of it is spent on this loop when the data doesn't fit the cache. When I
discard the correctness of the function and comment out this loop and
then
remeasure the original func spends roughly 3 units of time. However
take
any random array in C order in NumPy using regular Python and use
np.asfortranarray() it spends roughly about 0.1 units of time. So
apparently it is possible to do this somehow at the low level in a
performant way. That's what I would like to understand or clear out my
misunderstanding.
On Thu, Nov 11, 2021 at 12:56 AM Andras Deak <deak.andris@gmail.com>
wrote:
...
On Thursday, November 11, 2021, Ilhan Polat <ilhanpolat@gmail.com>
wrote:
...
I've asked this in Cython mailing list but probably I should also
get
some feedback here too.
I have the following function defined in Cython and using flat
memory
pointers to hold n by n array data.
cdef some_C_layout_func(double[:, :, ::1] Am) nogil: # ... cdef
double *
work1 = <double*>malloc(n*n*sizeof(double)) cdef double *work2 =
<double
*>malloc(n*n*sizeof(double)) # ... # Lots of C-layout operations
here #
... dgetrs(<char*>'T', &n, &n, &work1[0], &n, &ipiv[0], &work2[0],
&n, &
info ) dcopy(&n2, &work2[0], &int1, &Am[0, 0, 0], &int1) free(...)
Here, I have done everything in C layout with work1 and work2 but I
have
to convert work2 into Fortran layout to be able to solve AX = B. A
can be
transposed in Lapack internally via the flag 'T' so the only
obstacle I
have now is to shuffle work2 which holds B transpose in the eyes of
Fortran
since it is still in C layout.
If I go naively and make loops to get one layout to the other that
actually spoils all the speed benefits from this Cythonization due
to cache
misses. In fact 60% of the time is spent in that naive loop across
the
whole function.
Sorry if this is a dumb question, but is this true whether or not you
loop
over contiguous blocks of the input vs the output array? Or is the
faster
of the two options still slower than the linsolve?
András
...
Same goes for the copy_fortran() of memoryviews.
I have measured the regular NumPy np.asfortranarray()  and the
performance is quite good enough compared to the actual linear
solve. Hence
whatever it is doing underneath I would like to reach out and do
the same
possibly via the C-API. But my C knowledge basically failed me
around this
line
https://github.com/numpy/numpy/blob/8dbd507fb6c854b362c26a0dd056cd04c9c10f25...
...
...
...
I have found the SO post from
https://stackoverflow.com/questions/45143381/making-a-memoryview-c-contiguou...
...
but I am not sure if that is the canonical way to do it in newer
Python
versions.
Can anyone show me how to go about it without interacting with
Python
objects?
Best,
ilhan
_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-leave@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: ilhanpolat@gmail.com
_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-leave@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: sebastian@sipsolutions.net
_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-leave@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: ilhanpolat@gmail.com