Here are some actual numbers within the context of operation (nogil removed and def'd for linetracing)

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================                                           
                                             # Bilinear identity to shave off some flops
                                             # inv(V-U) (V+U) = inv(V-U) (V-U+2V) = I + 2 inv(V-U) U
       1         15.0     15.0      0.1      daxpy(&n2, &neg_one, &work2[0], &int1, &work3[0], &int1)
                                         
                                             # Convert array layout for solving AX = B
       1          3.0      3.0      0.0      for i in range(n):
      60        137.0      2.3      0.9          for j in range(n):
    3600       8437.0      2.3     57.7              work4[j*n+i] = work2[i*n+j]
                                         
       1        408.0    408.0      2.8      dgetrf( &n, &n, &work3[0], &n, &ipiv[0], &info )
       1       4122.0   4122.0     28.2      dgetrs(<char*>'T', &n, &n, &work3[0], &n, &ipiv[0], &work4[0], &n, &info )
       1         25.0     25.0      0.2      dscal(&n2, &two, &work4[0], &int1)
                                             # Add identity matrix
       1          4.0      4.0      0.0      for i in range(n):
      60        146.0      2.4      1.0          work4[i*(n+1)] += 1.
       1         16.0     16.0      0.1      dcopy(&n2, &work4[0], &int1, &Am[0, 0, 0], &int1)

On Thu, Nov 11, 2021 at 1:04 AM Ilhan Polat <ilhanpolat@gmail.com> wrote:

Hmm not sure I understand the question but this is what I mean by naive looping, suppose I allocate a scratch register work3, then

for i in range(n): for j in range(n): work3[j*n+i] = work2[i*n+j]

This basically doing the row to column based indexing and obviously we create a lot of cache misses since work3 entries are accessed in the shuffled fashion. The idea of all this Cython attempt is to avoid such access hence if the original some_C_layout_func takes 10 units of time, 6 of it is spent on this loop when the data doesn't fit the cache. When I discard the correctness of the function and comment out this loop and then remeasure the original func spends roughly 3 units of time. However take any random array in C order in NumPy using regular Python and use np.asfortranarray() it spends roughly about 0.1 units of time. So apparently it is possible to do this somehow at the low level in a performant way. That's what I would like to understand or clear out my misunderstanding.

On Thu, Nov 11, 2021 at 12:56 AM Andras Deak <deak.andris@gmail.com> wrote:
On Thursday, November 11, 2021, Ilhan Polat <ilhanpolat@gmail.com> wrote:
I've asked this in Cython mailing list but probably I should also get some feedback here too.

I have the following function defined in Cython and using flat memory pointers to hold n by n array data.

cdef some_C_layout_func(double[:, :, ::1] Am) nogil: # ... cdef double *work1 = <double*>malloc(n*n*sizeof(double)) cdef double *work2 = <double*>malloc(n*n*sizeof(double)) # ... # Lots of C-layout operations here # ... dgetrs(<char*>'T', &n, &n, &work1[0], &n, &ipiv[0], &work2[0], &n, &info ) dcopy(&n2, &work2[0], &int1, &Am[0, 0, 0], &int1) free(...)

Here, I have done everything in C layout with work1 and work2 but I have to convert work2 into Fortran layout to be able to solve AX = B. A can be transposed in Lapack internally via the flag 'T' so the only obstacle I have now is to shuffle work2 which holds B transpose in the eyes of Fortran since it is still in C layout.

If I go naively and make loops to get one layout to the other that actually spoils all the speed benefits from this Cythonization due to cache misses. In fact 60% of the time is spent in that naive loop across the whole function.

Sorry if this is a dumb question, but is this true whether or not you loop over contiguous blocks of the input vs the output array? Or is the faster of the two options still slower than the linsolve?

András

Same goes for the copy_fortran() of memoryviews.

I have measured the regular NumPy np.asfortranarray() and the performance is quite good enough compared to the actual linear solve. Hence whatever it is doing underneath I would like to reach out and do the same possibly via the C-API. But my C knowledge basically failed me around this line https://github.com/numpy/numpy/blob/8dbd507fb6c854b362c26a0dd056cd04c9c10f25/numpy/core/src/multiarray/multiarraymodule.c#L1817

I have found the SO post from https://stackoverflow.com/questions/45143381/making-a-memoryview-c-contiguous-fortran-contiguous but I am not sure if that is the canonical way to do it in newer Python versions.

Can anyone show me how to go about it without interacting with Python objects?

Best,
ilhan

_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-leave@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: ilhanpolat@gmail.com