<div dir="ltr"><div>Dear all, <br></div><div><br></div><div>I'm writing some helper Cythpm functions for scipy.linalg which is kinda performant and usable. And there is still quite some wiggle room for more. <br></div><div><br></div><div>In many linalg routines there is a lot of performance benefit if the structure can be discovered in a cheap and reliable way at the outset. For example if symmetric then eig can delegate to eigh or if triangular then triangular solvers can be used in linalg.solve and lstsq so forth</div><div><br></div><div>Here is the Cythonized version for Jupyter notebook to paste to discover the lower/upper bandwidth of square array A that competes well with A != 0 just to use some low level function (note the latter returns an array hence more cost is involved) There is a higher level supervisor function that checks C-contiguousness otherwise specializes to different versions of it <br></div><div><br></div><div>Initial cell <br></div><div><br></div><div><span style="font-family:monospace">%load_ext Cython<br>%load_ext line_profiler<br>import cython<br>import line_profiler</span></div><div><br></div><div>Then another cell</div><div><span style="font-family:monospace"><br></span></div><div><span style="font-family:monospace">%%cython <br># cython: language_level=3<br># cython: linetrace=True<br># cython: binding = True<br># distutils: define_macros=CYTHON_TRACE=1<br># distutils: define_macros=CYTHON_TRACE_NOGIL=1<br><br>cimport cython<br>cimport numpy as cnp<br>import numpy as np<br>import line_profiler<br>ctypedef fused np_numeric_t:<br>    cnp.int8_t<br>    cnp.int16_t<br>    cnp.int32_t<br>    cnp.int64_t<br>    cnp.uint8_t<br>    cnp.uint16_t<br>    cnp.uint32_t<br>    cnp.uint64_t<br>    cnp.float32_t<br>    cnp.float64_t<br>    cnp.complex64_t<br>    cnp.complex128_t<br>    cnp.int_t<br>    cnp.long_t<br>    cnp.longlong_t<br>    cnp.uint_t<br>    cnp.ulong_t<br>    cnp.ulonglong_t<br>    cnp.intp_t<br>    cnp.uintp_t<br>    cnp.float_t<br>    cnp.double_t<br>    cnp.longdouble_t<br><br><br>@cython.linetrace(True)<br>@cython.initializedcheck(False)<br>@cython.boundscheck(False)<br>@cython.wraparound(False)<br>cpdef inline (int, int) band_check_internal(np_numeric_t[:, ::1]A): <br>    cdef Py_ssize_t n = A.shape[0], lower_band = 0, upper_band = 0, r, c<br>    cdef np_numeric_t zero = 0<br><br>    for r in xrange(n):<br>        # Only bother if outside the existing band:<br>        for c in xrange(r-lower_band):<br>            if A[r, c] != zero:<br>                lower_band = r - c<br>                break<br><br>        for c in xrange(n - 1, r + upper_band, -1):<br>            if A[r, c] != zero:<br>                upper_band = c - r<br>                break<br><br>    return lower_band, upper_band</span></div><div><span style="font-family:monospace"><br></span></div><div><span style="font-family:monospace"><font face="arial,sans-serif">Final cell for use-case ---------------</font><br></span></div><div><span style="font-family:monospace"><br></span></div><div><span style="font-family:monospace"># Make arbitrary lower-banded array<br>n = 50 # array size <br>k = 3 # k'th subdiagonal<br>R = np.zeros([n, n], dtype=np.float32)<br>R[[x for x in range(n)], [x for x in range(n)]] = 1<br>R[[x for x in range(n-1)], [x for x in range(1,n)]] = 1<br>R[[x for x in range(1,n)], [x for x in range(n-1)]] = 1<br>R[[x for x in range(k,n)], [x for x in range(n-k)]] = 2</span></div><div><span style="font-family:monospace"><br></span></div><div><span style="font-family:monospace"><span style="font-family:arial,sans-serif">Some very haphazardly put together metrics </span><br></span></div><div><span style="font-family:monospace"><br></span></div><div><span style="font-family:monospace">%timeit band_check_internal(R)<br>2.59 µs ± 84.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)<br><br>%timeit np.linalg.solve(R, zzz)<br>824 µs ± 6.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)</span></div><div><span style="font-family:monospace"><br></span></div><div><span style="font-family:monospace">%timeit R != 0.<br>1.65 µs ± 43.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)<font face="arial,sans-serif"><br></font></span></div><div><span style="font-family:monospace"><font face="arial,sans-serif"><br></font></span></div><div><span style="font-family:monospace"><font face="arial,sans-serif">So the worst case cost is negligible in general (note that the given code is slower as it uses the fused type however if I go with tempita standalone version is faster)<br></font></span></div><div><span style="font-family:monospace"><font face="arial,sans-serif"><br></font></span></div><div><span style="font-family:monospace"><font face="arial,sans-serif">Two questions: <br></font></span></div><div><span style="font-family:monospace"><font face="arial,sans-serif"><br></font></span></div><div><span style="font-family:monospace"><font face="arial,sans-serif">1) This is missing np.half/float16 functionality since any arithmetic with float16 is might not be reliable including nonzero check. IS it safe to view it as np.uint16 and use that specialization? I'm not sure about the sign bit hence the question. I can leave this out since almost all linalg suite rejects this datatype due to well-known lack of supprt. <br></font></span></div><div><span style="font-family:monospace"><font face="arial,sans-serif"><br></font></span></div><div><span style="font-family:monospace"><font face="arial,sans-serif">2) Should this be in NumPy or SciPy linalg? It is quite relevant to be on SciPy but then again this stuff is purely about array structures. But if the opinion is for NumPy then I would need a volunteer because NumPy codebase flies way above my head. <br></font></span></div><div><span style="font-family:monospace"><font face="arial,sans-serif"><br></font></span></div><div><br><span style="font-family:monospace"></span></div><div>All feedback welcome</div><div><br></div><div>Best</div><div>ilhan<br></div><div><br></div></div>