Dear all,
I'm writing some helper Cythpm functions for scipy.linalg which is kinda performant and usable. And there is still quite some wiggle room for more.
In many linalg routines there is a lot of performance benefit if the structure can be discovered in a cheap and reliable way at the outset. For example if symmetric then eig can delegate to eigh or if triangular then triangular solvers can be used in linalg.solve and lstsq so forth
Here is the Cythonized version for Jupyter notebook to paste to discover the lower/upper bandwidth of square array A that competes well with A != 0 just to use some low level function (note the latter returns an array hence more cost is involved) There is a higher level supervisor function that checks C-contiguousness otherwise specializes to different versions of it
Initial cell
%load_ext Cython %load_ext line_profiler import cython import line_profiler
Then another cell
%%cython # cython: language_level=3 # cython: linetrace=True # cython: binding = True # distutils: define_macros=CYTHON_TRACE=1 # distutils: define_macros=CYTHON_TRACE_NOGIL=1
cimport cython cimport numpy as cnp import numpy as np import line_profiler ctypedef fused np_numeric_t: cnp.int8_t cnp.int16_t cnp.int32_t cnp.int64_t cnp.uint8_t cnp.uint16_t cnp.uint32_t cnp.uint64_t cnp.float32_t cnp.float64_t cnp.complex64_t cnp.complex128_t cnp.int_t cnp.long_t cnp.longlong_t cnp.uint_t cnp.ulong_t cnp.ulonglong_t cnp.intp_t cnp.uintp_t cnp.float_t cnp.double_t cnp.longdouble_t
@cython.linetrace(True) @cython.initializedcheck(False) @cython.boundscheck(False) @cython.wraparound(False) cpdef inline (int, int) band_check_internal(np_numeric_t[:, ::1]A): cdef Py_ssize_t n = A.shape[0], lower_band = 0, upper_band = 0, r, c cdef np_numeric_t zero = 0
for r in xrange(n): # Only bother if outside the existing band: for c in xrange(r-lower_band): if A[r, c] != zero: lower_band = r - c break
for c in xrange(n - 1, r + upper_band, -1): if A[r, c] != zero: upper_band = c - r break
return lower_band, upper_band
Final cell for use-case ---------------
# Make arbitrary lower-banded array n = 50 # array size k = 3 # k'th subdiagonal R = np.zeros([n, n], dtype=np.float32) R[[x for x in range(n)], [x for x in range(n)]] = 1 R[[x for x in range(n-1)], [x for x in range(1,n)]] = 1 R[[x for x in range(1,n)], [x for x in range(n-1)]] = 1 R[[x for x in range(k,n)], [x for x in range(n-k)]] = 2
Some very haphazardly put together metrics
%timeit band_check_internal(R) 2.59 µs ± 84.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.linalg.solve(R, zzz) 824 µs ± 6.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit R != 0. 1.65 µs ± 43.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
So the worst case cost is negligible in general (note that the given code is slower as it uses the fused type however if I go with tempita standalone version is faster)
Two questions:
1) This is missing np.half/float16 functionality since any arithmetic with float16 is might not be reliable including nonzero check. IS it safe to view it as np.uint16 and use that specialization? I'm not sure about the sign bit hence the question. I can leave this out since almost all linalg suite rejects this datatype due to well-known lack of supprt.
2) Should this be in NumPy or SciPy linalg? It is quite relevant to be on SciPy but then again this stuff is purely about array structures. But if the opinion is for NumPy then I would need a volunteer because NumPy codebase flies way above my head.
All feedback welcome
Best ilhan
Hi Ilhan,
Overall I think something like this would be great. However, I wonder if you considered having a specialized container with a structure tag instead of trying to discover the structure. If it's a container, it can neatly wrap various lapack storage schemes and dispatch to an appropriate lapack functionality. Possibly even sparse storage schemes. And it seems a bit more robust than trying to discover the structure (e.g. what about off-band elements of \sim 1e-16 etc).
The next question is of course if this should live in scipy/numpy .linalg or as a separate repo, at least for some time (maybe in the scipy organization?). So that it can iterate faster, among other things. (I'd be interested in contributing FWIW)
Cheers,
Evgeni
On Wed, Jun 30, 2021 at 1:22 AM Ilhan Polat ilhanpolat@gmail.com wrote:
Dear all,
I'm writing some helper Cythpm functions for scipy.linalg which is kinda performant and usable. And there is still quite some wiggle room for more.
In many linalg routines there is a lot of performance benefit if the structure can be discovered in a cheap and reliable way at the outset. For example if symmetric then eig can delegate to eigh or if triangular then triangular solvers can be used in linalg.solve and lstsq so forth
Here is the Cythonized version for Jupyter notebook to paste to discover the lower/upper bandwidth of square array A that competes well with A != 0 just to use some low level function (note the latter returns an array hence more cost is involved) There is a higher level supervisor function that checks C-contiguousness otherwise specializes to different versions of it
Initial cell
%load_ext Cython %load_ext line_profiler import cython import line_profiler
Then another cell
%%cython # cython: language_level=3 # cython: linetrace=True # cython: binding = True # distutils: define_macros=CYTHON_TRACE=1 # distutils: define_macros=CYTHON_TRACE_NOGIL=1
cimport cython cimport numpy as cnp import numpy as np import line_profiler ctypedef fused np_numeric_t: cnp.int8_t cnp.int16_t cnp.int32_t cnp.int64_t cnp.uint8_t cnp.uint16_t cnp.uint32_t cnp.uint64_t cnp.float32_t cnp.float64_t cnp.complex64_t cnp.complex128_t cnp.int_t cnp.long_t cnp.longlong_t cnp.uint_t cnp.ulong_t cnp.ulonglong_t cnp.intp_t cnp.uintp_t cnp.float_t cnp.double_t cnp.longdouble_t
@cython.linetrace(True) @cython.initializedcheck(False) @cython.boundscheck(False) @cython.wraparound(False) cpdef inline (int, int) band_check_internal(np_numeric_t[:, ::1]A): cdef Py_ssize_t n = A.shape[0], lower_band = 0, upper_band = 0, r, c cdef np_numeric_t zero = 0
for r in xrange(n): # Only bother if outside the existing band: for c in xrange(r-lower_band): if A[r, c] != zero: lower_band = r - c break for c in xrange(n - 1, r + upper_band, -1): if A[r, c] != zero: upper_band = c - r break return lower_band, upper_band
Final cell for use-case ---------------
# Make arbitrary lower-banded array n = 50 # array size k = 3 # k'th subdiagonal R = np.zeros([n, n], dtype=np.float32) R[[x for x in range(n)], [x for x in range(n)]] = 1 R[[x for x in range(n-1)], [x for x in range(1,n)]] = 1 R[[x for x in range(1,n)], [x for x in range(n-1)]] = 1 R[[x for x in range(k,n)], [x for x in range(n-k)]] = 2
Some very haphazardly put together metrics
%timeit band_check_internal(R) 2.59 µs ± 84.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.linalg.solve(R, zzz) 824 µs ± 6.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit R != 0. 1.65 µs ± 43.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
So the worst case cost is negligible in general (note that the given code is slower as it uses the fused type however if I go with tempita standalone version is faster)
Two questions:
This is missing np.half/float16 functionality since any arithmetic with float16 is might not be reliable including nonzero check. IS it safe to view it as np.uint16 and use that specialization? I'm not sure about the sign bit hence the question. I can leave this out since almost all linalg suite rejects this datatype due to well-known lack of supprt.
Should this be in NumPy or SciPy linalg? It is quite relevant to be on SciPy but then again this stuff is purely about array structures. But if the opinion is for NumPy then I would need a volunteer because NumPy codebase flies way above my head.
All feedback welcome
Best ilhan
NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Ah right. So two things, the original reason f9r this question is because I can't decide in https://github.com/scipy/scipy/pull/12824 whether others would also benefit from quick structure determination.
I can keep it private function or we can put them some misc or lib folder so all can use. Say there is a special method for triangular matrices but you can't guarantee the structure so you can quickly check for it. At worst O(n**2) complexity for diagonal arrays and almost O(2n) for full arrays makes it quite appealing.
But then again maybe NumPy is a better place since probably it will be faster to have this in pure C with the right headers and without the extra Cython overhead.
Funny you mention the container idea. This is precisely what I'm doing in PR mentioned above (I'll push when I'm done). I stole the idea from Tim Davis himself in a Julia discussion for keeping the factorization as an attribute to be used later if need be. So yes it makes a lot of sense Sparse or not.
On Wed, 30 Jun 2021, 19:14 Evgeni Burovski, evgeny.burovskiy@gmail.com wrote:
Hi Ilhan,
Overall I think something like this would be great. However, I wonder if you considered having a specialized container with a structure tag instead of trying to discover the structure. If it's a container, it can neatly wrap various lapack storage schemes and dispatch to an appropriate lapack functionality. Possibly even sparse storage schemes. And it seems a bit more robust than trying to discover the structure (e.g. what about off-band elements of \sim 1e-16 etc).
The next question is of course if this should live in scipy/numpy .linalg or as a separate repo, at least for some time (maybe in the scipy organization?). So that it can iterate faster, among other things. (I'd be interested in contributing FWIW)
Cheers,
Evgeni
On Wed, Jun 30, 2021 at 1:22 AM Ilhan Polat ilhanpolat@gmail.com wrote:
Dear all,
I'm writing some helper Cythpm functions for scipy.linalg which is kinda
performant and usable. And there is still quite some wiggle room for more.
In many linalg routines there is a lot of performance benefit if the
structure can be discovered in a cheap and reliable way at the outset. For example if symmetric then eig can delegate to eigh or if triangular then triangular solvers can be used in linalg.solve and lstsq so forth
Here is the Cythonized version for Jupyter notebook to paste to discover
the lower/upper bandwidth of square array A that competes well with A != 0 just to use some low level function (note the latter returns an array hence more cost is involved) There is a higher level supervisor function that checks C-contiguousness otherwise specializes to different versions of it
Initial cell
%load_ext Cython %load_ext line_profiler import cython import line_profiler
Then another cell
%%cython # cython: language_level=3 # cython: linetrace=True # cython: binding = True # distutils: define_macros=CYTHON_TRACE=1 # distutils: define_macros=CYTHON_TRACE_NOGIL=1
cimport cython cimport numpy as cnp import numpy as np import line_profiler ctypedef fused np_numeric_t: cnp.int8_t cnp.int16_t cnp.int32_t cnp.int64_t cnp.uint8_t cnp.uint16_t cnp.uint32_t cnp.uint64_t cnp.float32_t cnp.float64_t cnp.complex64_t cnp.complex128_t cnp.int_t cnp.long_t cnp.longlong_t cnp.uint_t cnp.ulong_t cnp.ulonglong_t cnp.intp_t cnp.uintp_t cnp.float_t cnp.double_t cnp.longdouble_t
@cython.linetrace(True) @cython.initializedcheck(False) @cython.boundscheck(False) @cython.wraparound(False) cpdef inline (int, int) band_check_internal(np_numeric_t[:, ::1]A): cdef Py_ssize_t n = A.shape[0], lower_band = 0, upper_band = 0, r, c cdef np_numeric_t zero = 0
for r in xrange(n): # Only bother if outside the existing band: for c in xrange(r-lower_band): if A[r, c] != zero: lower_band = r - c break for c in xrange(n - 1, r + upper_band, -1): if A[r, c] != zero: upper_band = c - r break return lower_band, upper_band
Final cell for use-case ---------------
# Make arbitrary lower-banded array n = 50 # array size k = 3 # k'th subdiagonal R = np.zeros([n, n], dtype=np.float32) R[[x for x in range(n)], [x for x in range(n)]] = 1 R[[x for x in range(n-1)], [x for x in range(1,n)]] = 1 R[[x for x in range(1,n)], [x for x in range(n-1)]] = 1 R[[x for x in range(k,n)], [x for x in range(n-k)]] = 2
Some very haphazardly put together metrics
%timeit band_check_internal(R) 2.59 µs ± 84.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops
each)
%timeit np.linalg.solve(R, zzz) 824 µs ± 6.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit R != 0. 1.65 µs ± 43.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops
each)
So the worst case cost is negligible in general (note that the given
code is slower as it uses the fused type however if I go with tempita standalone version is faster)
Two questions:
- This is missing np.half/float16 functionality since any arithmetic
with float16 is might not be reliable including nonzero check. IS it safe to view it as np.uint16 and use that specialization? I'm not sure about the sign bit hence the question. I can leave this out since almost all linalg suite rejects this datatype due to well-known lack of supprt.
- Should this be in NumPy or SciPy linalg? It is quite relevant to be
on SciPy but then again this stuff is purely about array structures. But if the opinion is for NumPy then I would need a volunteer because NumPy codebase flies way above my head.
All feedback welcome
Best ilhan
NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
If you're going to provide routines for structure determination it might be worth looking at algorithms that can identify more general or less obvious structure as well. SymPy's matrices module needs a lot of work and is improving a lot which will become noticeable over the next few releases but one of the important optimisations being used is Tarjan's algorithm for finding the strongly connected components of a graph. This is a generalisation of checking for triangular or diagonal matrices. With this approach you can identify any permutation of the rows and columns of a square matrix that can bring it into block triangular or block diagonal form which can reduce many O(n**3) algorithms substantially. The big-O for Tarjan's algorithm itself is basically the same as checking whether a matrix is triangular/diagonal.
For example the matrix determinant is invariant under permutations of the rows and columns. If you can permute a matrix into block triangular form then the determinant is just the product of the determinants of the diagonal blocks. If the base case algorithm has n**3 operations then reducing it to two operations of size n/2 is a speed up of ~4x. In the extreme this discovers that a matrix is triangular and reduces the whole operation to O(n) (plus the cost of Tarjan's algorithm). However the graph-based approach also benefits wider classes e.g. you get almost all the same benefit for a matrix that is almost diagonal but has a few off-diagonal elements.
Using sympy master branch (as .strongly_connected_components() is not released yet):
In [19]: from sympy import Matrix
In [20]: M = Matrix([[1, 0, 2, 0], [9, 3, 1, 2], [3, 0, 4, 0], [5, 8, 6, 7]])
In [21]: M Out[21]: ⎡1 0 2 0⎤ ⎢9 3 1 2⎥ ⎢3 0 4 0⎥ ⎣5 8 6 7⎦
In [22]: M.strongly_connected_components() # Tarjan's algorithm Out[22]: [[0, 2], [1, 3]]
In [23]: M[[0, 2, 1, 3], [0, 2, 1, 3]] # outer indexing for permutation Out[23]: ⎡1 2 0 0⎤ ⎢3 4 0 0⎥ ⎢9 1 3 2⎥ ⎣5 6 8 7⎦
In [24]: M.det() Out[24]: -10
In [25]: M[[0,2],[0,2]].det() * M[[1, 3], [1, 3]].det() Out[25]: -10
-- Oscar
On Fri, 2 Jul 2021 at 12:20, Ilhan Polat ilhanpolat@gmail.com wrote:
Ah right. So two things, the original reason f9r this question is because I can't decide in https://github.com/scipy/scipy/pull/12824 whether others would also benefit from quick structure determination.
I can keep it private function or we can put them some misc or lib folder so all can use. Say there is a special method for triangular matrices but you can't guarantee the structure so you can quickly check for it. At worst O(n**2) complexity for diagonal arrays and almost O(2n) for full arrays makes it quite appealing.
But then again maybe NumPy is a better place since probably it will be faster to have this in pure C with the right headers and without the extra Cython overhead.
Funny you mention the container idea. This is precisely what I'm doing in PR mentioned above (I'll push when I'm done). I stole the idea from Tim Davis himself in a Julia discussion for keeping the factorization as an attribute to be used later if need be. So yes it makes a lot of sense Sparse or not.
On Wed, 30 Jun 2021, 19:14 Evgeni Burovski, evgeny.burovskiy@gmail.com wrote:
Hi Ilhan,
Overall I think something like this would be great. However, I wonder if you considered having a specialized container with a structure tag instead of trying to discover the structure. If it's a container, it can neatly wrap various lapack storage schemes and dispatch to an appropriate lapack functionality. Possibly even sparse storage schemes. And it seems a bit more robust than trying to discover the structure (e.g. what about off-band elements of \sim 1e-16 etc).
The next question is of course if this should live in scipy/numpy .linalg or as a separate repo, at least for some time (maybe in the scipy organization?). So that it can iterate faster, among other things. (I'd be interested in contributing FWIW)
Cheers,
Evgeni
On Wed, Jun 30, 2021 at 1:22 AM Ilhan Polat ilhanpolat@gmail.com wrote:
Dear all,
I'm writing some helper Cythpm functions for scipy.linalg which is kinda performant and usable. And there is still quite some wiggle room for more.
In many linalg routines there is a lot of performance benefit if the structure can be discovered in a cheap and reliable way at the outset. For example if symmetric then eig can delegate to eigh or if triangular then triangular solvers can be used in linalg.solve and lstsq so forth
Here is the Cythonized version for Jupyter notebook to paste to discover the lower/upper bandwidth of square array A that competes well with A != 0 just to use some low level function (note the latter returns an array hence more cost is involved) There is a higher level supervisor function that checks C-contiguousness otherwise specializes to different versions of it
Initial cell
%load_ext Cython %load_ext line_profiler import cython import line_profiler
Then another cell
%%cython # cython: language_level=3 # cython: linetrace=True # cython: binding = True # distutils: define_macros=CYTHON_TRACE=1 # distutils: define_macros=CYTHON_TRACE_NOGIL=1
cimport cython cimport numpy as cnp import numpy as np import line_profiler ctypedef fused np_numeric_t: cnp.int8_t cnp.int16_t cnp.int32_t cnp.int64_t cnp.uint8_t cnp.uint16_t cnp.uint32_t cnp.uint64_t cnp.float32_t cnp.float64_t cnp.complex64_t cnp.complex128_t cnp.int_t cnp.long_t cnp.longlong_t cnp.uint_t cnp.ulong_t cnp.ulonglong_t cnp.intp_t cnp.uintp_t cnp.float_t cnp.double_t cnp.longdouble_t
@cython.linetrace(True) @cython.initializedcheck(False) @cython.boundscheck(False) @cython.wraparound(False) cpdef inline (int, int) band_check_internal(np_numeric_t[:, ::1]A): cdef Py_ssize_t n = A.shape[0], lower_band = 0, upper_band = 0, r, c cdef np_numeric_t zero = 0
for r in xrange(n): # Only bother if outside the existing band: for c in xrange(r-lower_band): if A[r, c] != zero: lower_band = r - c break for c in xrange(n - 1, r + upper_band, -1): if A[r, c] != zero: upper_band = c - r break return lower_band, upper_band
Final cell for use-case ---------------
# Make arbitrary lower-banded array n = 50 # array size k = 3 # k'th subdiagonal R = np.zeros([n, n], dtype=np.float32) R[[x for x in range(n)], [x for x in range(n)]] = 1 R[[x for x in range(n-1)], [x for x in range(1,n)]] = 1 R[[x for x in range(1,n)], [x for x in range(n-1)]] = 1 R[[x for x in range(k,n)], [x for x in range(n-k)]] = 2
Some very haphazardly put together metrics
%timeit band_check_internal(R) 2.59 µs ± 84.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.linalg.solve(R, zzz) 824 µs ± 6.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit R != 0. 1.65 µs ± 43.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
So the worst case cost is negligible in general (note that the given code is slower as it uses the fused type however if I go with tempita standalone version is faster)
Two questions:
This is missing np.half/float16 functionality since any arithmetic with float16 is might not be reliable including nonzero check. IS it safe to view it as np.uint16 and use that specialization? I'm not sure about the sign bit hence the question. I can leave this out since almost all linalg suite rejects this datatype due to well-known lack of supprt.
Should this be in NumPy or SciPy linalg? It is quite relevant to be on SciPy but then again this stuff is purely about array structures. But if the opinion is for NumPy then I would need a volunteer because NumPy codebase flies way above my head.
All feedback welcome
Best ilhan
NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
Yes they go by the name of morally triangular matrices (quite a stupid name but in their defense I think it was an insider joke) this is also given in Tim Davis' book as an exercise via linked lists. The issue is that LAPACK doesn't support these permuted matrices. Hence we are left with two options
Either copy paste row/columns so that the array stays contiguous or permute a copy of the array. Both can be significant cost while trying to shave off solving time.
But you are right this can be present even though solvers and eig routines won't use it. I'll put my Cython code back in.
On Fri, 2 Jul 2021, 20:05 Oscar Benjamin, oscar.j.benjamin@gmail.com wrote:
If you're going to provide routines for structure determination it might be worth looking at algorithms that can identify more general or less obvious structure as well. SymPy's matrices module needs a lot of work and is improving a lot which will become noticeable over the next few releases but one of the important optimisations being used is Tarjan's algorithm for finding the strongly connected components of a graph. This is a generalisation of checking for triangular or diagonal matrices. With this approach you can identify any permutation of the rows and columns of a square matrix that can bring it into block triangular or block diagonal form which can reduce many O(n**3) algorithms substantially. The big-O for Tarjan's algorithm itself is basically the same as checking whether a matrix is triangular/diagonal.
For example the matrix determinant is invariant under permutations of the rows and columns. If you can permute a matrix into block triangular form then the determinant is just the product of the determinants of the diagonal blocks. If the base case algorithm has n**3 operations then reducing it to two operations of size n/2 is a speed up of ~4x. In the extreme this discovers that a matrix is triangular and reduces the whole operation to O(n) (plus the cost of Tarjan's algorithm). However the graph-based approach also benefits wider classes e.g. you get almost all the same benefit for a matrix that is almost diagonal but has a few off-diagonal elements.
Using sympy master branch (as .strongly_connected_components() is not released yet):
In [19]: from sympy import Matrix
In [20]: M = Matrix([[1, 0, 2, 0], [9, 3, 1, 2], [3, 0, 4, 0], [5, 8, 6, 7]])
In [21]: M Out[21]: ⎡1 0 2 0⎤ ⎢9 3 1 2⎥ ⎢3 0 4 0⎥ ⎣5 8 6 7⎦
In [22]: M.strongly_connected_components() # Tarjan's algorithm Out[22]: [[0, 2], [1, 3]]
In [23]: M[[0, 2, 1, 3], [0, 2, 1, 3]] # outer indexing for permutation Out[23]: ⎡1 2 0 0⎤ ⎢3 4 0 0⎥ ⎢9 1 3 2⎥ ⎣5 6 8 7⎦
In [24]: M.det() Out[24]: -10
In [25]: M[[0,2],[0,2]].det() * M[[1, 3], [1, 3]].det() Out[25]: -10
-- Oscar
On Fri, 2 Jul 2021 at 12:20, Ilhan Polat ilhanpolat@gmail.com wrote:
Ah right. So two things, the original reason f9r this question is
because I can't decide in https://github.com/scipy/scipy/pull/12824 whether others would also benefit from quick structure determination.
I can keep it private function or we can put them some misc or lib
folder so all can use. Say there is a special method for triangular matrices but you can't guarantee the structure so you can quickly check for it. At worst O(n**2) complexity for diagonal arrays and almost O(2n) for full arrays makes it quite appealing.
But then again maybe NumPy is a better place since probably it will be
faster to have this in pure C with the right headers and without the extra Cython overhead.
Funny you mention the container idea. This is precisely what I'm doing
in PR mentioned above (I'll push when I'm done). I stole the idea from Tim Davis himself in a Julia discussion for keeping the factorization as an attribute to be used later if need be. So yes it makes a lot of sense Sparse or not.
On Wed, 30 Jun 2021, 19:14 Evgeni Burovski, evgeny.burovskiy@gmail.com
wrote:
Hi Ilhan,
Overall I think something like this would be great. However, I wonder if you considered having a specialized container with a structure tag instead of trying to discover the structure. If it's a container, it can neatly wrap various lapack storage schemes and dispatch to an appropriate lapack functionality. Possibly even sparse storage schemes. And it seems a bit more robust than trying to discover the structure (e.g. what about off-band elements of \sim 1e-16 etc).
The next question is of course if this should live in scipy/numpy .linalg or as a separate repo, at least for some time (maybe in the scipy organization?). So that it can iterate faster, among other things. (I'd be interested in contributing FWIW)
Cheers,
Evgeni
On Wed, Jun 30, 2021 at 1:22 AM Ilhan Polat ilhanpolat@gmail.com
wrote:
Dear all,
I'm writing some helper Cythpm functions for scipy.linalg which is
kinda performant and usable. And there is still quite some wiggle room for more.
In many linalg routines there is a lot of performance benefit if the
structure can be discovered in a cheap and reliable way at the outset. For example if symmetric then eig can delegate to eigh or if triangular then triangular solvers can be used in linalg.solve and lstsq so forth
Here is the Cythonized version for Jupyter notebook to paste to
discover the lower/upper bandwidth of square array A that competes well with A != 0 just to use some low level function (note the latter returns an array hence more cost is involved) There is a higher level supervisor function that checks C-contiguousness otherwise specializes to different versions of it
Initial cell
%load_ext Cython %load_ext line_profiler import cython import line_profiler
Then another cell
%%cython # cython: language_level=3 # cython: linetrace=True # cython: binding = True # distutils: define_macros=CYTHON_TRACE=1 # distutils: define_macros=CYTHON_TRACE_NOGIL=1
cimport cython cimport numpy as cnp import numpy as np import line_profiler ctypedef fused np_numeric_t: cnp.int8_t cnp.int16_t cnp.int32_t cnp.int64_t cnp.uint8_t cnp.uint16_t cnp.uint32_t cnp.uint64_t cnp.float32_t cnp.float64_t cnp.complex64_t cnp.complex128_t cnp.int_t cnp.long_t cnp.longlong_t cnp.uint_t cnp.ulong_t cnp.ulonglong_t cnp.intp_t cnp.uintp_t cnp.float_t cnp.double_t cnp.longdouble_t
@cython.linetrace(True) @cython.initializedcheck(False) @cython.boundscheck(False) @cython.wraparound(False) cpdef inline (int, int) band_check_internal(np_numeric_t[:, ::1]A): cdef Py_ssize_t n = A.shape[0], lower_band = 0, upper_band = 0,
r, c
cdef np_numeric_t zero = 0 for r in xrange(n): # Only bother if outside the existing band: for c in xrange(r-lower_band): if A[r, c] != zero: lower_band = r - c break for c in xrange(n - 1, r + upper_band, -1): if A[r, c] != zero: upper_band = c - r break return lower_band, upper_band
Final cell for use-case ---------------
# Make arbitrary lower-banded array n = 50 # array size k = 3 # k'th subdiagonal R = np.zeros([n, n], dtype=np.float32) R[[x for x in range(n)], [x for x in range(n)]] = 1 R[[x for x in range(n-1)], [x for x in range(1,n)]] = 1 R[[x for x in range(1,n)], [x for x in range(n-1)]] = 1 R[[x for x in range(k,n)], [x for x in range(n-k)]] = 2
Some very haphazardly put together metrics
%timeit band_check_internal(R) 2.59 µs ± 84.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops
each)
%timeit np.linalg.solve(R, zzz) 824 µs ± 6.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops
each)
%timeit R != 0. 1.65 µs ± 43.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops
each)
So the worst case cost is negligible in general (note that the given
code is slower as it uses the fused type however if I go with tempita standalone version is faster)
Two questions:
- This is missing np.half/float16 functionality since any arithmetic
with float16 is might not be reliable including nonzero check. IS it safe to view it as np.uint16 and use that specialization? I'm not sure about the sign bit hence the question. I can leave this out since almost all linalg suite rejects this datatype due to well-known lack of supprt.
- Should this be in NumPy or SciPy linalg? It is quite relevant to
be on SciPy but then again this stuff is purely about array structures. But if the opinion is for NumPy then I would need a volunteer because NumPy codebase flies way above my head.
All feedback welcome
Best ilhan
NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion
NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion