Hi Andrew,

Thanks for bringing this up.


On Wed, Jul 26, 2023 at 8:04 PM Knyazev, Andrew <Andrew.Knyazev@ucdenver.edu> wrote:

https://github.com/scipy/scipy/issues/18568

https://github.com/scipy/scipy/pull/18954

Update/rewrite sparse_linalg_lobpcg.py to make it more reasonable

raise a discussion of whether a benchmark needs to make a meaningful comparison and assert accuracy, or should just barely run the code for timing even if the code produces garbage and the comparison baseline is irrelevant.


The answer is somewhere in between these two extremes. A benchmark should be *meaningful*, at least in the sense that the code snippet and input parameters being benchmarked reflect what is relevant for real-world code. Example: if for some linalg API the key input shapes are tall and square matrices, benchmark those and not wide matrices. This can and should be done with generated random data, as long as the structure is somewhat realistic of course (if you need block structure, you can't just call `np.random.rand`); it's typically not worth using data files with real data (images or whatever).
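As a concrete sketch of the "structured random data" point: the snippet below builds a symmetric positive-definite block-diagonal sparse matrix as benchmark input instead of calling `np.random.rand`. The function name and parameters are illustrative, not part of any existing benchmark.

```python
# Hedged sketch: generate structured random input for a benchmark.
# A plain np.random.rand matrix has no block structure; here we build a
# symmetric positive-definite block-diagonal sparse matrix instead.
import numpy as np
from scipy.sparse import block_diag, random as sparse_random

rng = np.random.default_rng(42)  # fixed seed: benchmark inputs must be reproducible

def make_block_matrix(n_blocks=4, block_size=50, density=0.1):
    """Illustrative SPD block-diagonal test matrix (names are made up)."""
    blocks = []
    for _ in range(n_blocks):
        b = sparse_random(block_size, block_size, density=density,
                          random_state=rng).toarray()
        # b @ b.T is symmetric PSD; the diagonal shift makes it definite
        blocks.append(b @ b.T + block_size * np.eye(block_size))
    return block_diag(blocks, format="csr")

A = make_block_matrix()  # 200 x 200 sparse matrix with 4 dense blocks
```

The fixed seed matters: a benchmark that draws fresh random data each run will show spurious timing noise between runs.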

That said, the code being benchmarked should be small and correct. That correctness should be checked once, and if there's doubt about whether the accuracy of something like an iterative solver is stable, a test case with the same calls should be added to the test suite. It's not desirable for the benchmarks to start doubling as pytest-style test cases.
 

In my view, a benchmark is a tool for a user to choose a better performing solver. If a benchmark meaninglessly computes garbage because the solver is called in a wrong way, or the comparison baseline is wrong, it fails this goal - which is the issue with the present benchmark.


Indeed. I hope what I wrote above clarifies that.
 

It is also a tool for a developer to validate that changes to the code do not degrade performance.


Indeed, that is why we have `bench --compare`; we can run it across commits or releases to automatically identify runtime regressions in specific benchmarks.
 

For iterative solvers, performance is determined by total execution time, which is the time per iteration times the number of iterations needed to reach the requested accuracy. The present benchmark does not check the accuracy, so it fails this goal too, i.e., it is quite useless and misleading. The property that, for a known problem, the code of an iterative solver gives a guaranteed accuracy in a known, fixed number of iterations is desired but not given, and must be checked and rechecked with every change to the code.


Measuring total runtime is for the benchmark suite, while measuring whether a given absolute/relative tolerance level is achieved, or whether a solution is obtained at all in <=N iterations, is for the test suite. We have quite a few test cases like that; you can just assert there with something like `solution.n_iter <= 25`. The test suite is run far more often and on a much wider range of platforms, so it is a much better place for that type of check.
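To make that concrete, here is a hedged sketch of what such a test-suite check could look like for `lobpcg` (the matrix, tolerances, and iteration bound are all illustrative, not an existing SciPy test): it asserts both accuracy against known eigenvalues and an iteration budget, which is exactly what should stay out of the benchmark.

```python
# Hedged sketch of a test-suite check for an iterative solver:
# assert accuracy AND an iteration budget, separately from any benchmark.
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import lobpcg

def test_lobpcg_iteration_budget():
    n = 100
    A = diags(np.arange(1.0, n + 1))   # diagonal matrix with known spectrum 1..n
    rng = np.random.default_rng(0)     # fixed seed so the test is deterministic
    X = rng.standard_normal((n, 3))    # 3 random starting vectors
    w, v, rnorms = lobpcg(A, X, tol=1e-6, maxiter=100,
                          largest=False, retResidualNormsHistory=True)
    # accuracy: the three smallest eigenvalues of diag(1..n) are 1, 2, 3
    assert np.allclose(np.sort(w), [1.0, 2.0, 3.0], atol=1e-4)
    # accuracy: residual A v - v w is small for the returned eigenpairs
    assert np.linalg.norm(A @ v - v * w) < 1e-3
    # iteration budget, analogous to `solution.n_iter <= 25`: the residual
    # history has one entry per iteration (plus possibly the initial one)
    assert len(rnorms) <= 101
```

A check like this catches both silent accuracy loss and convergence slowdowns on every CI run, while the benchmark stays a pure timing measurement.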

Cheers,
Ralf