should a benchmark be meaningful and assert accuracy?
https://github.com/scipy/scipy/issues/18568
https://github.com/scipy/scipy/pull/18954

Update/rewrite sparse_linalg_lobpcg.py to make it more reasonable, and raise a discussion on whether a benchmark needs to make a meaningful comparison and assert accuracy, or just barely run the code for timing even if the code produces garbage and the comparison baseline is irrelevant.

In my view, a benchmark is a tool for a user to choose a better-performing solver. If a benchmark is meaningless, computing garbage because the solver is called in a wrong way or the comparison baseline is wrong, it fails this goal; that is the issue with the present benchmark.

It is also a tool for a developer to validate that changes in the code do not degrade performance. For iterative solvers, performance is determined by the total execution time, which is the time per iteration times the number of iterations needed to reach the requested accuracy. The present benchmark does not check the accuracy, so it fails this goal too, i.e., it is quite useless and misleading. The property that, for a known problem, an iterative solver reaches a guaranteed accuracy in a known, fixed number of iterations is desired but not given, and must be checked and rechecked with every change of the code.
Hi Andrew, Thanks for bringing this up. On Wed, Jul 26, 2023 at 8:04 PM Knyazev, Andrew <Andrew.Knyazev@ucdenver.edu> wrote:
https://github.com/scipy/scipy/issues/18568
https://github.com/scipy/scipy/pull/18954
Update/rewrite sparse_linalg_lobpcg.py to make it more reasonable
raise a discussion if a benchmark needs to make a meaningful comparison and assert accuracy or just barely run the code for timing even if the code produces garbage and the comparison baseline is irrelevant.
The answer is somewhere in between these two extremes. A benchmark should be *meaningful*, at least in the sense that whatever code snippet and input parameters are being benchmarked should be reflective of what is relevant for real-world code. Example: if for some linalg API the key input shapes are tall and square matrices, benchmark those and not wide matrices.

It can/should be done with generated random data (as long as the structure is somewhat realistic, of course; if you need block structure you can't just call `np.random.rand`), as it's typically not worth using data files with real data (images or whatever).

That said, the code being benchmarked should be small and correct. That correctness should be checked once, and if there's doubt whether the accuracy of something like an iterative solver is stable, a test case with the same calls should be added to the test suite. It's not desirable for the benchmarks to start doubling as pytest-like test cases.
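The point about structure-aware generated data can be sketched as follows; this is a minimal illustration with made-up sizes and density, using `scipy.sparse.random` and `scipy.sparse.block_diag` to stand in for whatever structure a real workload has:

```python
import numpy as np
from scipy.sparse import random as sparse_random, block_diag

# Sketch: generated benchmark input whose *structure* matches the workload.
# A plain np.random.rand array would miss the block-sparsity pattern that a
# real block-structured problem has, so assemble sparse blocks instead.
rng = np.random.default_rng(0)
blocks = [sparse_random(50, 50, density=0.05, random_state=rng)
          for _ in range(4)]
A = block_diag(blocks, format="csr")  # 200x200 block-diagonal input

print(A.shape, A.nnz)
```

The data is still random, but a solver that exploits the block-diagonal sparsity pattern sees a realistic input.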
In my view, a benchmark is a tool for a user to choose a better performing solver. If a benchmark is meaningless computing garbage because the solver is called in a wrong way or the comparison baseline is wrong, it fails this goal - the issue with the present benchmark.
Indeed. I hope what I wrote above clarifies that.
It is also a tool for a developer to validate that changes in a code do not degrade performance.
Indeed, that is why we have `bench --compare`, and we can run it across commits or releases to automatically identify regressions in runtime for specific benchmarks.
For iterative solvers, performance is determined by a total execution time, which is the time per iteration times the number of iterations needed to get the requested accuracy. The present benchmark does not check the accuracy, so fails this goal too, i.e., is quite useless and misleading. The property that for a known problem a code of an iterative solver gives a guaranteed accuracy in a known fixed number of iterations is desired but not given and must be checked and rechecked with every change of the code.
Measuring total runtime is for the benchmark suite, while measuring whether a given absolute/relative tolerance level is achieved, or whether a solution is obtained at all in <=N iterations, is for the test suite. We have quite a few test cases like that; you can just assert there with something like `solution.n_iter <= 25`. The test suite is run way more often and on a much wider range of platforms, so it is a much better place for that type of check.

Cheers,
Ralf
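A concrete illustration of the kind of test-suite check described here might look like the sketch below. The problem, tolerances, and iteration bound are invented for illustration, not taken from the PR; `lobpcg`'s `retResidualNormsHistory` flag is used to recover an iteration count:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import lobpcg

def test_lobpcg_accuracy_and_iterations():
    # Known problem: diag(1..100) has smallest eigenvalues 1, 2, 3.
    n = 100
    A = diags(np.arange(1, n + 1, dtype=float))
    rng = np.random.default_rng(0)
    X = rng.standard_normal((n, 3))
    w, _, resnorms = lobpcg(A, X, tol=1e-8, maxiter=200, largest=False,
                            retResidualNormsHistory=True)
    # Accuracy assert, plus an iteration-count bound in the spirit of
    # `solution.n_iter <= 25` (the bound here is deliberately loose).
    np.testing.assert_allclose(np.sort(w), [1.0, 2.0, 3.0], atol=1e-5)
    assert len(resnorms) <= 200

test_lobpcg_accuracy_and_iterations()
```

Such a test is cheap enough to run on every CI platform, unlike a large benchmark.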
Ralf Gommers wrote:
Hi Andrew, Thanks for bringing this up. On Wed, Jul 26, 2023 at 8:04 PM Knyazev, Andrew Andrew.Knyazev@ucdenver.edu wrote:
For iterative solvers, performance is determined by a total execution time, which is the time per iteration times the number of iterations needed to get the requested accuracy. The present benchmark does not check the accuracy, so fails this goal too, i.e., is quite useless and misleading. The property that for a known problem a code of an iterative solver gives a guaranteed accuracy in a known fixed number of iterations is desired but not given and must be checked and rechecked with every change of the code.

Measuring total runtime is for the benchmark suite, while measuring whether a given absolute/relative tolerance level is achieved or a solution is obtained at all in <=N iterations is for the test suite. We have quite a few test cases like that; you can just assert there with something like `solution.n_iter <= 25`. The test suite is run way more often and on a much wider range of platforms, so it is a much better place for that type of check. Cheers, Ralf
Ralf, thanks for your quick response! I agree with you in principle, of course. However, the specific benchmark I am editing, https://github.com/scipy/scipy/pull/18954/ (search ASV for "sakurai" and "mikota"), has its peculiarities that make the standard approach impractical:

- One of the solvers under the benchmark comparison, eigsh, has an internal check for convergence and fails if there is no convergence. The other, lobpcg, only gives a warning if the accuracy is not reached but is designed to never raise an error. The third solver is dense, so it supposedly just never fails. An external assert is needed to make these solvers comparable in a benchmark.
- Some of the benchmark tests take a long time to run because the matrix size needs to be large enough to make the benchmark meaningful. These runs are simply too long for the test suite, so their accuracy cannot be tested there.
- The actual benchmarks in this PR are currently tuned such that there are no actual failures with the current versions of the solvers benchmarked.

Practically speaking, we have a rather philosophical discussion, like "to be or not to be", on whether a single-line assert that currently has no effect can be present in this benchmark in this specific PR, or should not be allowed as a matter of principle.
On Thu, Aug 24, 2023 at 4:58 PM Andrew Knyazev <andrew.knyazev@ucdenver.edu> wrote:
Ralf, thanks for your quick response! I agree with you in principle, of course. However, the specific benchmark I am editing https://github.com/scipy/scipy/pull/18954/ (search ASV for "sakurai" and "mikota") has its peculiarities that make the standard approach impractical: - One of the solvers under the benchmark comparison, eigsh, has an internal check for convergence and fails if no convergence. The other, lobpcg, only gives a warning if the accuracy is not reached but is designed to never give an error. The third solver is dense so supposedly just never fails. An external assert is needed to make these solvers comparable in a benchmark.
An assert will never *make* the solvers comparable. Correctly configuring each of the solvers to do the equivalent task makes them comparable.
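A sketch of what "correctly configuring each solver for the equivalent task" could look like, on a toy problem with a known answer (the matrix, sizes, and tolerances here are invented for illustration; the PR's actual sakurai/mikota setups differ):

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import eigsh, lobpcg

# Toy problem with a known answer: diag(1..500), smallest 3 eigenvalues.
n, k = 500, 3
A = diags(np.arange(1, n + 1, dtype=float)).tocsc()
exact = np.array([1.0, 2.0, 3.0])

# eigsh configured for the smallest eigenvalues via shift-invert.
w_eigsh = np.sort(eigsh(A, k=k, sigma=0, return_eigenvectors=False))

# lobpcg configured for the same task.
rng = np.random.default_rng(42)
X = rng.standard_normal((n, k))
w_lobpcg, _ = lobpcg(A, X, tol=1e-6, maxiter=500, largest=False)
w_lobpcg = np.sort(w_lobpcg)

# External check that both solvers actually solved the same problem to
# comparable accuracy -- only then is timing them against each other fair.
np.testing.assert_allclose(w_eigsh, exact, atol=1e-6)
np.testing.assert_allclose(w_lobpcg, exact, atol=1e-4)
```

The final asserts verify that the two configurations produce equivalent answers; they do not create that equivalence, which is Robert's point.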
- Some of the benchmark tests take a long time to run because the matrix size needs to be large enough to make the benchmark meaningful. These runs are simply too long for the test suite, so their accuracy cannot be tested there.
We can mark tests as slow or even extra-slow. If we *need* a slow test to ensure that the solvers are correct, then this is the way to do it. We do not run the benchmarks to find errors in the library.

https://numpy.org/devdocs/reference/testing.html#labeling-tests
https://docs.scipy.org/doc/scipy/dev/contributor/devpy_test.html#tips

- The actual benchmarks in this PR are currently tuned such that there are no actual failures with the current versions of the solvers benchmarked. Practically speaking, we have a rather philosophical discussion like "to be or not to be" on whether a single line assert that currently has no effect can be present in this benchmark in this specific PR or should not be allowed as a matter of principle.
We need to be clear on what it's for. If, as you've stated, it's to ensure that the *solvers* are correct, then no, it doesn't belong in the benchmark. That's a deficiency in the test suite, and appropriate tests need to be filled out there. It's not that we object to the assert in the benchmark per se; it's that we object to not fixing the hole in the test suite that you claim is there.

If, on the other hand, the purpose of the assert is to ensure that the *benchmark* remains valid (i.e. that all of the individual settings to each solver are such that they are apples-to-apples), then that's fine, IMO. When the assert in the benchmark fails, where do you think you will make fixes, in `scipy/sparse/` or `benchmarks/`? If the latter, then it's fine. If it's the former, then add it to the test suite.

--
Robert Kern
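An ASV-style sketch of the kind of benchmark-validity assert under discussion; the class name, problem, and tolerances are invented for illustration and are not the PR's actual code:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import lobpcg

class TimeLobpcg:
    """ASV-style benchmark with a one-line validity assert."""

    def setup(self):
        n = 500
        self.A = diags(np.arange(1, n + 1, dtype=float)).tocsr()
        rng = np.random.default_rng(0)
        self.X = rng.standard_normal((n, 4))

    def time_smallest(self):
        w, _ = lobpcg(self.A, self.X, tol=1e-6, maxiter=500, largest=False)
        # Validity check: if the solver settings drift and the result is
        # garbage, the benchmark errors out instead of silently timing
        # nonsense. A failure here is fixed in benchmarks/, not in the solver.
        assert np.allclose(np.sort(w), [1.0, 2.0, 3.0, 4.0], atol=1e-4)
```

When such an assert fires, the fix belongs in the benchmark configuration, matching the distinction drawn above.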
@Robert thanks for your comments! I agree with you as far as I can see.

- Yes, the proposed purpose of the assert is to ensure that the *benchmark* remains valid (i.e. that all of the individual settings to each solver are such that they are apples-to-apples). Many current benchmarks involving lobpcg are invalid in this sense; e.g., lobpcg produces garbage without failing. The proposed simple assert serves as a sanity check, forcing the contributor to tune the benchmark example and the parameters of the solvers under benchmarking.
- No, adding an accuracy assert in benchmarks is not a substitute for having a similar (but preferably smaller-scale, i.e., faster) run in the unit tests. That is why I place the code generating the examples for benchmarking outside of the benchmarks, so it is also callable from the unit tests. (Some of these examples might then deserve a public API, but that is a separate discussion that can be postponed.) Currently, there appears to be a disconnect between benchmarking and unit testing, at least for eigenvalue solvers.
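The "example generator callable from both benchmarks and tests" idea might look like the sketch below. The function name and matrix values are mine, merely in the spirit of the Sakurai test problem, not the PR's actual generators:

```python
import numpy as np
from scipy.sparse import diags

def sakurai_like(n):
    """Pentadiagonal example matrix, loosely modeled on the Sakurai
    test problem. Defined outside the benchmark module so that both the
    benchmark suite (large n) and the unit tests (small n) can call it."""
    d0 = np.full(n, 6.0)
    d0[0] = d0[-1] = 5.0
    d1 = np.full(n - 1, -4.0)
    d2 = np.ones(n - 2)
    return diags([d2, d1, d0, d1, d2], [-2, -1, 0, 1, 2]).tocsr()

A = sakurai_like(50)
# Symmetry is what the eigenvalue benchmarks rely on; check it once here.
assert (A != A.T).nnz == 0
```

A small-n instance of the same generator can then anchor an accuracy test in the test suite, while the benchmark times a large-n instance of the identical construction.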
participants (4):

- Andrew Knyazev
- Knyazev, Andrew
- Ralf Gommers
- Robert Kern