the state of scipy unit tests
In the past, I would always run 'nosetests scipy' before committing changes to SVN. Due to the current state of the unit tests, I don't anymore, and I suspect I'm not alone.

Here are the main offenders on my system:

scipy.stats
I appreciate the fact that rigorous testing on this module takes time, but 4 minutes on a 2.4GHz Core 2 system is unreasonable. IMO 20 seconds is a reasonable upper bound. Essential tests that don't meet this time constraint should be filtered out of the default test suite.

scipy.weave
Takes 2.5 minutes and litters my screen with a few hundred lines like,
/tmp/tmpcRI5WR/sc_3a5c7ad3ac45a98d03cd9168232f7d8f1.cpp:618: warning: deprecated conversion from
and dumps a bunch of .cpp and .so files in the current working directory.

Some minor offenders:

scipy.interpolate
Emits several UserWarnings (should be filtered out with warnings.filterwarnings).

scipy.io
Several DeprecationWarnings

scipy.lib
A dozen lines like "zcopy:n=3"

scipy.linalg
Outputs ATLAS info and a dozen lines like "zcopy:n=3".

I'd like to be able to run the entire battery of tests in about a minute with minimal unnecessary output.

-- 
Nathan Bell wnbell@gmail.com http://graphics.cs.uiuc.edu/~wnbell/
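As a rough illustration of the warning filtering suggested for scipy.interpolate, here is a minimal sketch using the standard library's warnings module. The function noisy_interpolation is a hypothetical stand-in for a routine that warns, not actual scipy code.

```python
import warnings


def noisy_interpolation(x):
    # Hypothetical stand-in for a routine that emits a UserWarning.
    warnings.warn("extrapolating outside the data range", UserWarning)
    return 2 * x


def call_quietly(func, *args):
    """Call func with UserWarnings ignored, then restore the old filters."""
    with warnings.catch_warnings():
        # The filter only applies inside this block.
        warnings.filterwarnings("ignore", category=UserWarning)
        return func(*args)
```

Scoping the filter with catch_warnings keeps the suppression local to the test, so warnings raised elsewhere in the suite are still visible.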
On Sun, Nov 23, 2008 at 6:56 PM, Nathan Bell <wnbell@gmail.com> wrote:
scipy.stats. I appreciate the fact that rigorous testing on this module takes time, but 4 minutes on a 2.4GHz Core 2 system is unreasonable. IMO 20 seconds is a reasonable upper bound. Essential tests that don't meet this time constraint should be filtered out of the default test suite.
I agree. Josef asked whether he should reduce the number of tests run by default and I (perhaps mistakenly) said that he should focus on fixing broken code and writing new tests before we released the first beta. I was thinking that it would be best to have as many tests as possible run by default for the beta release. It would probably be better to enable all the tests by default for tagged beta and rc releases, but not the development trunk. Ideas? As soon as the beta is released, we should focus on reducing the time required to run the default test suite.
scipy.weave Takes 2.5 minutes and litters my screen with a few hundred lines like, /tmp/tmpcRI5WR/sc_3a5c7ad3ac45a98d03cd9168232f7d8f1.cpp:618: warning: deprecated conversion from and dumps a bunch of .cpp and .so files in the current working directory.
I find that annoying too.
I'd like to be able to run the entire battery of tests in about a minute with minimal unnecessary output.
+1, let's make this a focus for post-beta1 attention. Thanks, -- Jarrod Millman Computational Infrastructure for Research Labs 10 Giannini Hall, UC Berkeley phone: 510.643.4014 http://cirl.berkeley.edu/
On Sun, Nov 23, 2008 at 9:56 PM, Nathan Bell <wnbell@gmail.com> wrote:
In the past, I would always run 'nosetests scipy' before committing changes to SVN. Due to the current state of the unit tests, I don't anymore, and I suspect I'm not alone.
Here are the main offenders on my system:
scipy.stats. I appreciate the fact that rigorous testing on this module takes time, but 4 minutes on a 2.4GHz Core 2 system is unreasonable. IMO 20 seconds is a reasonable upper bound. Essential tests that don't meet this time constraint should be filtered out of the default test suite.
I agree that it is pretty painful. I usually just run nosetests at the module or package level, e.g. 'nosetests scipy.stats', before committing, and specific test files while correcting individual functions. For my distributions tests, I use additional tests that are renamed in svn so that nose doesn't pick them up. Is it possible to use an exclude option with nose that excludes, for example, all tests in scipy.stats or specific test files?

My problem, which I already raised once on the mailing list, is that I am now testing essentially all methods of close to 100 distributions, some of which require a lot of numerical integration and optimization. I wrote the tests pretty fast, for bug hunting and to get one thorough round of testing during the next beta release. But for everyday usage they are too much.

I haven't done any profiling to see which distributions are the worst offenders, and since there are so many distributions and all tests are generators, it is difficult to special-case individual time-consuming methods and distributions.

Another problem is tests based on random numbers: if the sample size and power of the statistical tests are too small (as was the case in scipy until a few months ago), they don't catch many bugs; if the statistical tests are to have some power, they require larger samples and more calculation.

My initial attempts to use decorators were not very successful, since nose doesn't allow decorating test generators. One option would be to label most of my test functions as slow, but I haven't tried this yet. In the old test system, it was possible to assign levels to the tests. I don't know if or how it is possible to label my tests so that a few basic ones are run at a low level and the other ones only at higher levels.

Triaging my tests will be quite a bit of work, but the short-term solution is to find a way to exclude most of them for everyday use while keeping them available for beta testing.

BTW, is there a way to profile the tests themselves (the tests yielded by a generator, not the test function)?

Josef
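One possible answer to the profiling question, sketched without any nose machinery: iterate over the generator by hand and time each yielded case. The names generate_cases and check_positive are hypothetical, and the sketch assumes the nose convention that each yielded item is a tuple of a callable followed by its arguments.

```python
import time


def check_positive(value):
    # Hypothetical check standing in for one distribution test case.
    assert value > 0


def generate_cases():
    # nose-style generator: each yielded tuple is (callable, *args).
    for value in (1, 2, 3):
        yield check_positive, value


def time_yielded_cases(generator_func):
    """Run every case a test generator yields and record its wall time."""
    timings = []
    for case in generator_func():
        func, args = case[0], case[1:]
        start = time.perf_counter()
        func(*args)
        elapsed = time.perf_counter() - start
        timings.append((func.__name__, args, elapsed))
    # Slowest cases first, so the worst offenders are easy to spot.
    timings.sort(key=lambda item: item[2], reverse=True)
    return timings
```

Calling time_yielded_cases(generate_cases) returns one (name, args, seconds) entry per yielded case, which is enough to see which distributions or methods dominate the run time.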
On Sun, Nov 23, 2008 at 7:55 PM, <josef.pktd@gmail.com> wrote:
My initial attempts to use decorators were not very successful, since nose doesn't allow to decorate test generators. One option would be to label most of my test functions with slow, but I haven't tried this yet. In the old test system, it was possible to assign levels to the tests. I don't know if or how it is possible to label my tests so that a few basic ones are run on a low level and the other ones only at higher levels.
Fernando Perez wrote some code to allow you to decorate test generators (he mentioned it in an earlier thread, but I don't think we followed up on it). He also raised the question about this at a Baypiggies meeting and Alex Martelli blogged about his thoughts here: http://aleaxit.blogspot.com/2008/11/python-introspecting-for-generator-vs.ht... I'll talk to Fernando tomorrow and make sure we follow-up on this. -- Jarrod Millman Computational Infrastructure for Research Labs 10 Giannini Hall, UC Berkeley phone: 510.643.4014 http://cirl.berkeley.edu/
Howdy, On Sun, Nov 23, 2008 at 8:25 PM, Jarrod Millman <millman@berkeley.edu> wrote:
On Sun, Nov 23, 2008 at 7:55 PM, <josef.pktd@gmail.com> wrote:
My initial attempts to use decorators were not very successful, since nose doesn't allow to decorate test generators. One option would be to label most of my test functions with slow, but I haven't tried this yet. In the old test system, it was possible to assign levels to the tests. I don't know if or how it is possible to label my tests so that a few basic ones are run on a low level and the other ones only at higher levels.
Fernando Perez wrote some code to allow you to decorate test generators (he mentioned it in an earlier thread, but I don't think we followed up on it). He also raised the question about this at a Baypiggies meeting and Alex Martelli blogged about his thoughts here: http://aleaxit.blogspot.com/2008/11/python-introspecting-for-generator-vs.ht...
I'll talk to Fernando tomorrow and make sure we follow-up on this.
Sorry, I got busy with other things. Here's the diff for decorators, with an implementation that works with generators and also allows the test condition to be a callable (very useful for conditions that you want to evaluate only at suite run time, not at import time). I hadn't sent it because I wanted to polish it and write some tests for it, but here it is for now. I also included a patch for the verbosity problem: the issue is that we're hardcoding '-s' in the test runner, which suppresses stdout capture. This should instead be an option for the user (like test(capture=False)). That diff just disables -s, so it's not finished, but I don't have time right now to implement the complete solution. At least I hope pointing in the right direction will be useful if someone else can finish. Cheers, f
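The attached diff is not reproduced in the archive. As an illustration only, and not the patch itself, a skip decorator that handles both plain functions and generators and accepts a callable condition might look like this sketch (the name skipif and its signature are assumptions made for the example):

```python
import functools
import inspect
import unittest


def skipif(condition, msg="skipped"):
    """Skip a test when condition (a bool or zero-argument callable) holds.

    Illustrative sketch only; not the patch attached to this thread.
    """
    def should_skip():
        # Callables are evaluated lazily, at run time rather than import time.
        return condition() if callable(condition) else condition

    def decorator(func):
        if inspect.isgeneratorfunction(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                if should_skip():
                    raise unittest.SkipTest(msg)
                # Re-yield every case so the runner still sees a generator.
                yield from func(*args, **kwargs)
        else:
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                if should_skip():
                    raise unittest.SkipTest(msg)
                return func(*args, **kwargs)
        return wrapper
    return decorator
```

The key design point is that the wrapper for a generator must itself be a generator; otherwise a runner that introspects for generator functions would treat the decorated test as an ordinary function.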
On Sun, Nov 23, 2008 at 10:31 PM, Fernando Perez <fperez.net@gmail.com> wrote:
Sorry, I got busy with other things. Here's the diff for decorators, with an implementation that works with generators and also allows the test condition to be a callable (very useful for conditions that you want to evaluate only at suite run time, not at import time). I hadn't sent it because I wanted to polish it and write some tests for it, but here it is for now.
Thanks, I created a ticket and attached your patch: http://scipy.org/scipy/numpy/ticket/957
I also included a patch for the verbosity problem: the issue is that we're hardcoding '-s' in the test runner, which suppresses stdout capture. This should instead be an option for the user (like test(capture=False)). That diff just disables -s, so it's not finished, but I don't have time right now to implement the complete solution. At least I hope pointing in the right direction will be useful if someone else can finish.
Currently, when running scipy.test('full') there is a large amount of information printed to the screen. Presumably, this information is being printed out because the test writer is using it for debugging. Your patch (to remove the '-s' option) will help in this respect, but we will need to do more. Just to state my goal: I would like to change scipy.test so that it behaves more like numpy.test:

In [1]: numpy.test('full')
Running unit tests for numpy
NumPy version 1.3.0.dev6099
NumPy is installed in /home/jarrod/usr/local/lib64/python2.5/site-packages/numpy
Python version 2.5.1 (r251:54863, Jul 10 2008, 17:25:56) [GCC 4.1.2 20070925 (Red Hat 4.1.2-33)]
nose version 0.10.3
..............<dots repeat>.........K......<dots repeat>..
----------------------------------------------------------------------
Ran 1768 tests in 4.235s

OK (KNOWNFAIL=1)

That is, it just prints '.' and letter codes.

-- 
Jarrod Millman Computational Infrastructure for Research Labs 10 Giannini Hall, UC Berkeley phone: 510.643.4014 http://cirl.berkeley.edu/
On Sun, Nov 23, 2008 at 10:57 PM, Jarrod Millman <millman@berkeley.edu> wrote:
respect, but we will need to do more. Just to state my goal: I would like to change the scipy.test so that it behaves more like numpy.test:
In [1]: numpy.test('full')
Running unit tests for numpy
NumPy version 1.3.0.dev6099
NumPy is installed in /home/jarrod/usr/local/lib64/python2.5/site-packages/numpy
Python version 2.5.1 (r251:54863, Jul 10 2008, 17:25:56) [GCC 4.1.2 20070925 (Red Hat 4.1.2-33)]
nose version 0.10.3
..............<dots repeat>.........K......<dots repeat>..
----------------------------------------------------------------------
Ran 1768 tests in 4.235s
OK (KNOWNFAIL=1)
That is it just prints '.' and letter codes.
Yup, that would be ideal. Just to note, some of the printouts are coming directly (I suspect) from inside the fortran code or the wrappers, so disabling those may take a bit more work. Cheers, f
On Sun, Nov 23, 2008 at 11:57:22PM -0800, Fernando Perez wrote:
Yup, that would be ideal. Just to note, some of the printouts are coming directly (I suspect) from inside the fortran code or the wrappers, so disabling those may take a bit more work.
Actually, it is not that hard, playing with file descriptors. The implementation in the IPython codebase is in: http://bazaar.launchpad.net/%7Eipython-dev/ipython/trunk/annotate/1149?file_... Gaël
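The linked IPython implementation is not shown here. As a hedged sketch of the general file-descriptor technique (not the IPython code), output written directly to descriptor 1 by compiled code can be diverted with os.dup and os.dup2. The helper name call_with_fd_captured is made up for this example.

```python
import os
import sys
import tempfile


def call_with_fd_captured(func, *args, fd=1):
    """Call func while file descriptor fd is redirected to a temp file.

    Returns (result, captured_bytes). This catches output written below
    the Python level, e.g. by Fortran or C code printing to stdout.
    """
    sys.stdout.flush()          # do not mix earlier output into the capture
    saved = os.dup(fd)          # keep a handle on the real destination
    with tempfile.TemporaryFile() as sink:
        os.dup2(sink.fileno(), fd)
        try:
            result = func(*args)
        finally:
            sys.stdout.flush()  # push Python-level buffers before restoring
            os.dup2(saved, fd)  # put the original descriptor back
            os.close(saved)
        sink.seek(0)
        captured = sink.read()
    return result, captured
```

A test runner could wrap each test call this way and discard the captured bytes unless the test fails.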
Nathan Bell wrote:
In the past, I would always run 'nosetests scipy' before committing changes to SVN. Due to the current state of the unit tests, I don't anymore, and I suspect I'm not alone.
Here are the main offenders on my system:
scipy.stats. I appreciate the fact that rigorous testing on this module takes time, but 4 minutes on a 2.4GHz Core 2 system is unreasonable. IMO 20 seconds is a reasonable upper bound. Essential tests that don't meet this time constraint should be filtered out of the default test suite.
I don't really agree with that reasoning. Tests are useful; the more that run by default, the better. Tests which are not run by default are nearly useless IMO, since not many people would run tests with options. Since there are ways to restrict tests to a meaningful subset (per subpackage), I think this is enough. If some tests can be made to run faster, then OK, but not if it requires losing some test coverage. Why does the test time matter so much to you?
scipy.weave Takes 2.5 minutes and litters my screen with a few hundred lines like, /tmp/tmpcRI5WR/sc_3a5c7ad3ac45a98d03cd9168232f7d8f1.cpp:618: warning: deprecated conversion from and dumps a bunch of .cpp and .so files in the current working directory.
Some minor offenders:
scipy.interpolate Emits several UserWarnings (should be filtered out with warnings.filterwarnings).
scipy.io Several DeprecationWarnings
scipy.lib A dozen lines like "zcopy:n=3"
scipy.linalg Outputs ATLAS info and a dozen lines like "zcopy:n=3".
Yes, those should be cleaned (except maybe DeprecationWarnings - if the deprecated functions are the one being tested: I am not sure what we should do in that case). But that's a lot of grunt work ; particularly scipy.lib, which should be removed IMHO (as for now, it is mostly redundant with scipy.linalg, except for some unit tests which would be useful to put into scipy.linalg).
I'd like to be able to run the entire battery of tests in about a minute with minimal unnecessary output.
I hear you, I would like the whole build + test process for scipy to be faster too :) If 4 minutes sounds long, what about build + test on windows, which takes at least 20 minutes (to multiply by three when I build the superpack - and the process can't even be controlled remotely) ! David
On Mon, Nov 24, 2008 at 01:09, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
Nathan Bell wrote:
In the past, I would always run 'nosetests scipy' before committing changes to SVN. Due to the current state of the unit tests, I don't anymore, and I suspect I'm not alone.
Here are the main offenders on my system:
scipy.stats. I appreciate the fact that rigorous testing on this module takes time, but 4 minutes on a 2.4GHz Core 2 system is unreasonable. IMO 20 seconds is a reasonable upper bound. Essential tests that don't meet this time constraint should be filtered out of the default test suite.
I don't really agree with that reasoning. Tests are useful; the more that run by default, the better. Tests which are not run by default are nearly useless IMO, since not many people would run tests with options. Since there are ways to restrict tests to a meaningful subset (per subpackage), I think this is enough. If some tests can be made to run faster, then OK, but not if it requires losing some test coverage.
Why does the test time matter so much to you ?
You want to be able to run the main automated test suite every time before you do a check in, and more frequently while you are working on something, so that you make sure you didn't break things you weren't working on. This is a fairly well-accepted principle of testing. No one is suggesting that tests should be deleted, just that they might be moved (or marked) out of the main test suite. Multiple test suites for different purposes and constraints is far from uncommon. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On Mon, Nov 24, 2008 at 4:44 PM, Robert Kern <robert.kern@gmail.com> wrote:
Why does the test time matter so much to you ?
You want to be able to run the main automated test suite every time before you do a check in, and more frequently while you are working on something, so that you make sure you didn't break things you weren't working on. This is a fairly well-accepted principle of testing.
Yes, I understand the different suites thing, but that's not what we are talking about, right ? Weave, for example, does not output files when the default test suite is run.
Multiple test suites for different purposes and constraints is far from uncommon.
Sure, I don't argue against different purposes test suite, but about what goes in the default. David
On Sun, Nov 23, 2008 at 11:59 PM, David Cournapeau <cournape@gmail.com> wrote:
Sure, I don't argue against different purposes test suite, but about what goes in the default.
I would like to see different defaults (one for the development trunk; one for binary alpha, beta, and rc releases; and possibly one for stable releases).

For the development trunk, we need a quick and relatively complete default test suite. This will make it easier for developers to adopt the habit of running the default test suite before checking in any changes to the trunk. If a developer wants to run a more complete test suite, they should be able to run the full suite whenever they want.

For the binary alpha, beta, and rc releases, we want the default test suite to be as complete as possible so that we get better feedback from early adopters without having to get them to run the tests with many options. I was imagining that, when building the binaries, I would need to set a flag, which could easily be done in the scripts for building binaries: http://projects.scipy.org/scipy/scipy/browser/trunk/tools/win32/build_script... or I could use the release flag, which is already being set: http://projects.scipy.org/scipy/scipy/browser/trunk/scipy/version.py

For binaries for stable releases, we should decide whether we want the default test suite to favor speed or completeness.

Thoughts?

-- 
Jarrod Millman Computational Infrastructure for Research Labs 10 Giannini Hall, UC Berkeley phone: 510.643.4014 http://cirl.berkeley.edu/
2008/11/24 Jarrod Millman <millman@berkeley.edu>:
On Sun, Nov 23, 2008 at 11:59 PM, David Cournapeau <cournape@gmail.com> wrote:
Sure, I don't argue against different purposes test suite, but about what goes in the default.
I would like to see different defaults (one for the development trunk; one for binary alpha, beta, and rc releases; and possibly stable releases).
There may be issues with this when people modify some packages deeply, but in a way not caught by the standard test battery. And then, when you go to alpha, you get hundreds of failing tests, and it's so overwhelming that you have to start the test battery from scratch. It could be better to have a scipy buildbot, like for numpy, that runs all the tests, and people before committing just check the most important tests. This way, you don't get hundreds of failing tests once you reactivate them, you can still track where the errors come from, and the test time for a single developer remains small (although, he could only check the result of the tests on the package he's modifying). Matthieu -- Information System Engineer, Ph.D. Website: http://matthieu-brucher.developpez.com/ Blogs: http://matt.eifelle.com and http://blog.developpez.com/?blog=92 LinkedIn: http://www.linkedin.com/in/matthieubrucher
On Mon, Nov 24, 2008 at 12:26 AM, Matthieu Brucher <matthieu.brucher@gmail.com> wrote:
There may be issues with this when people modify some packages deeply, but in a way not caught by the standard test battery. And then, when you go to alpha, you get hundreds of failing tests, and it's so overwhelming that you have to start the test battery from scratch.
I don't think that is very likely. All the tests are still there and would be run if you ran scipy.test('full'). I would imagine that many people would run the full test-suite regularly. A developer could run scipy.test() for every change they make (if it only takes a short amount of time) and then run scipy.test('full') just once or twice a day.
It could be better to have a scipy buildbot, like for numpy, that runs all the tests, and people before committing just check the most important tests. This way, you don't get hundreds of failing tests once you reactivate them, you can still track where the errors come from, and the test time for a single developer remains small (although, he could only check the result of the tests on the package he's modifying).
I think this should be done anyway, but I don't think it solves the problem for developers who want to quickly run the default test suite regularly. I think that decorating more of the tests as slow would solve this problem.

The other problem is that we want binaries of alpha, beta, and rc releases to run the full test suite by default since, in this instance, time isn't as important but completeness is. This could be solved, for instance, by changing label to be 'full' by default in nosetester.py for tagged releases. This could be done by running a script that takes care of it, or by adding some logic that changes the behavior if a flag (e.g., release in version.py) is True.

-- 
Jarrod Millman Computational Infrastructure for Research Labs 10 Giannini Hall, UC Berkeley phone: 510.643.4014 http://cirl.berkeley.edu/
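The flag-based idea can be sketched in a few lines. This is a hypothetical illustration: run_tests stands in for scipy.test() and only resolves which label would be used, and the release argument plays the role of the flag in version.py.

```python
def default_test_label(release):
    """Pick the default label: everything for releases, quick set for trunk."""
    return "full" if release else "fast"


def run_tests(label=None, release=False):
    # Hypothetical stand-in for scipy.test(); it only resolves the label.
    if label is None:
        label = default_test_label(release)
    return label
```

An explicit label always wins, so a developer on trunk can still ask for 'full' and a user of a release binary can still ask for 'fast'.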
On Mon, Nov 24, 2008 at 2:09 AM, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
I don't really agree with that reasoning. Tests are useful; the more that run by default, the better. Tests which are not run by default are nearly useless IMO, since not many people would run tests with options. Since there are ways to restrict tests to a meaningful subset (per subpackage), I think this is enough. If some tests can be made to run faster, then OK, but not if it requires losing some test coverage.
As a general rule, more tests are better. OTOH, tests that people *choose not to run* are not helpful.
Why does the test time matter so much to you ?
I want to know that my changes to scipy.sparse haven't adversely affected other parts of scipy. To my knowledge, there are only a few such modules (io, maxentropy, spatial, and sparse.linalg), so I could, in principle, test those directly and call it a day. However, it's possible that modules that depend on those modules will expose errors that would be hidden otherwise.
I hear you, I would like the whole build + test process for scipy to be faster too :) If 4 minutes sounds long, what about build + test on windows, which takes at least 20 minutes (to multiply by three when I build the superpack - and the process can't even be controlled remotely) !
Still, I'm not giving up my C++ templates :) -- Nathan Bell wnbell@gmail.com http://graphics.cs.uiuc.edu/~wnbell/
Nathan Bell wrote:
As a general rule, more tests are better. OTOH, tests that people *choose not to run* are not helpful.
Agreed. But you can choose not to run scipy.stats, right ?
I want to know that my changes to scipy.sparse haven't adversely affected other parts of scipy. To my knowledge, there are only a few such modules (io, maxentropy, spatial, and sparse.linalg), so I could, in principle, test those directly and call it a day. However, it's possible that modules that depend on those modules will expose errors that would be hidden otherwise.
Yes, but if you don't run a subset of the tests at all, you run into the same kind of issues anyway, no ? In Scipy, most packages are relatively independent from each other, so a 'fast' mode to check that you did not screw up badly (some import stuff, etc...) is enough most of the time. IOW, I prefer something where you have to explicitly disregard tests rather than explicitly include them.
Still, I'm not giving up my C++ templates :)
:) David
On Mon, Nov 24, 2008 at 3:38 AM, David Cournapeau <david@ar.media.kyoto-u.ac.jp> wrote:
Agreed. But you can choose not to run scipy.stats, right ?
That's right, and I don't currently run those tests. But can a person who changes something in scipy.linalg choose not to run those tests?
Yes, but if you don't run a subset of the tests at all, you run into the same kind of issues anyway, no ? In Scipy, most packages are relatively independent from each other, so a 'fast' mode to check that you did not screw up badly (some import stuff, etc...) is enough most of the time.
IOW, I prefer something where you have to explicitly disregard tests rather than explicitly include them.
I don't understand your argument. You propose to make 'fast' be the thing that developers run before committing changes to SVN and then argue that this will lead to more tests being run? Who runs the slow tests? If you make the default too slow, the *de facto default* will be 'fast' or None :) Passing 'nosetests scipy' should be the standard for modifications to scipy. It should be as comprehensive as possible while running in ~60 seconds. We can have an additional suite of 'slow' tests for releases, build bots, and paranoid developers. -- Nathan Bell wnbell@gmail.com http://graphics.cs.uiuc.edu/~wnbell/
Nathan Bell wrote:
I don't understand your argument. You propose to make 'fast' be the thing that developers run before committing changes to SVN and then argue that this will lead to more tests being run? Who runs the slow tests?
Users. But well, it looks like I am in minority, so let's go for your suggestion. David
Nathan Bell wrote:
I don't understand your argument. You propose to make 'fast' be the thing that developers run before committing changes to SVN and then argue that this will lead to more tests being run? Who runs the slow tests?
Users. But well, it looks like I am in minority, so let's go for your suggestion.
David
Well, this looks like "unit tests" versus "integration tests". Sounds good. Many users use the svn version (for various reasons).
From their point of view, it could be a problem when the svn is really broken. Use a small test suite to be reasonably sure the svn is not broken, and extensive tests run once per X and/or by users (after a fresh install).
Xavier
On Mon, Nov 24, 2008 at 4:39 PM, Xavier Gnata <xavier.gnata@gmail.com> wrote:
Nathan Bell wrote:
I don't understand your argument. You propose to make 'fast' be the thing that developers run before committing changes to SVN and then argue that this will lead to more tests being run? Who runs the slow tests?
Users. But well, it looks like I am in minority, so let's go for your suggestion.
David
Well looks like "unitary tests" versus "integration tests". Sounds good. Many users use the svn (for various reasons).
From their point of view, it could be a problem when the svn is really broken. Use a small test suite to be reasonably sure the svn is not broken, and extensive tests run once per X and/or by users (after a fresh install).
Xavier _______________________________________________ Scipy-dev mailing list Scipy-dev@scipy.org http://projects.scipy.org/mailman/listinfo/scipy-dev
Now that 0.7 has been tagged, shall I decorate my tests as slow?

nosetests -A "not slow" or scipy.test() will then exclude the 4-5 minutes of distributions tests. Without my tests (default setting with not slow) scipy.stats takes 4-6 seconds.

I started to profile one of the tests and some distributions are very slow. They provide the correct results, but the generic way of calculating takes a lot of time.

Example: for the R distribution, rdist, the test runs two Kolmogorov-Smirnov tests and has about 4 million function calls to the _pdf function, I guess mostly to generate 2000 random variables in a generic way based only on the pdf.

Selectively tagging and excluding time-expensive methods is too much work for me right now, because which methods are expensive depends on what methods are defined in each specific distribution.

Josef
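For readers unfamiliar with attribute-based selection, here is a minimal sketch of the idea behind decorating tests as slow and running with -A "not slow". It does not use nose; the slow decorator and select_tests helper are simplified illustrations, and the two check functions are hypothetical names.

```python
def slow(func):
    """Mark a test as slow by setting an attribute on it."""
    func.slow = True
    return func


def select_tests(tests, include_slow=False):
    """Return the tests to run, dropping slow ones unless asked for."""
    return [t for t in tests
            if include_slow or not getattr(t, "slow", False)]


def check_norm_moments():
    # Hypothetical quick test.
    pass


@slow
def check_rdist_kstest():
    # Hypothetical expensive test, excluded from the default run.
    pass
```

With this arrangement the default run picks up only check_norm_moments, while a full run passes include_slow=True and gets both.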
On Mon, Nov 24, 2008 at 16:58, <josef.pktd@gmail.com> wrote:
On Mon, Nov 24, 2008 at 4:39 PM, Xavier Gnata <xavier.gnata@gmail.com> wrote:
Nathan Bell wrote:
I don't understand your argument. You propose to make 'fast' be the thing that developers run before committing changes to SVN and then argue that this will lead to more tests being run? Who runs the slow tests?
Users. But well, it looks like I am in minority, so let's go for your suggestion.
David
Well looks like "unitary tests" versus "integration tests". Sounds good. Many users use the svn (for various reasons).
From their point of view, it could be a problem when the svn is really broken. Use a small test suite to be reasonably sure the svn is not broken, and extensive tests run once per X and/or by users (after a fresh install).
Xavier
Now that 0.7 has been tagged, shall I decorate my tests as slow?
nosetests -A "not slow" or scipy.test() will then exclude the 4-5 minutes of distributions tests. Without my tests (default setting with not slow) scipy.stats takes 4-6 seconds.
I started to profile one of the tests and some distributions are very slow, they provide the correct results but the generic way of calculating takes a lot of time.
example: For the R distribution, rdist, the test runs two kolmogorov smirnov tests and has about 4 million function calls to the _pdf function, I guess mostly to generate 2000 random variables in a generic way based only on the pdf.
I don't think we should be doing any K-S tests of the distributions in the test suite. Once we have validated that our algorithms work (using these tests, with large sample sizes), we should generate a small number of variates from each distribution using a fixed seed. The unit tests in the main test suite will simply generate the same number of variates with the same seed and directly compare the results. If we start to get failures, then we can recheck using the K-S tests that the algorithm is still good, and regenerate the reference variates. The only problem I can see is if there are platform-dependent results for some distributions, but that would be very good to figure out now, too. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
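The fixed-seed regression approach can be sketched as follows. To stay self-contained, this uses the standard library's random module as a stand-in for numpy.random and scipy.stats, and the reference values are generated in place; in a real suite they would be stored literals produced once by a validated implementation.

```python
import random

SEED = 12345


def draw_variates(seed, size=5, rate=2.0):
    """Draw a few exponential variates from a freshly seeded generator."""
    rng = random.Random(seed)
    return [rng.expovariate(rate) for _ in range(size)]


# Assumption for the sketch: a real suite would hard-code these values
# after validating the algorithm with K-S tests on large samples.
REFERENCE = draw_variates(SEED)


def check_against_reference():
    """Regenerate the variates with the same seed and compare directly."""
    got = draw_variates(SEED)
    assert len(got) == len(REFERENCE)
    for a, b in zip(got, REFERENCE):
        # Small relative tolerance to allow for platform rounding.
        assert abs(a - b) <= 1e-12 * max(1.0, abs(b))
    return True
```

The check is cheap because it draws only a handful of numbers, and a failure signals that the sampling algorithm changed, at which point the slower statistical tests can be rerun.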
I don't think we should be doing any K-S tests of the distributions in the test suite. Once we have validated that our algorithms work (using these tests, with large sample sizes), we should generate a small number of variates from each distribution using a fixed seed. The unit tests in the main test suite will simply generate the same number of variates with the same seed and directly compare the results. If we start to get failures, then we can recheck using the K-S tests that the algorithm is still good, and regenerate the reference variates.
The only problem I can see is if there are platform-dependent results for some distributions, but that would be very good to figure out now, too.
-- Robert Kern
Currently I am using generated random variables for two purposes:

* To test whether the random number generator is correct, kstest or something similar would be necessary, with a large enough sample size for the tests to have reasonable power, similar to the initial kstest in the test suite. (BTW, there are still 2 known failures in mtrand.)

* In the second type of test, I use the sample properties as a benchmark for the theoretical properties. For this purpose any randomness could be completely removed. Currently the only outside information the tests use comes from numpy.random, e.g. I compare sample moments with theoretical moments. If we have a benchmark for what the true theoretical values should be, then these could be compared directly, without generating a random sample. However, I wasn't willing to go to R and generate benchmark data for 100 or so distributions, so I used the sample properties. Using sample properties and internal consistency between specific and generic methods creates, I think, quite reliable tests.

For this case, we could now create our own benchmark, assuming our algorithms are correct, and use it for regression tests. A simple script should be able to create the benchmark data.

One disadvantage of this is that, if we want to test a distribution with different parameter values, we still need to get the benchmark data for the new parameters. When I made changes to, for example, the behavior of a distribution method at an extreme or close-to-corner value, I was quite glad I could rely on my tests. I just needed to add a test case with new parameters and the tests checked all methods for this case, without me having to specify expected results for each method.

I don't know how everyone is handling this, but I need to keep track of a public test suite (for those not working on distributions) and a "development" test suite, which is much stricter, and which I use when I make changes directly to the distribution module.

But I agree, for the purpose of a regression test suite, there is a large amount of simplification that can be done to my (bug-hunting) test suite.

Josef
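The second kind of test described above, comparing sample moments with theoretical moments, can be illustrated with a small sketch. It uses the standard library instead of numpy.random and scipy.stats, and the exponential distribution, sample size, seed, and tolerance are all choices made for the example.

```python
import random
import statistics


def check_exponential_moments(rate=2.0, size=20000, seed=0, tol=0.05):
    """Compare sample mean and variance with the theoretical values.

    For an exponential distribution with the given rate, the mean is
    1/rate and the variance is 1/rate**2.
    """
    rng = random.Random(seed)  # fixed seed avoids spurious failures
    sample = [rng.expovariate(rate) for _ in range(size)]
    mean_error = abs(statistics.mean(sample) - 1.0 / rate)
    var_error = abs(statistics.variance(sample) - 1.0 / rate ** 2)
    return mean_error < tol and var_error < tol
```

The trade-off raised in the thread shows up directly in the parameters: a larger size gives the check more power but costs more time, while a looser tol runs reliably on small samples but catches fewer bugs.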
I reorganized the tests for stats.distributions.

On my notebook, I now have 25 to 30 seconds for `nosetests -A "not slow" scipy.stats` (equivalent to scipy.stats.test()). Without the "not slow" option, I still have around 5 minutes.

I did not change the actual tests. I merged some tests to reuse generated random variables and, after profiling, I moved some of the slowest continuous distributions into a separate test which is decorated with slow. The basic tests, including kstest, are now run by default for 70 out of 84 continuous distributions and all discrete distributions. I also reduced the sample size and fixed the seed, and I hope that we don't get spurious random failures.

I hope this time consumption of the tests is OK for now. Further test optimization has to wait.

Josef
On Wed, Nov 26, 2008 at 4:29 PM, <josef.pktd@gmail.com> wrote:
I reorganized the tests for stats.distributions.
On my notebook, I have now 25 to 30 seconds for `nosetests -A "not slow" scipy.stats` (equivalents to scipy.stats.test() ); Without the "not slow" option, I still have around 5 minutes
I hope this time consumption of the tests is ok for now. Further test optimization has to wait.
Thanks Josef, that makes things a lot better. Also, thank you for your other improvements to scipy.stats. -- Nathan Bell wnbell@gmail.com http://graphics.cs.uiuc.edu/~wnbell/
On Mon, Nov 24, 2008 at 1:39 AM, Nathan Bell <wnbell@gmail.com> wrote:
Passing 'nosetests scipy' should be the standard for modifications to scipy. It should be as comprehensive as possible while running in ~60 seconds. We can have an additional suite of 'slow' tests for releases, build bots, and paranoid developers.
I completely agree. And it would be fairly easy to do without much work, which is a big plus. -- Jarrod Millman Computational Infrastructure for Research Labs 10 Giannini Hall, UC Berkeley phone: 510.643.4014 http://cirl.berkeley.edu/
participants (10)
- David Cournapeau
- David Cournapeau
- Fernando Perez
- Gael Varoquaux
- Jarrod Millman
- josef.pktd@gmail.com
- Matthieu Brucher
- Nathan Bell
- Robert Kern
- Xavier Gnata