Re: [SciPy-Dev] New subpackage: scipy.data
I wanted to extend Charles' comment that in addition to the test problems there are optimization *benchmarks*. For instance, in scipy/benchmarks/benchmarks/linprog_benchmark_files there are ~90 benchmark problems (all NETLIB LP benchmarks as .npz files) totaling ~12MB. The current linprog benchmark only uses two of them by default. Sounds like these should be moved if space is such a concern.
On Wed, Apr 4, 2018 at 7:54 AM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Mon, Apr 2, 2018 at 12:50 PM, Warren Weckesser < warren.weckesser@gmail.com> wrote:
On Fri, Mar 30, 2018 at 8:17 PM, Ralf Gommers <ralf.gommers@gmail.com> wrote:
On Fri, Mar 30, 2018 at 12:03 PM, Eric Larson <larson.eric.d@gmail.com> wrote:
Top-level module for them alone sounds overkill, and I'm not sure if
discoverability alone is enough.
Fine by me. And if we follow the idea that these should be added sparingly, we can maintain discoverability without it growing out of hand by populating the See Also sections of each function.
I agree with this; the 2 images and 1 ECG signal (to be added) that we have don't justify a top-level module. We don't want to grow more than the absolute minimum of datasets. The package is already very large, which is problematic in certain cases. E.g. numpy + scipy still fits in the AWS Lambda limit of 50 MB, but there's not much margin.
Note: this is a reply to the thread, and not specifically to Ralf's comments (but those are included).
After reading all the replies, the first question that comes to mind is: should SciPy have *any* datasets?
I think this question has already been answered: we have had functions that return images in scipy.misc for a long time, and I don't recall anyone ever suggesting that these be removed. (Well, there was lena(), but I don't think anyone had a problem with adding a replacement image.) And the pull request for the ECG dataset has been merged (added to scipy.misc), so there is current support among the developers for providing datasets.
So the remaining questions are: (1) Where do the datasets reside? (2) What are the criteria for adding a new dataset?
Here's my 2¢:
(1) Where do the datasets reside?
My preference is to keep all the datasets in the top-level module scipy.datasets. Robert preferred this module for discoverability, and I agree. By having all the datasets in one place, anyone can easily see what is available. Teachers and others developing educational material know where to find source material for examples. Developers, too, can easily look for examples to use in our docstrings or tutorials. (By the way, adding examples to the docstrings of all functions is an ongoing effort: https://github.com/scipy/scipy/issues/7168.)
Also, there are many well-known datasets that could be used as examples for multiple scipy packages. For a concrete example, a dataset that I could see adding to scipy is the Hald cement dataset. SciPy should eventually have an implementation of the PCA decomposition, and it could conceivably live in scipy.linalg. It would be reasonable to use the Hald data in the docstrings of the new PCA function(s) (cf. https://www.mathworks.com/help/stats/pca.html). At the same time, the Hald data could enrich the docstrings of some functions in scipy.stats.
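To make this concrete, here is a rough sketch of the kind of PCA helper such a function might provide, built on scipy.linalg.svd. The function name `pca` and the synthetic stand-in data are assumptions for illustration only: scipy ships neither a PCA function nor the Hald data today.

```python
# Hypothetical PCA helper of the sort the thread envisions for scipy.linalg.
# Real Hald data is not bundled, so a random 13x4 array (Hald's shape:
# 13 observations, 4 regressors) stands in for it here.
import numpy as np
from scipy.linalg import svd

def pca(X):
    """Return principal components (as columns) and scores of the rows of X."""
    Xc = X - X.mean(axis=0)                  # center each column
    U, s, Vt = svd(Xc, full_matrices=False)  # thin SVD of the centered data
    return Vt.T, U * s                       # components, scores

rng = np.random.default_rng(0)
X = rng.normal(size=(13, 4))                 # stand-in for the Hald data
components, scores = pca(X)
print(components.shape, scores.shape)        # (4, 4) (13, 4)
```

By construction, `scores @ components.T` reproduces the centered data exactly, which is the property a docstring example could demonstrate.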
Similarly, Fisher's iris dataset provides a well-known example that could be used in docstrings in both scipy.cluster and scipy.stats.
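As a sketch of what such a docstring example could look like, the following clusters a synthetic two-blob stand-in with scipy.cluster.vq; the iris data itself is not bundled with scipy, so the data here is generated on the fly.

```python
# Clustering a synthetic stand-in for something like the iris data
# with scipy.cluster.vq (the real iris dataset is not shipped).
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(50, 2))
obs = whiten(np.vstack([blob_a, blob_b]))  # scale each feature to unit variance

centroids, distortion = kmeans(obs, 2)     # find two cluster centers
labels, _ = vq(obs, centroids)             # assign each point to a centroid
print(centroids.shape)                     # (2, 2)
```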
(2) What are the criteria for adding a new dataset?
So far, the only compelling reason I can see to even have datasets is to have interesting examples in the docstrings (or at least in our tutorials). For example, the docstring for scipy.ndimage.gaussian_filter and several other transformations in ndimage use the image returned by scipy.misc.ascent():
https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.gaussian_filter.html
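That docstring example reduces to a single gaussian_filter call. The self-contained variant below filters a synthetic impulse image instead of misc.ascent(), so it runs without the bundled data:

```python
# gaussian_filter works on any 2-D array; a single bright pixel makes
# the smoothing behavior easy to verify without a bundled image.
import numpy as np
from scipy.ndimage import gaussian_filter

img = np.zeros((64, 64))
img[32, 32] = 1.0                        # single bright pixel (an impulse)
blurred = gaussian_filter(img, sigma=3)  # spreads it into a Gaussian spot

print(blurred.sum())  # smoothing preserves total intensity, ~1.0
```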
I could see the benefit of having well-known datasets such as Fisher's iris data, the Hald cement data, and some version of a sunspot activity time series, to be used in the docstrings in scipy.stats, scipy.signal, scipy.cluster, scipy.linalg, and elsewhere.
Stéfan expressed regret about including datasets in scikit-image. The main issue seems to be "bloat". Scikit-image is an image processing library, so the datasets used there are likely all images, and there is a minimum size for a sample image to be useful as an example. For scipy, we already have two images, and I don't think we'll need more. The newly added ECG dataset is 116K (which is less than the existing image datasets: "ascent.dat" is 515K and "face.dat" is 1.5M). The potential datasets that I mentioned above (Hald, iris, sunspots) are all very small. If we are conservative about what we include, and focus on datasets chosen specifically to demonstrate scipy functionality, we should be able to avoid dataset bloat.
This leads to my proposal for the criteria for adding a dataset:
(a) Not too big. The size of a dataset should not exceed $MAX (but I don't have a good suggestion for what $MAX should be at the moment).
(b) The dataset should be well-known, where "well-known" means that the dataset is one that is already widely used as an example and many people will know it by name (e.g. the iris dataset), or the dataset is a sample of a common signal type or format (e.g. an ECG signal, or an image such as misc.ascent).
(c) We actually *use* the dataset in one of *our* docstrings or tutorials. I don't think our datasets package should become a repository of interesting scientific data with no connection to the scipy code. Its purpose should be to enrich our documentation. (Note that by this criterion, the recently added ECG signal would not qualify!)
To summarize: I'm in favor of scipy.datasets, a conservatively curated subpackage containing well-known datasets.
There are also some standard functions used for testing optimization. I wonder if it would be reasonable to make those public?
Chuck
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org https://mail.python.org/mailman/listinfo/scipy-dev
-- Matt Haberland Assistant Adjunct Professor in the Program in Computing Department of Mathematics 6617A Math Sciences Building, UCLA
On Fri, 2018-04-06 at 12:08 -0700, Matt Haberland wrote:
I wanted to extend Charles' comment that in addition to the test problems there are optimization *benchmarks*. For instance, in scipy/benchmarks/benchmarks/linprog_benchmark_files there are ~90 benchmark problems (all NETLIB LP benchmarks as .npz files) totaling ~12MB. The current linprog benchmark only uses two of them by default. Sounds like these should be moved if space is such a concern.
Note that "pip install scipy" does not install those files, so it is of less concern for deployment on resource-limited machines. Pauli
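For anyone unfamiliar with the format, those benchmark problems are ordinary NumPy .npz archives. Below is a hedged sketch of how a benchmark runner might load one and hand it to linprog; the key names (c, A_ub, b_ub) are an assumption for illustration, not the actual layout of the files in linprog_benchmark_files, so a tiny stand-in problem is written in memory first.

```python
# Round-trip a toy LP through the .npz format, the way a benchmark
# runner might load a stored problem. Key names are assumed.
import io
import numpy as np
from scipy.optimize import linprog

# Write a stand-in problem: minimize -x - y subject to x + y <= 1, x, y >= 0.
buf = io.BytesIO()
np.savez(buf, c=np.array([-1.0, -1.0]),
         A_ub=np.array([[1.0, 1.0]]), b_ub=np.array([1.0]))
buf.seek(0)

# Load it back and solve.
data = np.load(buf)
res = linprog(c=data["c"], A_ub=data["A_ub"], b_ub=data["b_ub"])
print(res.fun)  # optimal objective, -1.0 for this toy problem
```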
participants (2): Matt Haberland, Pauli Virtanen