Re: [SciPy-Dev] New subpackage: scipy.data
I wanted to extend Charles' comment that in addition to the test problems there are optimization *benchmarks*. For instance, in scipy/benchmarks/benchmarks/linprog_benchmark_files there are ~90 benchmark problems (all NETLIB LP benchmarks as .npz files) totaling ~12MB. The current linprog benchmark only uses two of them by default. Sounds like these should be moved if space is such a concern.
On Wed, Apr 4, 2018 at 7:54 AM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Mon, Apr 2, 2018 at 12:50 PM, Warren Weckesser < warren.weckesser@gmail.com> wrote:
On Fri, Mar 30, 2018 at 8:17 PM, Ralf Gommers <ralf.gommers@gmail.com> wrote:
On Fri, Mar 30, 2018 at 12:03 PM, Eric Larson <larson.eric.d@gmail.com> wrote:
Top-level module for them alone sounds overkill, and I'm not sure if
discoverability alone is enough.
Fine by me. And if we follow the idea that these should be added sparingly, we can maintain discoverability without it growing out of hand by populating the See Also sections of each function.
I agree with this; the 2 images and 1 ECG signal (to be added) that we have don't justify a top-level module. We don't want to grow more than the absolute minimum of datasets. The package is already very large, which is problematic in certain cases. E.g. numpy + scipy still fits in the AWS Lambda limit of 50 MB, but there's not much margin.
Note: this is a reply to the thread, and not specifically to Ralf's comments (but those are included).
After reading all the replies, the first question that comes to mind is: should SciPy have *any* datasets?
I think this question has already been answered: we have had functions that return images in scipy.misc for a long time, and I don't recall anyone ever suggesting that these be removed. (Well, there was lena(), but I don't think anyone had a problem with adding a replacement image.) And the pull request for the ECG dataset has been merged (added to scipy.misc), so there is current support among the developers for providing datasets.
So the remaining questions are: (1) Where do the datasets reside? (2) What are the criteria for adding a new dataset?
Here's my 2¢:
(1) Where do the datasets reside?
My preference is to keep all the datasets in the top-level module scipy.datasets. Robert preferred this module for discoverability, and I agree. By having all the datasets in one place, anyone can easily see what is available. Teachers and others developing educational material know where to find source material for examples. Developers, too, can easily look for examples to use in our docstrings or tutorials. (By the way, adding examples to the docstrings of all functions is an ongoing effort: https://github.com/scipy/scipy/issues/7168.)
Also, there are many well-known datasets that could be used as examples for multiple scipy packages. For a concrete example, a dataset that I could see adding to scipy is the Hald cement dataset. SciPy should eventually have an implementation of the PCA decomposition, and it could conceivably live in scipy.linalg. It would be reasonable to use the Hald data in the docstrings of the new PCA function(s) (cf. https://www.mathworks.com/help/stats/pca.html). At the same time, the Hald data could enrich the docstrings of some functions in scipy.stats.
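To make this concrete, here is a rough sketch of the kind of PCA helper such a function might provide, built on scipy.linalg.svd. The function name `pca` and the synthetic stand-in data are assumptions for illustration only: scipy ships neither a PCA function nor the Hald data today.

```python
# Hypothetical PCA helper of the sort the thread envisions for scipy.linalg.
# Real Hald data is not bundled, so a random 13x4 array (Hald's shape:
# 13 observations, 4 regressors) stands in for it here.
import numpy as np
from scipy.linalg import svd

def pca(X):
    """Return principal components (as columns) and scores of the rows of X."""
    Xc = X - X.mean(axis=0)                  # center each column
    U, s, Vt = svd(Xc, full_matrices=False)  # thin SVD of the centered data
    return Vt.T, U * s                       # components, scores

rng = np.random.default_rng(0)
X = rng.normal(size=(13, 4))                 # stand-in for the Hald data
components, scores = pca(X)
print(components.shape, scores.shape)        # (4, 4) (13, 4)
```

By construction, `scores @ components.T` reproduces the centered data exactly, which is the property a docstring example could demonstrate.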
Similarly, Fisher's iris dataset provides a well-known example that could be used in docstrings in both scipy.cluster and scipy.stats.
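As a sketch of what such a docstring example could look like, the following clusters a synthetic two-blob stand-in with scipy.cluster.vq; the iris data itself is not bundled with scipy, so the data here is generated on the fly.

```python
# Clustering a synthetic stand-in for something like the iris data
# with scipy.cluster.vq (the real iris dataset is not shipped).
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(50, 2))
obs = whiten(np.vstack([blob_a, blob_b]))  # scale each feature to unit variance

centroids, distortion = kmeans(obs, 2)     # find two cluster centers
labels, _ = vq(obs, centroids)             # assign each point to a centroid
print(centroids.shape)                     # (2, 2)
```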
(2) What are the criteria for adding a new dataset?
So far, the only compelling reason I can see to even have datasets is to have interesting examples in the docstrings (or at least in our tutorials). For example, the docstring for scipy.ndimage.gaussian_filter and several other transformations in ndimage use the image returned by scipy.misc.ascent():
https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.gaussian_filter.html
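That docstring example reduces to a single gaussian_filter call. The self-contained variant below filters a synthetic impulse image instead of misc.ascent(), so it runs without the bundled data:

```python
# gaussian_filter works on any 2-D array; a single bright pixel makes
# the smoothing behavior easy to verify without a bundled image.
import numpy as np
from scipy.ndimage import gaussian_filter

img = np.zeros((64, 64))
img[32, 32] = 1.0                        # single bright pixel (an impulse)
blurred = gaussian_filter(img, sigma=3)  # spreads it into a Gaussian spot

print(blurred.sum())  # smoothing preserves total intensity, ~1.0
```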
I could see the benefit of having well-known datasets such as Fisher's iris data, the Hald cement data, and some version of a sunspot activity time series, to be used in the docstrings in scipy.stats, scipy.signal, scipy.cluster, scipy.linalg, and elsewhere.
Stéfan expressed regret about including datasets in scikit-image. The main issue seems to be "bloat". Scikit-image is an image processing library, so the datasets used there are likely all images, and there is a minimum size for a sample image to be useful as an example. For scipy, we already have two images, and I don't think we'll need more. The newly added ECG dataset is 116K (which is less than the existing image datasets: "ascent.dat" is 515K and "face.dat" is 1.5M). The potential datasets that I mentioned above (Hald, iris, sunspots) are all very small. If we are conservative about what we include, and focus on datasets chosen specifically to demonstrate scipy functionality, we should be able to avoid dataset bloat.
This leads to my proposal for the criteria for adding a dataset:
(a) Not too big. The size of a dataset should not exceed $MAX (but I don't have a good suggestion for what $MAX should be at the moment).
(b) The dataset should be well-known, where "well-known" means that the dataset is one that is already widely used as an example and many people will know it by name (e.g. the iris dataset), or the dataset is a sample of a common signal type or format (e.g. an ECG signal, or an image such as misc.ascent).
(c) We actually *use* the dataset in one of *our* docstrings or tutorials. I don't think our datasets package should become a repository of interesting scientific data with no connection to the scipy code. Its purpose should be to enrich our documentation. (Note that by this criterion, the recently added ECG signal would not qualify!)
To summarize: I'm in favor of scipy.datasets, a conservatively curated subpackage containing well-known datasets.
There are also some standard functions used for testing optimization. I wonder if it would be reasonable to make those public?
Chuck
_______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org https://mail.python.org/mailman/listinfo/scipy-dev
-- Matt Haberland Assistant Adjunct Professor in the Program in Computing Department of Mathematics 6617A Math Sciences Building, UCLA
On Fri, 2018-04-06 at 12:08 -0700, Matt Haberland wrote:
I wanted to extend Charles' comment that in addition to the test problems there are optimization *benchmarks*. For instance, in scipy/benchmarks/benchmarks/linprog_benchmark_files there are ~90 benchmark problems (all NETLIB LP benchmarks as .npz files) totaling ~12MB. The current linprog benchmark only uses two of them by default. Sounds like these should be moved if space is such a concern.
Note that "pip install scipy" does not install those files, so it is of less concern for deployment on resource-limited machines. Pauli
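For anyone unfamiliar with the format, those benchmark problems are ordinary NumPy .npz archives. Below is a hedged sketch of how a benchmark runner might load one and hand it to linprog; the key names (c, A_ub, b_ub) are an assumption for illustration, not the actual layout of the files in linprog_benchmark_files, so a tiny stand-in problem is written in memory first.

```python
# Round-trip a toy LP through the .npz format, the way a benchmark
# runner might load a stored problem. Key names are assumed.
import io
import numpy as np
from scipy.optimize import linprog

# Write a stand-in problem: minimize -x - y subject to x + y <= 1, x, y >= 0.
buf = io.BytesIO()
np.savez(buf, c=np.array([-1.0, -1.0]),
         A_ub=np.array([[1.0, 1.0]]), b_ub=np.array([1.0]))
buf.seek(0)

# Load it back and solve.
data = np.load(buf)
res = linprog(c=data["c"], A_ub=data["A_ub"], b_ub=data["b_ub"])
print(res.fun)  # optimal objective, -1.0 for this toy problem
```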
participants (2): Matt Haberland, Pauli Virtanen