[SciPy-Dev] New subpackage: scipy.data

Pauli Virtanen pav at iki.fi
Fri Apr 6 16:46:57 EDT 2018


On Fri, 2018-04-06 at 12:08 -0700, Matt Haberland wrote:
> I wanted to extend Charles' comment: in addition to the test
> problems, there are optimization *benchmarks*. For instance, in
> scipy/benchmarks/benchmarks/linprog_benchmark_files there are ~90
> benchmark problems (all the NETLIB LP benchmarks, as .npz files)
> totaling ~12MB. The current linprog benchmark only uses two of them
> by default. Sounds like these should be moved if space is such a
> concern.
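> 
> For a sense of scale, running one of those problems could look like
> the sketch below (assuming the archives store the standard linprog
> arguments under matching keys; I haven't checked the actual file
> layout, and "AFIRO.npz" is just the name of one small NETLIB problem):
> 
>     import numpy as np
>     from scipy.optimize import linprog
> 
>     # Hypothetical layout: keys named after linprog's arguments.
>     data = np.load("AFIRO.npz")
>     res = linprog(data["c"], A_eq=data["A_eq"], b_eq=data["b_eq"])
>     print(res.status, res.fun)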

Note that "pip install scipy" does not install those files, so this is
of less concern for deployment on resource-limited machines.

	Pauli



> 
> On Wed, Apr 4, 2018 at 7:54 AM, Charles R Harris
> <charlesr.harris at gmail.com> wrote:
> 
> > 
> > 
> > On Mon, Apr 2, 2018 at 12:50 PM, Warren Weckesser
> > <warren.weckesser at gmail.com> wrote:
> > 
> > > 
> > > 
> > > On Fri, Mar 30, 2018 at 8:17 PM, Ralf Gommers
> > > <ralf.gommers at gmail.com> wrote:
> > > 
> > > > 
> > > > 
> > > > On Fri, Mar 30, 2018 at 12:03 PM, Eric Larson
> > > > <larson.eric.d at gmail.com> wrote:
> > > > 
> > > > > > Top-level module for them alone sounds overkill, and I'm not
> > > > > > sure if discoverability alone is enough.
> > > > > 
> > > > > Fine by me. And if we follow the idea that these should be
> > > > > added sparingly, we can maintain discoverability without it
> > > > > growing out of hand by populating the See Also sections of each
> > > > > function.
> > > > > 
> > > > 
> > > > I agree with this; the 2 images and 1 ECG signal (to be added)
> > > > that we have don't justify a top-level module. We don't want to
> > > > grow more than the absolute minimum of datasets. The package is
> > > > already very large, which is problematic in certain cases. E.g.
> > > > numpy + scipy still fits in the AWS Lambda limit of 50 MB, but
> > > > there's not much margin.
> > > > 
> > > > 
> > > 
> > > Note: this is a reply to the thread, and not specifically to
> > > Ralf's comments (but those are included).
> > > 
> > > After reading all the replies, the first question that comes to
> > > mind is: should SciPy have *any* datasets?
> > > 
> > > I think this question has already been answered: we have had
> > > functions that return images in scipy.misc for a long time, and I
> > > don't recall anyone ever suggesting that these be removed. (Well,
> > > there was lena(), but I don't think anyone had a problem with
> > > adding a replacement image.) And the pull request for the ECG
> > > dataset has been merged (added to scipy.misc), so there is current
> > > support among the developers for providing datasets.
> > > 
> > > So the remaining questions are:
> > >    (1) Where do the datasets reside?
> > >    (2) What are the criteria for adding a new dataset?
> > > 
> > > Here's my 2¢:
> > > 
> > > (1) Where do the datasets reside?
> > > 
> > > My preference is to keep all the datasets in the top-level module
> > > scipy.datasets. Robert preferred this module for discoverability,
> > > and I agree. By having all the datasets in one place, anyone can
> > > easily see what is available. Teachers and others developing
> > > educational material know where to find source material for
> > > examples. Developers, too, can easily look for examples to use in
> > > our docstrings or tutorials. (By the way, adding examples to the
> > > docstrings of all functions is an ongoing effort:
> > > https://github.com/scipy/scipy/issues/7168.)
> > > 
> > > Also, there are many well-known datasets that could be used as
> > > examples for multiple scipy packages. For a concrete example, a
> > > dataset that I could see adding to scipy is the Hald cement
> > > dataset. SciPy should eventually have an implementation of the PCA
> > > decomposition, and it could conceivably live in scipy.linalg. It
> > > would be reasonable to use the Hald data in the docstrings of the
> > > new PCA function(s) (cf.
> > > https://www.mathworks.com/help/stats/pca.html). At the same time,
> > > the Hald data could enrich the docstrings of some functions in
> > > scipy.stats.
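> > > 
> > > Such a docstring example could be as small as the sketch below,
> > > computing a PCA via scipy.linalg.svd (the array here is a random
> > > stand-in, not the actual Hald measurements):
> > > 
> > >     import numpy as np
> > >     from scipy import linalg
> > > 
> > >     # Stand-in for the Hald data: 13 observations x 4 ingredients.
> > >     X = np.random.default_rng(0).random((13, 4))
> > > 
> > >     # PCA as the SVD of the mean-centered data matrix.
> > >     Xc = X - X.mean(axis=0)
> > >     U, s, Vt = linalg.svd(Xc, full_matrices=False)
> > >     components = Vt               # principal axes, one per row
> > >     scores = U * s                # observations in PCA coordinates
> > >     var_explained = s**2 / (len(X) - 1)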
> > > 
> > > Similarly, Fisher's iris dataset provides a well-known example
> > > that could be used in docstrings in both scipy.cluster and
> > > scipy.stats.
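> > > 
> > > For scipy.cluster, that might look like this sketch (again with a
> > > random stand-in array rather than Fisher's actual measurements):
> > > 
> > >     import numpy as np
> > >     from scipy.cluster.vq import kmeans, vq, whiten
> > > 
> > >     # Stand-in for iris: 150 samples x 4 features.
> > >     obs = np.random.default_rng(1).random((150, 4))
> > > 
> > >     features = whiten(obs)               # unit variance per feature
> > >     centroids, distortion = kmeans(features, 3)
> > >     labels, _ = vq(features, centroids)  # cluster index per sample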
> > > 
> > > 
> > > (2) What are the criteria for adding a new dataset?
> > > 
> > > So far, the only compelling reason I can see to even have datasets
> > > is to have interesting examples in the docstrings (or at least in
> > > our tutorials). For example, the docstrings for
> > > scipy.ndimage.gaussian_filter and several other transformations in
> > > ndimage use the image returned by scipy.misc.ascent():
> > > 
> > >     https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.gaussian_filter.html
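> > > 
> > > The example there amounts to a couple of lines:
> > > 
> > >     from scipy import misc, ndimage
> > > 
> > >     ascent = misc.ascent()  # built-in 512x512 grayscale test image
> > >     blurred = ndimage.gaussian_filter(ascent, sigma=5)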
> > > 
> > > I could see the benefit of having well-known datasets such as
> > > Fisher's iris data, the Hald cement data, and some version of a
> > > sunspot activity time series, to be used in the docstrings in
> > > scipy.stats, scipy.signal, scipy.cluster, scipy.linalg, and
> > > elsewhere.
> > > 
> > > Stéfan expressed regret about including datasets in scikit-image.
> > > The main issue seems to be "bloat". Scikit-image is an image
> > > processing library, so the datasets used there are likely all
> > > images, and there is a minimum size for a sample image to be
> > > useful as an example. For scipy, we already have two images, and I
> > > don't think we'll need more. The newly added ECG dataset is 116K
> > > (which is less than the existing image datasets: "ascent.dat" is
> > > 515K and "face.dat" is 1.5M). The potential datasets that I
> > > mentioned above (Hald, iris, sunspots) are all very small. If we
> > > are conservative about what we include, and focus on datasets
> > > chosen specifically to demonstrate scipy functionality, we should
> > > be able to avoid dataset bloat.
> > > 
> > > This leads to my proposal for the criteria for adding a dataset:
> > > 
> > > (a) Not too big. The size of a dataset should not exceed $MAX (but
> > > I don't have a good suggestion for what $MAX should be at the
> > > moment).
> > > (b) The dataset should be well-known, where "well-known" means
> > > that the dataset is one that is already widely used as an example
> > > and many people will know it by name (e.g. the iris dataset), or
> > > the dataset is a sample of a common signal type or format (e.g. an
> > > ECG signal, or an image such as misc.ascent).
> > > (c) We actually *use* the dataset in one of *our* docstrings or
> > > tutorials. I don't think our datasets package should become a
> > > repository of interesting scientific data with no connection to
> > > the scipy code. Its purpose should be to enrich our documentation.
> > > (Note that by this criterion, the recently added ECG signal would
> > > not qualify!)
> > > 
> > > To summarize: I'm in favor of scipy.datasets, a conservatively
> > > curated subpackage containing well-known datasets.
> > > 
> > > 
> > 
> > There are also some standard functions used for testing
> > optimization. I wonder if it would be reasonable to make those
> > public?
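> > 
> > Some already are; for instance, scipy.optimize exposes the
> > Rosenbrock test function and its derivatives (rosen, rosen_der,
> > rosen_hess), e.g.:
> > 
> >     from scipy.optimize import minimize, rosen, rosen_der
> > 
> >     x0 = [1.3, 0.7, 0.8, 1.9, 1.2]
> >     res = minimize(rosen, x0, jac=rosen_der, method="BFGS")
> >     print(res.x)  # converges to the minimum at [1, 1, 1, 1, 1]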
> > 
> > Chuck
> > 


