[SciPy-Dev] New subpackage: scipy.data

Charles R Harris charlesr.harris at gmail.com
Wed Apr 4 10:54:41 EDT 2018


On Mon, Apr 2, 2018 at 12:50 PM, Warren Weckesser <
warren.weckesser at gmail.com> wrote:

>
>
> On Fri, Mar 30, 2018 at 8:17 PM, Ralf Gommers <ralf.gommers at gmail.com>
> wrote:
>
>>
>>
>> On Fri, Mar 30, 2018 at 12:03 PM, Eric Larson <larson.eric.d at gmail.com>
>> wrote:
>>
>>> Top-level module for them alone sounds overkill, and I'm not sure if
>>>> discoverability alone is enough.
>>>>
>>>
>>> Fine by me. And if we follow the idea that these should be added
>>> sparingly, we can maintain discoverability without it growing out of
>>> hand by populating the See Also sections of each function.
>>>
>>
>> I agree with this; the two images and one ECG signal (to be added) that
>> we have don't justify a top-level module. We don't want to grow beyond
>> the absolute minimum of datasets. The package is already very large,
>> which is problematic in certain cases. For example, numpy + scipy still
>> fits within the AWS Lambda limit of 50 MB, but there's not much margin.
>>
>>
>
> Note: this is a reply to the thread, and not specifically to Ralf's
> comments (but those are included).
>
> After reading all the replies, the first question that comes to mind is:
> should SciPy have *any* datasets?
>
> I think this question has already been answered: we have had functions
> that return images in scipy.misc for a long time, and I don't recall anyone
> ever suggesting that these be removed.  (Well, there was lena(), but I
> don't think anyone had a problem with adding a replacement image.)  And the
> pull request for the ECG dataset has been merged (added to scipy.misc), so
> there is current support among the developers for providing datasets.
>
> So the remaining questions are:
>    (1) Where do the datasets reside?
>    (2) What are the criteria for adding a new dataset?
>
> Here's my 2¢:
>
> (1) Where do the datasets reside?
>
> My preference is to keep all the datasets in the top-level module
> scipy.datasets. Robert preferred this module for discoverability, and I
> agree.  By having all the datasets in one place, anyone can easily see what
> is available.  Teachers and others developing educational material know
> where to find source material for examples.  Developers, too, can easily
> look for examples to use in our docstrings or tutorials. (By the way,
> adding examples to the docstrings of all functions is an ongoing effort:
> https://github.com/scipy/scipy/issues/7168.)
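>
> To make the proposal concrete, here is a sketch of what the subpackage's
> API could look like. The loader names are hypothetical; only the ascent
> image and the ECG signal exist today, and they live in scipy.misc:
>
>     from scipy import datasets
>
>     img = datasets.ascent()             # 512x512 grayscale test image
>     ecg = datasets.electrocardiogram()  # the newly merged ECG signal
>     iris = datasets.iris()              # candidate: Fisher's iris data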
>
> Also, there are many well-known datasets that could be used as examples
> for multiple scipy packages.  For a concrete example, a dataset that I
> could see adding to scipy is the Hald cement dataset.  SciPy should
> eventually have an implementation of the PCA decomposition, and it could
> conceivably live in scipy.linalg.  It would be reasonable to use the Hald
> data in the docstrings of the new PCA function(s) (cf.
> https://www.mathworks.com/help/stats/pca.html).  At the same time, the
> Hald data could enrich the docstrings of some functions in scipy.stats.
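>
> For example, a PCA of the Hald data via the SVD, using only existing
> scipy.linalg machinery (the hald() loader is hypothetical; assume it
> returns the 13x4 array of cement ingredient percentages):
>
>     import numpy as np
>     from scipy import linalg
>
>     X = hald()                       # hypothetical loader, shape (13, 4)
>     Xc = X - X.mean(axis=0)          # center each column
>     U, s, Vt = linalg.svd(Xc, full_matrices=False)
>     scores = U * s                   # projections onto principal axes
>     explained = s**2 / np.sum(s**2)  # fraction of variance per component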
>
> Similarly, Fisher's iris dataset provides a well-known example that could
> be used in docstrings in both scipy.cluster and scipy.stats.
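>
> A docstring-sized example along those lines, clustering the iris
> measurements with the existing scipy.cluster.vq functions (again, the
> iris() loader is hypothetical; assume a (150, 4) array of measurements):
>
>     from scipy.cluster.vq import kmeans, vq, whiten
>
>     data = whiten(iris())            # scale features to unit variance
>     centroids, _ = kmeans(data, 3)   # three species, three clusters
>     labels, _ = vq(data, centroids)  # assign samples to nearest centroid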
>
>
> (2) What are the criteria for adding a new dataset?
>
> So far, the only compelling reason I can see to even have datasets is to
> have interesting examples in the docstrings (or at least in our
> tutorials).  For example, the docstring for scipy.ndimage.gaussian_filter
> and several other transformations in ndimage use the image returned by
> scipy.misc.ascent():
>
>     https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.gaussian_filter.html
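>
> That docstring example boils down to a couple of lines, using APIs that
> already exist:
>
>     from scipy import misc, ndimage
>
>     ascent = misc.ascent()      # 512x512 grayscale test image
>     blurred = ndimage.gaussian_filter(ascent, sigma=5)  # Gaussian smoothing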
>
> I could see the benefit of having well-known datasets such as Fisher's
> iris data, the Hald cement data, and some version of a sunspot activity
> time series, to be used in the docstrings in scipy.stats, scipy.signal,
> scipy.cluster, scipy.linalg, and elsewhere.
>
> Stéfan expressed regret about including datasets in scikit-image.  The main
> issue seems to be "bloat".  Scikit-image is an image processing library, so
> the datasets used there are likely all images, and there is a minimum size
> for a sample image to be useful as an example.  For scipy, we already have
> two images, and I don't think we'll need more.  The newly added ECG dataset
> is 116K (which is less than the existing image datasets: "ascent.dat" is
> 515K and "face.dat" is 1.5M).  The potential datasets that I mentioned
> above (Hald, iris, sunspots) are all very small.  If we are conservative
> about what we include, and focus on datasets chosen specifically to
> demonstrate scipy functionality, we should be able to avoid dataset bloat.
>
> This leads to my proposal for the criteria for adding a dataset:
>
> (a) Not too big.  The size of a dataset should not exceed $MAX (but I
> don't have a good suggestion for what $MAX should be at the moment).
> (b) The dataset should be well-known, where "well-known" means that the
> dataset is one that is already widely used as an example and many people
> will know it by name (e.g. the iris dataset), or the dataset is a sample of
> a common signal type or format (e.g. an ECG signal, or an image such as
> misc.ascent).
> (c) We actually *use* the dataset in one of *our* docstrings or
> tutorials.  I don't think our datasets package should become a repository
> of interesting scientific data with no connection to the scipy code.  Its
> purpose should be to enrich our documentation.  (Note that by this
> criterion, the recently added ECG signal would not qualify!)
>
> To summarize: I'm in favor of scipy.datasets, a conservatively curated
> subpackage containing well-known datasets.
>
>
There are also some standard functions used for testing optimization
routines. I wonder if it would be reasonable to make those public?
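
For what it's worth, the Rosenbrock function and its derivatives are
already public (scipy.optimize.rosen, rosen_der, rosen_hess), while most
of the others live in the private benchmark suite. A minimal example of
the kind of use these test functions get:

    from scipy.optimize import minimize, rosen, rosen_der

    # Minimize the Rosenbrock function starting from a standard test point.
    x0 = [1.3, 0.7, 0.8, 1.9, 1.2]
    res = minimize(rosen, x0, jac=rosen_der, method='BFGS')
    print(res.x)   # converges to the minimizer [1, 1, 1, 1, 1]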

Chuck