[SciPy-Dev] New subpackage: scipy.data

Mark Alexander Mikofski mikofski at berkeley.edu
Fri Mar 30 15:23:22 EDT 2018


I agree with what's above. Basically: (1) move small datasets to a centralized
scipy.datasets for testing, demos, docs, and short examples, and (2) move
large, realistic datasets to a shared repo or a common site like rdatasets,
and explain in the docs how to retrieve them. Longer tutorials that use the
large datasets could be in Jupyter notebooks, for example.
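
As a rough sketch of the two-tier split described above: small files are read
straight from a directory shipped with the package, while large files are
downloaded once, checked, and cached. All names here (load_bundled,
fetch_remote, CACHE_DIR) are hypothetical and not an existing SciPy API, and
the SHA-256 check is an assumption about what such a fetcher would want.

```python
import hashlib
import tempfile
import urllib.request
from pathlib import Path, PurePosixPath
from urllib.parse import urlsplit

# Hypothetical cache location; a real package would likely make this configurable.
CACHE_DIR = Path(tempfile.gettempdir()) / "example_dataset_cache"


def load_bundled(name, data_dir):
    """Tier 1: read a small file that ships inside the package (no network)."""
    return (Path(data_dir) / name).read_bytes()


def fetch_remote(url, sha256, cache_dir=CACHE_DIR):
    """Tier 2: download a large file once, verify it, and reuse the cached copy."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Name the cached file after the last component of the URL path.
    target = cache_dir / PurePosixPath(urlsplit(url).path).name
    if not target.exists():
        with urllib.request.urlopen(url) as response:
            target.write_bytes(response.read())
    # Verify on every call so a corrupted cache entry is not silently reused.
    digest = hashlib.sha256(target.read_bytes()).hexdigest()
    if digest != sha256:
        target.unlink()
        raise ValueError(f"checksum mismatch for {url}")
    return target
```

With this shape, doc builds and unit tests only ever call load_bundled, so
they never need network access; only the longer tutorials call fetch_remote.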

On Mar 30, 2018 7:30 AM, <josef.pktd at gmail.com> wrote:

On Fri, Mar 30, 2018 at 9:54 AM, Eric Larson <larson.eric.d at gmail.com>
wrote:
>>> It depends on the scale where this should go.
>
> In this particular case ("scipy.signal currently has no useful realistic
> signals"), if we add the proposed ~100 kB data file, I suspect that we can
> greatly enhance a large number of our scipy.signal examples. An ECG signal
> won't be perfect for all of them, but in many cases it will be a lot better
> and more instructive for users than what we can currently synthesize
> ourselves (while keeping synthesis sufficiently simple at least).
>
> Compared to a general dataset-fetching utility, the in-repo approach has
> clear disadvantages in terms of being incomplete and adding to repo size.
> Its advantages are in terms of simplifying doc building, access,
> maintenance, uniformity of functionality (benchmarks, Debian unit tests,
> doc building, etc.). On balance, this makes it worth having IMO.
>
>> For example, a dataset package also runs into the problem how much to
>> include.
>
>
> A proposed rule of thumb: SciPy can have (up to) a couple of small-sized
> files per module shipped with the repo in cases where such files greatly
> improve our ability to showcase/test/document functionality
> (benchmarks/unit tests/docstrings). This forces us to make subjective
> judgments about what will be sufficiently useful, sufficiently small, and
> sufficiently impactful for the module, but I think this will be a rare
> enough phenomenon that it's okay.
>
> In other words, I propose that scipy.datasets not provide an exhaustive or
> even extensive resource of data for users, but rather a minimal one for
> showcasing functionality. This seems consistent with what we already do
> with ascent/face, in that they improve the image-processing examples.
>
>> We've been doing this in scikit-image for a long time, and now regret
>> having any binary data in the repository
>
>
> I have had a similar problem while maintaining MNE-Python, which has some
> files in the repo and others in a GitHub repo (downloaded separately for
> testing). I have a similar feeling about the files that live in the repo
> today. However, for SciPy the problem seems a bit different in scope and
> scale -- a handful of small files can go a long way for SciPy, which isn't
> the case for MNE (and I would assume also many functions in scikit-image).
>
>> both scikit-learn and scikit-image use access to larger datasets.
>
>
> There are other projects that also do this (MNE has huge ones hosted on
> osf.io, VisPy hosts data on GitHub). It would be awesome if someone
> unified all this stuff for cases where you want to deal with getting large
> datasets, or many different datasets.


just to say:
I agree with all of this, and think it is a very good summary of the issues

Josef


>
> My 2c,
> Eric
>
>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at python.org
> https://mail.python.org/mailman/listinfo/scipy-dev
>