I also think that at most small datasets should be included in scipy directly. For online storage, scipy would be better off following some other packages; Stefan mentions some attempts to arrive at a common format. AFAIK (without being fully up to date), both scikit-learn and scikit-image already provide access to larger datasets stored online.

A dataset package also runs into the problem of how much to include. I wouldn't install a dataset package with a few gigabytes of data if I'm only interested in the tiny fraction that is relevant to my examples. (I'm not into analyzing images, movies or BIG DATA.)

Josef

On Thu, Mar 29, 2018 at 8:10 PM, Ilhan Polat <ilhanpolat@gmail.com> wrote:
Yes, that's true, but GitHub seems like a robust place for the data to live. Otherwise we can just point to any hardcoded URL. If the size grows, though, keeping the data within SciPy doesn't seem viable, in terms of both wheel size and clone time. All of this depends on what the future of the datasets would be.
On Fri, Mar 30, 2018 at 2:03 AM, <josef.pktd@gmail.com> wrote:
On Thu, Mar 29, 2018 at 7:54 PM, Ilhan Polat <ilhanpolat@gmail.com> wrote:
Would a separate repo, scipy-datasets, help? Then something like

    try: import the dataset
    except: warn("I'm off to interwebz") and download from the repo

might be feasible. The download part could fetch either that particular dataset or a clone of the whole scipy-datasets repo.
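A minimal sketch of that fallback, assuming a hypothetical scipy-datasets repo on GitHub (the URL, cache location, and function names below are illustrative, not an existing API):

    import os
    import warnings
    from urllib.request import urlretrieve

    import numpy as np

    # Hypothetical location of one dataset inside a scipy-datasets repo.
    DATASET_URL = ("https://raw.githubusercontent.com/scipy/scipy-datasets/"
                   "master/face.npz")
    CACHE = os.path.expanduser("~/.cache/scipy-datasets/face.npz")

    def load_face():
        if not os.path.exists(CACHE):
            # No local copy yet: warn, then fetch it from the online repo.
            warnings.warn("I'm off to interwebz: downloading " + DATASET_URL)
            os.makedirs(os.path.dirname(CACHE), exist_ok=True)
            urlretrieve(DATASET_URL, CACHE)
        return np.load(CACHE)

Downloading only the requested file would keep the first-use cost small compared with cloning the whole repo.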
IMO:
It depends on the scale of what this should become. I don't think it's worth it (maintaining and installing another package or repo) for scipy, given that scipy is mostly a basic numerical library and not driven by specific applications.
For most areas there should already be online repos or packages, and it would be enough to have the accessing functions in scipy.datasets; a sketch of what such accessors could look like follows below. The only area I can think of where there might not be a readily available online source for datasets is signal processing.
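One way such accessing functions could look, with a registry of URLs at existing external sources plus a checksum (all names, URLs, and hashes here are placeholders, not a real scipy API):

    import hashlib
    import os
    from urllib.request import urlretrieve

    # Hypothetical registry: dataset name -> (external URL, expected sha256).
    _REGISTRY = {
        "ecg": ("https://example.org/signal/ecg.dat", "abc123placeholder"),
    }
    _CACHE_DIR = os.path.expanduser("~/.cache/scipy-datasets")

    def fetch(name):
        """Return a local path for dataset `name`, downloading on first use."""
        url, expected = _REGISTRY[name]
        path = os.path.join(_CACHE_DIR, os.path.basename(url))
        if not os.path.exists(path):
            os.makedirs(_CACHE_DIR, exist_ok=True)
            urlretrieve(url, path)
        # Verify the cached file so a corrupted download is caught early.
        with open(path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() != expected:
                raise IOError("checksum mismatch for dataset %r" % name)
        return path

Keeping only the registry in scipy itself would mean the package ships no data; each dataset is fetched from its existing home on first use.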
Josef
On Fri, Mar 30, 2018 at 1:16 AM, Stefan van der Walt <stefanv@berkeley.edu> wrote:
On Thu, 29 Mar 2018 18:54:52 -0400, Warren Weckesser wrote:
Can you summarize the problems that make you regret including the data?
- The size of the repository (extra time on each clone, for data that isn't needed in most use cases)
- Artificial limit on data sizes: we now have a default place to store data, but we still need an additional mechanism for larger datasets. How do you choose the threshold for what goes in, what is too big?
- Because these tiny embedded datasets are easily available, they become the default for demos. If data is stored externally, realistic examples become more feasible and likely.
Best regards,
Stéfan

_______________________________________________
SciPy-Dev mailing list
SciPy-Dev@python.org
https://mail.python.org/mailman/listinfo/scipy-dev