[SciPy-Dev] New subpackage: scipy.data
josef.pktd at gmail.com
Thu Mar 29 19:44:18 EDT 2018
On Thu, Mar 29, 2018 at 7:16 PM, Stefan van der Walt
<stefanv at berkeley.edu> wrote:
> On Thu, 29 Mar 2018 18:54:52 -0400, Warren Weckesser wrote:
>> Can you summarize the problems that make you regret including the
>> data?
>
> - The size of the repository (extra time on each clone, and that for
> data that isn't necessary in most use cases)
>
> - Artificial limit on data sizes: we now have a default place to store
> data, but we still need an additional mechanism for larger datasets.
> How do you choose the threshold for what goes in, what is too big?
>
> - Because these tiny embedded datasets are easily available, they become
> the default for demos. If data is stored externally, realistic
> examples become more feasible and likely.
In statsmodels we included datasets from the beginning, both for unit tests
and for examples. By today's standards these are almost all tiny datasets.
The advantage is that many of them are old textbook datasets that often
illustrate a problem that we can run into, while clean, randomly generated
data is often boring.
Unit tests don't have access to the internet on Debian, so there is still
the restriction of using either internal data or random data.
For notebooks we now often rely on downloading from `rdatasets`, or
even on having the user download a zip file if the license situation is not
clear, e.g. downloading from the supplementary material of books.
About tools for downloading datasets:
We have a helper function to download from rdatasets and a helper
function to download Stata files from the internet. Essentially all
other datasets are handled by pandas.
It's a simpler case for statsmodels because all datasets essentially
correspond to a csv file that might be stored in another format.
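To make the "one dataset is roughly one csv file" case concrete, here is a minimal sketch of a download helper with a local cache. It is not the statsmodels implementation: the function names (`rdataset_url`, `fetch_text`, `parse_csv`) are hypothetical, the base URL is my assumption about where the Rdatasets csv files live, and the standard-library `csv` module stands in for pandas to keep the sketch dependency-free.

```python
import csv
import io
import urllib.request
from pathlib import Path

# Assumption: location of the Rdatasets csv files (not stated in this thread).
RDATASETS_BASE = (
    "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv"
)


def rdataset_url(name, package="datasets"):
    """Build the URL of one Rdatasets csv file (hypothetical helper)."""
    return f"{RDATASETS_BASE}/{package}/{name}.csv"


def fetch_text(url, cache_dir=None, opener=urllib.request.urlopen):
    """Return the text behind `url`, reusing a cached copy when one exists.

    `opener` is injectable so that tests can run without internet access.
    """
    cache_file = None
    if cache_dir is not None:
        # Flatten the URL into a single file name inside the cache directory.
        cache_file = Path(cache_dir) / url.replace("://", "_").replace("/", "_")
        if cache_file.exists():
            return cache_file.read_text(encoding="utf-8")
    with opener(url) as response:
        text = response.read().decode("utf-8")
    if cache_file is not None:
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        cache_file.write_text(text, encoding="utf-8")
    return text


def parse_csv(text):
    """Parse csv text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))
```

Because the opener can be swapped out, the same helper can be exercised in a unit test with an in-memory file, which matters when the test environment has no network access. With pandas available, `parse_csv` would typically be replaced by `pandas.read_csv`.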
Josef
>
> Best regards
> Stéfan
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at python.org
> https://mail.python.org/mailman/listinfo/scipy-dev