New subpackage: scipy.data
According to the SciPy roadmap (https://github.com/scipy/scipy/blob/master/doc/ROADMAP.rst.txt#misc), `scipy.misc` will eventually be removed. Currently, the combinatorial functions and the image-related operations are all deprecated. The only non-deprecated functions in `misc` are `central_diff_weights()`, `derivative()`, and the two functions that return image data: `ascent()` and `face()`.

As a step toward the deprecation of `misc`, I propose that we create a new package, `scipy.data`, for holding data sets. `ascent()` and `face()` would move there, and the new ECG data set proposed in a current pull request (https://github.com/scipy/scipy/pull/8627) would be put there.

An early version of the roadmap suggested moving the images to `scipy.ndimage`, but that is no longer in the text. I think a separate subpackage for data sets makes sense.

What do you think?

P.S. If there is already a similar proposal on the mailing list or on GitHub, or any other older mailing-list discussions related to this, let me know.
Sounds like a good plan to me. If others agree, I propose that we make the existing (and new) data available in scipy.data for 1.1.0. Maybe a two-release deprecation-warning cycle before removing them from `scipy.misc` (possibly removing everything, if we can move the other remaining things, too)?

Eric

On Thu, Mar 29, 2018 at 3:43 PM, Warren Weckesser <warren.weckesser@gmail.com> wrote:
I don't like the name scipy.data and would prefer something more explicit, like scipy.datasets. My first reaction to the proposal was that we now get a subpackage of functions for working with data: ndimage has functions for images, signal has functions for signals, spatial has functions for neighborhoods, so data would be functions for data (e.g. like pandas or similar).

Josef

On Thu, Mar 29, 2018 at 3:48 PM, Eric Larson <larson.eric.d@gmail.com> wrote:
+1

Gaël

On Thu, Mar 29, 2018 at 04:06:02PM -0400, josef.pktd@gmail.com wrote:
I also agree -- if this is where datasets are to be kept and made available, then 'datasets' is probably a better name. That also agrees with the R package called 'datasets', and so might be considered a more 'customary' name.

On Thu, Mar 29, 2018 at 4:38 PM, Gael Varoquaux <gael.varoquaux@normalesup.org> wrote:
Is there a reason not to include those functions where they'll most likely be used? Meaning, the images `ascent` and `face` could move to `scipy.ndimage` and the signal `electrocardiogram` to `scipy.signal`?

Lars

On 29.03.2018 23:08, Bennet Fauber wrote:
On Thu, Mar 29, 2018 at 2:27 PM, Lars G. <lagru@mailbox.org> wrote:
Is there a reason not to include those functions where they'll most likely be used? Meaning, the images `ascent` and `face` could move to `scipy.ndimage` and the signal `electrocardiogram` to `scipy.signal`?

It would make them harder to discover, at least for me.

On the developer side, if everything is in one subpackage, it is easier to keep track of how many bytes are being consumed by data. The scipy.datasets namespace would also be a good place to put any common data-loading code (for instance, if we start adding large datasets that will be downloaded upon first use rather than being distributed in the scipy wheel).

-- Robert Kern
Would having a single datasets library increase visibility and potentially encourage the use of one dataset for multiple purposes? If they are roughly indexed, as the ones at the R datasets package site are, that could also be helpful for people who are finding their way to analytic capability via the catalog of examples. Someone looking for electrocardiogram might get led to signal that way, if that matters.

On Thu, Mar 29, 2018 at 5:34 PM, Robert Kern <robert.kern@gmail.com> wrote:
I agree with Eric: duplicating the existing ones in sp.datasets right away and placing the appropriate deprecation warnings seems like a good way to get rid of it.

On Thu, Mar 29, 2018 at 11:47 PM, Bennet Fauber <bennet@umich.edu> wrote:
On Thu, Mar 29, 2018 at 4:06 PM, <josef.pktd@gmail.com> wrote:
I don't like the name scipy.data and would prefer something more explicit like scipy.datasets.
`datasets` is fine with me, and based on the other responses so far, other folks have either implicitly or explicitly endorsed it, so let's go with it.

Warren
On Thu, 29 Mar 2018 15:43:50 -0400, Warren Weckesser wrote:
As a step toward the deprecation of `misc`, I propose that we create a new package, `scipy.data`, for holding data sets. `ascent()` and `face()` would move there, and the new ECG data set proposed in a current pull request (https://github.com/scipy/scipy/pull/8627) would be put there.

We've been doing this in scikit-image for a long time, and now regret having any binary data in the repository; we are working on a way of hosting it outside instead.

Can we standardize on downloader tools? There are examples in scikit-learn, dipy, and many other packages. We were thinking of a very lightweight spec + tools for solving this problem a while ago, but never got very far: https://github.com/data-pack/data-pack/pull/1/files

Best regards,
Stéfan
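For concreteness, the kind of lightweight spec Stéfan mentions could pair a registry of expected file hashes with a cache-aware fetch helper. A minimal sketch, with a hypothetical registry entry and placeholder URL/hash:

```python
import hashlib
import os
import urllib.request

# Hypothetical registry: file name -> (download URL, expected SHA256 digest).
REGISTRY = {
    "ecg.dat": ("https://example.org/scipy-data/ecg.dat",
                "<sha256 hex digest>"),
}

def fetch(name, cache_dir=os.path.expanduser("~/.cache/scipy-data")):
    """Download `name` on first use, verify its hash, and return a local path."""
    url, expected = REGISTRY[name]
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, name)
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest != expected:
        raise IOError("hash mismatch for %s; delete it and retry" % path)
    return path
```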
I concur with Stefan. Not having datasets in a package seems like the best way to go. There should be a separate go-to place for datasets (other than minimal ones for test cases). I would recommend branching off all datasets; otherwise we add to SciPy's already significant size.

On 30/03/2018 at 00:45, Stefan wrote:
On Thu, Mar 29, 2018 at 6:45 PM, Stefan van der Walt <stefanv@berkeley.edu> wrote:
We've been doing this in scikit-image for a long time, and now regret having any binary data in the repository;
Can you summarize the problems that make you regret including the data?

Warren
On Thu, 29 Mar 2018 18:54:52 -0400, Warren Weckesser wrote:
Can you summarize the problems that make you regret including the data?
- The size of the repository (extra time on each clone, and that for data that isn't necessary in most use cases).
- An artificial limit on data sizes: we now have a default place to store data, but we still need an additional mechanism for larger datasets. How do you choose the threshold for what goes in, what is too big?
- Because these tiny embedded datasets are easily available, they become the default for demos. If data is stored externally, realistic examples become more feasible and likely.

Best regards,
Stéfan
On Thu, Mar 29, 2018 at 7:16 PM, Stefan van der Walt <stefanv@berkeley.edu> wrote:
In statsmodels we included datasets from the beginning, both for unit tests and for examples. By today's standards these are almost all tiny datasets. The advantage is that many of them are old textbook datasets that often illustrate a problem that we can run into, while clean randomly generated data is often boring.

Unit tests don't have access to the internet on Debian, so there is still the restriction of either using internal data or random data. For notebooks we now often rely on downloading from `rdatasets`, or even on having the user download a zip file if the license situation is not clear, e.g. downloading from the supplementary material to books.

About tools for downloading datasets: we have a helper function to download from rdatasets and a helper function to download Stata files from the internet. Essentially all other datasets are handled by pandas. It's a simpler case for statsmodels because all datasets essentially correspond to a csv file that might be stored in another format.

Josef
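A minimal sketch of the rdatasets approach Josef describes, assuming the public Rdatasets GitHub mirror's csv/&lt;package&gt;/&lt;item&gt;.csv layout (pandas can read a CSV straight from a URL):

```python
import pandas as pd

def get_rdataset(package, item):
    # Rdatasets mirrors R's example data as CSV files on GitHub;
    # pandas reads the CSV directly from the URL.
    url = ("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/"
           "master/csv/{}/{}.csv".format(package, item))
    return pd.read_csv(url, index_col=0)

iris = get_rdataset("datasets", "iris")  # Fisher's iris data as a DataFrame
```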
Would a separate repo, scipy-datasets, help? Then something like

    try:
        <import the dataset>
    except ImportError:
        warn("I'm off to interwebz")
        <download from the repo>

might be feasible. The download part can fetch either that particular dataset or the whole scipy-datasets clone.

On Fri, Mar 30, 2018 at 1:16 AM, Stefan van der Walt <stefanv@berkeley.edu> wrote:
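A runnable version of the import-or-download fallback sketched above, assuming a hypothetical scipy_datasets companion package and repo (all names and URLs here are illustrative):

```python
import os
import urllib.request
import warnings

# Hypothetical raw-file URL for a scipy-datasets repo.
REPO_RAW = "https://raw.githubusercontent.com/scipy/scipy-datasets/master/"

def dataset_path(fname):
    """Return a local path for `fname`, downloading it on first use."""
    try:
        import scipy_datasets  # hypothetical companion package
        return os.path.join(os.path.dirname(scipy_datasets.__file__), fname)
    except ImportError:
        warnings.warn("scipy_datasets not installed; fetching %s" % fname)
        cache = os.path.expanduser("~/.scipy-datasets")
        os.makedirs(cache, exist_ok=True)
        local = os.path.join(cache, fname)
        if not os.path.exists(local):
            urllib.request.urlretrieve(REPO_RAW + fname, local)
        return local
```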
On Thu, Mar 29, 2018 at 7:54 PM, Ilhan Polat <ilhanpolat@gmail.com> wrote:
IMO it depends on the scale where this should go. I don't think it's worth it (maintaining and installing another package or repo) for scipy, given that scipy is mostly a basic numerical library and not driven by specific applications.

For most areas there should already be some online repos or packages, and it would be enough to have the accessing functions in scipy.datasets. The only area I can think of where there might not be a readily available online source for datasets is signal.

Josef
Yes, that's true, but GitHub seems like a robust place to live. Otherwise we can just point to any hardcoded URL. But if the size gets bigger in terms of wheels and cloning, keeping the data within SciPy doesn't seem to be a viable option. These all depend on what the future of the datasets would be.

On Fri, Mar 30, 2018 at 2:03 AM, <josef.pktd@gmail.com> wrote:
I also think that at most small datasets should be included in scipy directly. But I think that for online storage scipy would be better off following some other packages; Stefan mentions some attempts to get to a common format. AFAIK (without being up to date), both scikit-learn and scikit-image use access to larger datasets.

A dataset package, for example, also runs into the problem of how much to include. I wouldn't install a dataset package with a few gigabytes of data if I'm only interested in the tiny fraction needed for the examples that are relevant to me. (I'm not into analyzing images, movies, or BIG DATA.)

Josef

On Thu, Mar 29, 2018 at 8:10 PM, Ilhan Polat <ilhanpolat@gmail.com> wrote:
Including some datasets would also help make the scipy benchmarks more realistic. Right now the benchmarks use synthetic data (at least the signal benchmarks do).

Scott

On March 29, 2018 at 7:17:28 PM, Ilhan Polat (ilhanpolat@gmail.com) wrote:
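For concreteness, SciPy's benchmarks are run with airspeed velocity (asv), so a realistic-data benchmark could look roughly like this sketch; it assumes the ECG loader from PR 8627 and uses asv's `setup`/`time_*` conventions:

```python
from scipy import signal
from scipy.misc import electrocardiogram  # per PR 8627; location may change

class FilterECG:
    """asv-style benchmark running a filter over real ECG data."""

    def setup(self):
        self.ecg = electrocardiogram()  # ~5 minutes of ECG sampled at 360 Hz
        # 1-80 Hz band-pass, frequencies normalized to Nyquist (180 Hz)
        self.sos = signal.butter(4, [1.0 / 180.0, 80.0 / 180.0],
                                 btype="bandpass", output="sos")

    def time_sosfiltfilt(self):
        signal.sosfiltfilt(self.sos, self.ecg)
```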
It depends on the scale where this should go.

In this particular case ("scipy.signal currently has no useful realistic signals"), if we add the proposed ~100 kB data file, I suspect that we can greatly enhance a large number of our scipy.signal examples. An ECG signal won't be perfect for all of them, but in many cases it will be a lot better and more instructive for users than what we can currently synthesize ourselves (while keeping the synthesis sufficiently simple, at least).

Compared to a general dataset-fetching utility, the in-repo approach has clear disadvantages in terms of being incomplete and adding to repo size. Its advantages are in terms of simplifying doc building, access, maintenance, and uniformity of functionality (benchmarks, Debian unit tests, doc building, etc.). On the balance, this makes it worth having, IMO.

For example, a dataset package also runs into the problem how much to include.

A proposed rule of thumb: SciPy can have (up to) a couple of small-sized files per module shipped with the repo, in cases where such files greatly improve our ability to showcase/test/document functionality (benchmarks/unit tests/docstrings). This forces us to make subjective judgments about what will be sufficiently useful, sufficiently small, and sufficiently impactful for the module, but I think this will be a rare enough phenomenon that it's okay.

In other words, I propose that scipy.datasets not provide an *exhaustive* or even *extensive* resource of data for users, but rather a *minimal* one for showcasing functionality. This seems consistent with what we already do with ascent/face, in that they improve the image-processing examples.

We've been doing this in scikit-image for a long time, and now regret having any binary data in the repository

I have had a similar problem while maintaining MNE-Python, which has some files in the repo and others in a GitHub repo (downloaded separately for testing). I have a similar feeling about the files that live in the repo today. However, for SciPy the problem seems a bit different in scope and scale -- a handful of small files can go a long way for SciPy, which isn't the case for MNE (and I would assume also many functions in scikit-image).

both scikit-learn and scikit-image use access to larger datasets.

There are other projects that also do this (MNE has huge ones hosted on osf.io, VisPy hosts data on GitHub). It would be awesome if someone unified all this stuff for cases where you want to deal with getting large datasets, or many different datasets.

My 2c,
Eric
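As one example of the kind of docstring material Eric describes, a short filtering snippet becomes possible with the ~100 kB ECG file (a sketch assuming the `electrocardiogram` loader from PR 8627 and its 360 Hz sampling rate):

```python
from scipy import signal
from scipy.misc import electrocardiogram  # per PR 8627; may later move

ecg = electrocardiogram()   # real ECG recording, sampled at 360 Hz
fs = 360.0
# Baseline-wander removal: a 0.5 Hz high-pass filter. With synthetic data
# this step is hard to motivate; with a real recording it is obvious.
sos = signal.butter(4, 0.5 / (fs / 2), btype="highpass", output="sos")
filtered = signal.sosfiltfilt(sos, ecg)
```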
On Fri, Mar 30, 2018 at 9:54 AM, Eric Larson <larson.eric.d@gmail.com> wrote:
Just to say: I agree with all of this, and think it is a very good summary of the issues.

Josef
I agree with what's above. Basically: (1) move small datasets to a centralized scipy.datasets for testing, demos, docs, and short examples, and (2) move large, realistic datasets to a shared repo or a common site like rdatasets, and explain in the docs how to retrieve them. These longer tutorials could be in Jupyter notebooks, for example.

On Mar 30, 2018 7:30 AM, <josef.pktd@gmail.com> wrote:
Hi,

On Thu, 2018-03-29 at 15:43 -0400, Warren Weckesser wrote:
At first sight I think that these two functions (and the third one with the ECG signal sample) alone would sound more suitable to be placed in `scipy.ndimage` and `scipy.signal`. A top-level module for them alone sounds like overkill, and I'm not sure if discoverability alone is enough. From the rest of the thread, it appears that there is not a very clear picture of what else we would like to put in there.

For the downloadable datasets idea, I would not recommend doing anything that requires maintenance. Sorting out hosting issues is boring, and when done on a volunteer basis it always tends to fall on the same chumps. It's not really in the core mission of the project.

Note also the (probably mostly forgotten) numpy.DataSource:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.DataSource.html

Pauli
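numpy.DataSource is indeed a small existing utility for exactly this: it transparently downloads remote files into a local cache directory and opens them. A minimal usage sketch (the URL is illustrative):

```python
import numpy as np

# DataSource caches remote files under the given directory; open() accepts
# local paths or URLs and handles gzip/bzip2 compression transparently.
ds = np.DataSource("/tmp/scipy-data-cache")
with ds.open("https://example.org/datasets/ecg.csv") as f:  # illustrative URL
    data = np.loadtxt(f, delimiter=",")
```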
Top-level module for them alone sounds overkill, and I'm not sure if discoverability alone is enough.
Fine by me. And if we follow the idea that these should be added sparingly, we can maintain discoverability without it growing out of hand by populating the See Also sections of each function.

Eric
On Fri, Mar 30, 2018 at 12:03 PM, Eric Larson <larson.eric.d@gmail.com> wrote:
Top-level module for them alone sounds overkill, and I'm not sure if discoverability alone is enough.

Fine by me. And if we follow the idea that these should be added sparingly, we can maintain discoverability without it growing out of hand by populating the See Also sections of each function.
I agree with this; the 2 images and 1 ECG signal (to be added) that we have don't justify a top-level module. We don't want to grow more than the absolute minimum of datasets. The package is already very large, which is problematic in certain cases. E.g. numpy + scipy still fits in the AWS Lambda limit of 50 MB, but there's not much margin.

Ralf
On Fri, Mar 30, 2018 at 8:17 PM, Ralf Gommers <ralf.gommers@gmail.com> wrote:
Note: this is a reply to the thread, and not specifically to Ralf's comments (but those are included).

After reading all the replies, the first question that comes to mind is: should SciPy have *any* datasets?

I think this question has already been answered: we have had functions that return images in scipy.misc for a long time, and I don't recall anyone ever suggesting that these be removed. (Well, there was lena(), but I don't think anyone had a problem with adding a replacement image.) And the pull request for the ECG dataset has been merged (added to scipy.misc), so there is current support among the developers for providing datasets.

So the remaining questions are: (1) Where do the datasets reside? (2) What are the criteria for adding a new dataset?

Here's my 2¢:

(1) Where do the datasets reside?

My preference is to keep all the datasets in the top-level module scipy.datasets. Robert preferred this module for discoverability, and I agree. By having all the datasets in one place, anyone can easily see what is available. Teachers and others developing educational material know where to find source material for examples. Developers, too, can easily look for examples to use in our docstrings or tutorials. (By the way, adding examples to the docstrings of all functions is an ongoing effort: https://github.com/scipy/scipy/issues/7168.)

Also, there are many well-known datasets that could be used as examples for multiple scipy packages. For a concrete example, a dataset that I could see adding to scipy is the Hald cement dataset. SciPy should eventually have an implementation of the PCA decomposition, and it could conceivably live in scipy.linalg. It would be reasonable to use the Hald data in the docstrings of the new PCA function(s) (cf. https://www.mathworks.com/help/stats/pca.html). At the same time, the Hald data could enrich the docstrings of some functions in scipy.stats.

Similarly, Fisher's iris dataset provides a well-known example that could be used in docstrings in both scipy.cluster and scipy.stats.

(2) What are the criteria for adding a new dataset?

So far, the only compelling reason I can see to even have datasets is to have interesting examples in the docstrings (or at least in our tutorials). For example, the docstring for scipy.ndimage.gaussian_filter and several other transformations in ndimage use the image returned by scipy.misc.ascent():
https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.gaussian_filter.html

I could see the benefit of having well-known datasets such as Fisher's iris data, the Hald cement data, and some version of a sunspot activity time series, to be used in the docstrings in scipy.stats, scipy.signal, scipy.cluster, scipy.linalg, and elsewhere.

Stéfan expressed regret about including datasets in scikit-image. The main issue seems to be "bloat". Scikit-image is an image processing library, so the datasets used there are likely all images, and there is a minimum size for a sample image to be useful as an example. For scipy, we already have two images, and I don't think we'll need more. The newly added ECG dataset is 116K (which is less than the existing image datasets: "ascent.dat" is 515K and "face.dat" is 1.5M). The potential datasets that I mentioned above (Hald, iris, sunspots) are all very small. If we are conservative about what we include, and focus on datasets chosen specifically to demonstrate scipy functionality, we should be able to avoid dataset bloat.

This leads to my proposal for the criteria for adding a dataset:

(a) Not too big. The size of a dataset should not exceed $MAX (but I don't have a good suggestion for what $MAX should be at the moment).

(b) The dataset should be well-known, where "well-known" means that the dataset is one that is already widely used as an example and many people will know it by name (e.g. the iris dataset), or the dataset is a sample of a common signal type or format (e.g. an ECG signal, or an image such as misc.ascent).

(c) We actually *use* the dataset in one of *our* docstrings or tutorials. I don't think our datasets package should become a repository of interesting scientific data with no connection to the scipy code. Its purpose should be to enrich our documentation. (Note that by this criterion, the recently added ECG signal would not qualify!)

To summarize: I'm in favor of scipy.datasets, a conservatively curated subpackage containing well-known datasets.

Warren

P.S. I should add that I'm not in favor of putting code in scipy that fetches data from the web. That type of data retrieval could be useful, but it seems more appropriate for a package that is independent of scipy.
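To make the PCA example concrete: SciPy has no PCA function today, but here is a minimal sketch of the kind of helper Warren envisions, exercised on a small stand-in array (a real docstring would use the 13x4 Hald cement data):

```python
import numpy as np
from scipy import linalg

def pca(x):
    """Hypothetical PCA helper: principal directions + explained variance."""
    centered = x - x.mean(axis=0)
    # SVD of the centered data; the rows of vt are the principal directions.
    u, s, vt = linalg.svd(centered, full_matrices=False)
    explained_variance = s**2 / (x.shape[0] - 1)
    return vt, explained_variance

rng = np.random.RandomState(0)
x = rng.randn(13, 4)               # stand-in for the 13x4 Hald cement data
components, variance = pca(x)
```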
On Mon, Apr 2, 2018 at 12:50 PM, Warren Weckesser <warren.weckesser@gmail.com> wrote:
There are also some standard functions used for testing optimization. I wonder if it would be reasonable to make those public?

Chuck
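For context, scipy.optimize already exposes one such function publicly: the Rosenbrock test function and its derivatives (`rosen`, `rosen_der`, `rosen_hess`), e.g.:

```python
from scipy import optimize

x0 = [1.3, 0.7, 0.8, 1.9, 1.2]
res = optimize.minimize(optimize.rosen, x0, method="BFGS",
                        jac=optimize.rosen_der)
print(res.x)  # converges to the known minimum at [1, 1, 1, 1, 1]
```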
On Mon, Apr 2, 2018 at 2:50 PM, Warren Weckesser <warren.weckesser@gmail.com> wrote:
[clip]
In case we can reach agreement on the addition of the 'datasets' subpackage, I've created a pull request that implements it: https://github.com/scipy/scipy/pull/8707 The changes are set up to be included in 1.1. That would be nice; otherwise we have to deprecate the new misc.electrocardiogram after just one release, when the data files are moved to their new home in 1.2.
Warren
On Wed, 2018-04-11 at 02:12 -0400, Warren Weckesser wrote: [clip]
It's not going to be in 1.1, as we don't have agreement to include it yet, and it does not appear to be urgent. That the electrocardiogram would need to move is probably not a big issue in practice, given that it's a new addition. Pauli
On Mon, Apr 2, 2018 at 11:50 AM, Warren Weckesser <warren.weckesser@gmail.com> wrote:
[clip]
(c) We actually *use* the dataset in one of *our* docstrings or tutorials. I don't think our datasets package should become a repository of interesting scientific data with no connection to the scipy code. Its purpose should be to enrich our documentation. (Note that by this criterion, the recently added ECG signal would not qualify!)
I'd add the criterion that we should *only* use any dataset in the docs. Hence there are zero internal imports, and the whole datasets submodule can then very simply be stripped for space-constrained usage scenarios. (In those cases a separate package would help even more.)
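A sanity check for that criterion could look something like this (a sketch, not an actual SciPy test; it assumes the proposed scipy.datasets name):

import sys

# Import a few representative subpackages...
import scipy.ndimage
import scipy.signal
import scipy.stats

# ...and verify that none of them dragged the datasets submodule along.
assert not any(name.startswith("scipy.datasets") for name in sys.modules)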
To summarize: I'm in favor of scipy.datasets, a conservatively curated subpackage containing well-known datasets.
This rationale and the implementation in https://github.com/scipy/scipy/pull/8707 are fairly convincing. I'll change my vote to +0.5 for scipy.datasets.
P.S. I should add that I'm not in favor of putting code in scipy that fetches data from the web. That type of data retrieval could be useful, but it seems more appropriate for a package that is independent of scipy.
+1 Ralf
On Sat, Apr 28, 2018 at 10:46 PM Ralf Gommers <ralf.gommers@gmail.com> wrote:
On Mon, Apr 2, 2018 at 11:50 AM, Warren Weckesser <warren.weckesser@gmail.com> wrote: [clip]
I'd add the criterion that we should *only* use any dataset in the docs. Hence there are zero internal imports, and the whole datasets submodule can then very simply be stripped for space-constrained usage scenarios. (In those cases a separate package would help even more.)
I believe that one of the motivations for adding the ECG dataset was to make some of the scipy.signal unit tests more realistic. Is that something you'd like to forbid? On the one hand, if you're strapped for space, you probably want to remove the test suites as well. On the other hand, you do want to be able to test your stripped installation! -- Robert Kern
On Sat, Apr 28, 2018 at 11:21 PM, Robert Kern <robert.kern@gmail.com> wrote:
[clip]
I believe that one of the motivations for adding the ECG dataset was to make some of the scipy.signal unit tests more realistic. Is that something you'd like to forbid? On the one hand, if you're strapped for space, you probably want to remove the test suites as well. On the other hand, you do want to be able to test your stripped installation!
Hmm, tough question. Ideally I'd like to say yes; however, we do need test data in some cases. In practice I think one would want to strip the test suite anyway; scipy/special/tests/data/*.npz is over 1 MB already. So let's say that importing from within tests is okay. Ralf
On 29.04.2018 08:21, Robert Kern wrote:
[clip]
Right now, the ECG dataset is not used in unit tests. Do you perhaps mean the benchmark suite? That could be another use case for datasets within SciPy (e.g. #8769). Best regards, Lars
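For what it's worth, an airspeed-velocity benchmark built on the ECG sample might look roughly like this (a sketch; it assumes the recently merged misc.electrocardiogram and the new signal.find_peaks, and the class and method names here are made up):

from scipy.misc import electrocardiogram
from scipy.signal import find_peaks

class ECGPeaks:
    """Hypothetical asv benchmark exercising find_peaks on real ECG data."""

    def setup(self):
        self.ecg = electrocardiogram()  # ~5 minutes of ECG sampled at 360 Hz

    def time_find_peaks(self):
        # distance=150 samples keeps detected beats at least ~0.4 s apart.
        find_peaks(self.ecg, distance=150)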
On Sun, Apr 29, 2018 at 1:14 AM Lars G. <lagru@mailbox.org> wrote:
On 29.04.2018 08:21, Robert Kern wrote:
[clip]
Right now, the ECG dataset is not used in unit tests. Do you perhaps mean the benchmark suite?
Yes, I misremembered the comment about making the benchmarks more realistic as making the tests more realistic. Nonetheless, my point remains: a fairly reasonable motivation for having datasets is the unit tests, as statsmodels does. -- Robert Kern
There are lots of possible datasets that one could add. For example, it might be worthwhile providing the NIST Statistical Reference Datasets, such as those for non-linear regression (a sketch follows this message). On 30 April 2018 at 13:10, Robert Kern <robert.kern@gmail.com> wrote:
[clip]
-- _____________________________________ Dr. Andrew Nelson _____________________________________
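To make the NIST suggestion concrete: a StRD problem such as Misra1a, with model y = b1*(1 - exp(-b2*x)), would drop straight into a scipy.optimize.curve_fit example. A sketch with placeholder values rather than the certified data:

import numpy as np
from scipy.optimize import curve_fit

def misra1a(x, b1, b2):
    # The NIST StRD Misra1a regression model.
    return b1 * (1.0 - np.exp(-b2 * x))

# Placeholder data generated from the model; not the certified NIST values.
x = np.linspace(50.0, 700.0, 14)
y = misra1a(x, 240.0, 5.5e-4)
popt, pcov = curve_fit(misra1a, x, y, p0=(500.0, 1e-4))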
On 31 March 2018 at 02:17, Ralf Gommers <ralf.gommers@gmail.com> wrote:
[clip]
The biggest subpackage is sparse, and there most of the space is taken by _sparsetools.cpython-35m-x86_64-linux-gnu.so. According to `size -A -d`, the biggest sections are debug sections. The same goes for the second biggest subpackage, special. Can it run without those sections? On preliminary checks, it seems that stripping .debug_info and .debug_loc trims the size down from 38 MB to 3.7 MB, and the test suite still passes.
If we really need to trim down the size for installing in things like Lambda, could we have a scipy-lite for production environments, that is the same as scipy but without the unnecessary debug sections? I imagine tracebacks would not be as informative, but that shouldn't matter for production environments. My first thought was to remove docstrings, comments, tests, and data, but maybe they don't save enough to be worth the trouble.
On the topic at hand, I would agree to having a few small datasets to showcase functionality. I think a few kilobytes can go a long way to show and benchmark. As far as I can see, a top-level module is free: it wouldn't add any maintenance burden, and it would make the datasets easier to find.
/David.
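A rough way to reproduce that kind of breakdown from Python, for anyone curious (this walks the installed tree rather than inspecting ELF sections):

import os
from collections import defaultdict

import scipy

root = os.path.dirname(scipy.__file__)
sizes = defaultdict(int)
for dirpath, _, filenames in os.walk(root):
    rel = os.path.relpath(dirpath, root)
    top = "(top level)" if rel == "." else rel.split(os.sep)[0]
    for fn in filenames:
        sizes[top] += os.path.getsize(os.path.join(dirpath, fn))

# Print the five largest subpackages by on-disk size.
for name, nbytes in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print("%-12s %7.1f MB" % (name, nbytes / 1e6))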
On Tue, Apr 3, 2018 at 1:06 AM, Daπid <davidmenhur@gmail.com> wrote:
[clip]
[clip] On preliminary checks, it seems that stripping .debug_info and .debug_loc trims the size down from 38 MB to 3.7 MB, and the test suite still passes.
Should work. That's a lot more gain than I'd realized. Given that we hardly ever get useful gdb tracebacks, it may be worth considering doing that for releases.
If we really need to trim down the size for installing in things like Lambda, could we have a scipy-lite for production environments that is the same as scipy but without the unnecessary debug sections? I imagine tracebacks would not be as informative, but that shouldn't matter for production environments. My first thought was to remove docstrings, comments, tests, and data, but maybe they don't save enough to be worth the trouble.
Recipes for such things are floating around, and it makes sense to do that. I'd rather not maintain an official scipy-lite package, though; instead, we should just make choices within scipy that enable third parties to do that. Ralf
On Sat, 2018-04-28 at 22:58 -0700, Ralf Gommers wrote: [clip]
Should work. That's a lot more gain than I'd realized. Given that we hardly ever get useful gdb tracebacks, it may be worth considering doing that for releases.
I thought we already enabled stripping debug symbols for 1.1.0rc1 (https://github.com/MacPython/scipy-wheels/pull/26), but this indeed doesn't seem to be the case. There's still time to fix it for 1.1.0 if we figure it out soon. Pauli
On Sun, 2018-04-29 at 17:51 +0200, Pauli Virtanen wrote: [clip]
I thought we already enabled stripping debug symbols for 1.1.0rc1, https://github.com/MacPython/scipy-wheels/pull/26 but this indeed doesn't seem to be the case.
There's still time to fix it for 1.1.0 if we figure it out soon.
Ok, I was looking at the wrong files. This is indeed already fixed in 1.1.0rc1, so no further action needed:

$ wget https://files.pythonhosted.org/packages/d3/06/1517f6b3fbdbefaa7ccf31628d3e19530d058aea4abeb6685dcb3dce1733/scipy-1.1.0rc1-cp35-cp35m-manylinux1_x86_64.whl
$ unzip scipy-1.1.0rc1-cp35-cp35m-manylinux1_x86_64.whl
$ du -csh scipy/sparse/_sparsetools*.so
3,3M    scipy/sparse/_sparsetools.cpython-35m-x86_64-linux-gnu.so
$ size -A -d scipy/sparse/_sparsetools*.so
scipy/sparse/_sparsetools.cpython-35m-x86_64-linux-gnu.so :
section                  size     addr
.note.gnu.build-id         36      400
.gnu.hash                  36      440
.dynsym                  1464      480
.dynstr                  1203     1944
.gnu.version              122     3148
.gnu.version_r            144     3272
.rela.dyn                4200     3416
.rela.plt                 864     7616
.init                      14     8480
.plt                      592     8496
.text                 3037580     9088
.fini                       9  3046668
.rodata                  5132  3046688
.eh_frame_hdr           21956  3051820
.eh_frame              261124  3073776
.gcc_except_table       52170  3334900
.init_array                 8  5484544
.fini_array                 8  5484552
.jcr                        8  5484560
.data.rel.ro                8  5484568
.dynamic                  512  5484576
.got                      168  5485088
.got.plt                  312  5485256
.data                    2520  5485568
.bss                       24  5488096
.comment                  137        0
Total                 3390351

Pauli
On Sun, Apr 29, 2018 at 10:13 AM, Pauli Virtanen <pav@iki.fi> wrote:
[clip]
Ok, I was looking at the wrong files. This is indeed already fixed in 1.1.0rc1, so no further action needed:
Interesting, I had managed to miss or forget about that. Looks like it's only partially done, though; a quick check on the wheel linked above shows that the unzipped wheel is still 113 MB and further gains are possible. E.g.

$ ls -l cython_special.cpython-35m-x86_64-linux-gnu.so
... 10694424 Apr 15 15:04 cython_special.cpython-35m-x86_64-linux-gnu.so
$ strip cython_special.cpython-35m-x86_64-linux-gnu.so
$ ls -l cython_special.cpython-35m-x86_64-linux-gnu.so
... 3065680 Apr 29 11:16 cython_special.cpython-35m-x86_64-linux-gnu.so
$ cd ../linalg
$ ls -l _flapack.cpython-35m-x86_64-linux-gnu.so
... 3445392 Apr 15 15:04 _flapack.cpython-35m-x86_64-linux-gnu.so
$ strip _flapack.cpython-35m-x86_64-linux-gnu.so
$ ls -l _flapack.cpython-35m-x86_64-linux-gnu.so
... 1333960 Apr 29 11:17 _flapack.cpython-35m-x86_64-linux-gnu.so

I'll open an issue for that. Ralf
On Sun, 2018-04-29 at 11:21 -0700, Ralf Gommers wrote:
[clip]
Right, the link commands appear different:

/opt/rh/devtoolset-2/root/usr/bin/gfortran -Wall -g -Wall -g -shared
    build/temp.linux-x86_64-3.6/scipy/special/cython_special.o
    build/temp.linux-x86_64-3.6/scipy/special/sf_error.o
    build/temp.linux-x86_64-3.6/build/src.linux-x86_64-3.6/scipy/special/_logit.o
    build/temp.linux-x86_64-3.6/scipy/special/amos_wrappers.o
    build/temp.linux-x86_64-3.6/scipy/special/cdf_wrappers.o
    build/temp.linux-x86_64-3.6/scipy/special/specfun_wrappers.o
    -L/usr/local/lib
    -L/opt/_internal/cpython-3.6.4/lib/python3.6/site-packages/numpy/core/lib
    -Lbuild/temp.linux-x86_64-3.6
    -lopenblas -lopenblas -lsc_amos -lsc_c_misc -lsc_cephes -lsc_mach
    -lsc_cdf -lsc_specfun -lnpymath -lm -lgfortran
    -o build/lib.linux-x86_64-3.6/scipy/special/cython_special.cpython-36m-x86_64-linux-gnu.so
    -Wl,--version-script=build/temp.linux-x86_64-3.6/link-version-scipy.special.cython_special.map

vs.

gcc -pthread -shared -Wl,-strip-all -L/usr/local/include
    build/temp.linux-x86_64-3.6/scipy/special/_comb.o
    -Lbuild/temp.linux-x86_64-3.6
    -o build/lib.linux-x86_64-3.6/scipy/special/_comb.cpython-36m-x86_64-linux-gnu.so
    -Wl,--version-script=build/temp.linux-x86_64-3.6/link-version-scipy.special._comb.map

Note that the gcc link has -Wl,-strip-all while the gfortran link does not. I guess this is some setuptools issue, with Fortran and C/C++ handled differently. FFLAGS is supposed to be set in config.sh, but maybe it is not used for linking. Maybe the strip flags should be added to LDFLAGS too. Pauli
On 30.03.2018 18:06, Pauli Virtanen wrote:
At first sight I think that these two functions (and the third one with ECG signal sample) alone would sound more suitable to be placed in `scipy.ndimage` and `scipy.signal`. Top-level module for them alone sounds overkill, and I'm not sure if discoverability alone is enough.
At the risk of stating the obvious, wouldn't the discoverability of those functions be pretty high considering these datasets are or could be used in many documentation examples and tutorials? Lars
Is there a substitute for the deprecated combinatorics functions? FYI, a few years ago, I wrote a Python combinatorics package to fill in some of the gaps in `itertools`. My package can be found here: https://pypi.python.org/pypi/Combinatorics/1.4.5 Phillip On Thu, Mar 29, 2018 at 12:43 PM, Warren Weckesser <warren.weckesser@gmail.com> wrote:
[clip]
On Fri, 2018-03-30 at 12:20 -0700, Phillip Feldman wrote:
Is there a substitute for the deprecated combinatorics functions?
I think they were moved to scipy.special; IIRC the documentation says where to find them now. Pauli
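For the record, the combinatorial functions do live in scipy.special, e.g.:

from scipy import special

special.comb(10, 3)               # 120.0, the binomial coefficient C(10, 3)
special.comb(10, 3, exact=True)   # 120 as an exact integer
special.perm(10, 3, exact=True)   # 720, partial permutations P(10, 3)
special.factorial(6, exact=True)  # 720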
participants (18)
- Andrew Nelson
- Bennet Fauber
- Charles R Harris
- Daπid
- Eric Larson
- Gael Varoquaux
- Hameer Abbasi
- Ilhan Polat
- josef.pktd@gmail.com
- Lars G.
- Mark Alexander Mikofski
- Pauli Virtanen
- Phillip Feldman
- Ralf Gommers
- Robert Kern
- Scott Sievert
- Stefan van der Walt
- Warren Weckesser