Where this should go depends on the scale.
In this particular case ("scipy.signal currently has no useful realistic signals"), if we add the proposed ~100 kB data file, I suspect that we can greatly improve a large number of our scipy.signal examples. An ECG signal won't be perfect for all of them, but in many cases it will be a lot better and more instructive for users than what we can currently synthesize ourselves (at least while keeping the synthesis sufficiently simple).

Compared to a general dataset-fetching utility, the in-repo approach has clear disadvantages: it is incomplete and it adds to repo size. Its advantages are simpler doc building, access, and maintenance, plus uniformity of functionality across use cases (benchmarks, Debian unit tests, doc building, etc.). On balance, this makes it worth having IMO.

For example, a dataset package also runs into the problem of how much to include.

A proposed rule of thumb: SciPy can have (up to) a couple of small-sized files per module shipped with the repo in cases where such files greatly improve our ability to showcase/test/document functionality (benchmarks/unit tests/docstrings). This forces us to make subjective judgments about what will be sufficiently useful, sufficiently small, and sufficiently impactful for the module, but I think this will be a rare enough phenomenon that it's okay.
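
To make the in-repo idea concrete, here is a minimal sketch of how a small file living next to the code could be loaded. Everything named here is hypothetical (the `load_packaged_signal` helper, the `example_signal.npz` file name, the `signal` key), and a sine wave written to a temporary directory stands in for a real recording. It is not how SciPy actually ships or loads its data.

```python
import os
import tempfile

import numpy as np


def load_packaged_signal(data_dir, name="example_signal.npz"):
    """Load a small signal stored alongside the code (hypothetical layout)."""
    path = os.path.join(data_dir, name)
    with np.load(path) as archive:
        # astype(float) returns a new float64 array, so the data
        # remains usable after the archive is closed.
        return archive["signal"].astype(float)


# Stand-in for a file shipped in the repo: a tiny compressed array.
data_dir = tempfile.mkdtemp()
fake_signal = np.sin(np.linspace(0, 20 * np.pi, 5000)).astype(np.float32)
np.savez_compressed(os.path.join(data_dir, "example_signal.npz"),
                    signal=fake_signal)

signal = load_packaged_signal(data_dir)
size_kb = os.path.getsize(os.path.join(data_dir, "example_signal.npz")) / 1024
print(f"{signal.size} samples, {size_kb:.1f} kB on disk")
```

The point is only that loading such a file needs no network access and no dependency beyond NumPy, which is what keeps doc builds, benchmarks, and unit tests uniform.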

In other words, I propose that scipy.datasets not provide an exhaustive or even extensive resource of data for users, but rather a minimal one for showcasing functionality. This seems consistent with what we already do with ascent/face, in that they improve the image-processing examples.

We've been doing this in scikit-image for a long time, and we now regret having any binary data in the repository.

I have had a similar problem while maintaining MNE-Python, which has some files in the main repo and others in a separate GitHub repo (downloaded separately for testing). I have a similar feeling about the files that live in the repo today. However, for SciPy the problem seems a bit different in scope and scale -- a handful of small files can go a long way for SciPy, which isn't the case for MNE (nor, I would assume, for many functions in scikit-image).

Both scikit-learn and scikit-image offer access to larger datasets.

There are other projects that also do this (MNE has huge ones hosted on osf.io, VisPy hosts data on GitHub). It would be awesome if someone unified all this stuff for cases where you want to deal with getting large datasets, or many different datasets.

My 2c,
Eric