[SciPy-User] scipy central comments

Thu Sep 8 12:52:17 EDT 2011

Back on list.

On Thu, Sep 8, 2011 at 12:43 PM, denis <denis-bz-gg at t-online.de> wrote:
> Skipper,
>  re central data: definitely useful -- see R,
> but it should be separate from scipy-central:
>  don't do everything at once.

We have

https://github.com/statsmodels/statsmodels/tree/master/scikits/statsmodels/datasets

Many of the same datasets are available in R. They could be made available
as a separate package, though some of it (the endog and exog attributes) is
specific to our testing and example needs.

> The functionality oughta include
>    listing: what's available, how big is it ?
>    load / loadtxt to a single array
>    splitting, sanitizing, summarizing: diverse, difficult
>
> BUT scipy-central-data may satisfy no one, in which case forget it.
> (Personally I'd like to spec it first, shoot later, but.)
>
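
The "listing" item could be sketched roughly like this, assuming the
datasets are plain files sitting in one directory. list_datasets and that
layout are hypothetical, not something statsmodels or scikits-learn ships:

```python
import os


def list_datasets(datadir):
    """Return sorted (filename, size_in_bytes) pairs for the files in datadir.

    Hypothetical helper: assumes one dataset per file and ignores
    subdirectories.
    """
    entries = []
    for name in sorted(os.listdir(datadir)):
        path = os.path.join(datadir, name)
        if os.path.isfile(path):
            entries.append((name, os.path.getsize(path)))
    return entries
```

Printing the pairs answers "what's available, how big is it?" without
loading anything.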

What we've used:
http://statsmodels.sourceforge.net/devel/dataset_proposal.html#dataset-proposal

> What I use today is this, ~ 3 pages:
> def getdata( source, N=0, Ntest=0, classcol=-1, centre=0, verbose=0,
>              datadir=Datadir ):
> """ X = getdata( slearn/xx uciml/yy ... classcol=None )
>        findfile, load or loadtxt
>    X, classes = getdata( ... classcol = 0 or -1 )
>        split off classes, astype(int)
>    X, y, Xtest, ytest = getdata( Ntest > 0 )
>        split first N / last Ntest
>    centre:
>        0 noop, 1 -= mean, 2 /= sd, 3 winsorise, 4 winsor + to_11
>
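
The classcol and Ntest splits in that docstring might look something like
this in plain Python. This is only a sketch: rows are lists rather than
arrays, split_classes and split_train_test are made-up names, and treating
N=0 as "everything before the test rows" is my assumption, not denis's
actual code:

```python
def split_classes(rows, classcol=-1):
    """Split the class column off each row and cast the classes to int."""
    X, classes = [], []
    for row in rows:
        row = list(row)  # copy so the caller's rows are left untouched
        classes.append(int(row.pop(classcol)))
        X.append(row)
    return X, classes


def split_train_test(X, y, N=0, Ntest=0):
    """Return X, y, Xtest, ytest: the first N rows and the last Ntest rows.

    Assumption: N=0 means "all rows before the last Ntest".
    """
    if not 0 < Ntest <= len(X):
        raise ValueError("Ntest must be between 1 and len(X)")
    ntrain = N if N > 0 else len(X) - Ntest
    return X[:ntrain], y[:ntrain], X[-Ntest:], y[-Ntest:]
```
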
> def findfile( filename, datadir ):
> """ try datadir + filename + .npy .csv .csv.gz .txt .txt.gz
>    expand $vars, glob  # cf openplus
>
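
For reference, the findfile lookup described above could be sketched with
just the standard library. Trying the bare filename first and returning
None on a miss are my assumptions; the original may well do something
different:

```python
import glob
import os

# Extensions from the docstring above; the leading "" (bare filename) is an
# assumption added for this sketch.
EXTENSIONS = ("", ".npy", ".csv", ".csv.gz", ".txt", ".txt.gz")


def findfile(filename, datadir):
    """Return the first existing path for datadir/filename + ext, else None.

    $VARS and ~ are expanded, and glob patterns in filename are allowed.
    """
    base = os.path.join(datadir, filename)
    base = os.path.expandvars(os.path.expanduser(base))
    for ext in EXTENSIONS:
        matches = sorted(glob.glob(base + ext))
        if matches:
            return matches[0]
    return None
```

With this, findfile("iris", "$DATA/uciml") would pick up iris.npy ahead of
iris.csv if both exist, since the extensions are tried in order.
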
> Fwiw,
> http://stackoverflow.com/questions/6321476/python-api-to-load-various-machine-learning-datasets
> got no satisfactory answer
> but you might ask the scikits-learn guys again, see where they are
> today.
>

Our datasets module and theirs both evolved from David C.'s original
proposal, I believe.

Skipper