[SciPy-User] scipy central comments
Skipper Seabold
jsseabold at gmail.com
Thu Sep 8 12:52:17 EDT 2011
Back on list.
On Thu, Sep 8, 2011 at 12:43 PM, denis <denis-bz-gg at t-online.de> wrote:
> Skipper,
> re central data: definitely useful -- see R,
> but it should be separate from scipy-central:
> don't do everything at once.
We have
https://github.com/statsmodels/statsmodels/tree/master/scikits/statsmodels/datasets
Many of the same datasets are available in R. These could be made available
as a separate package, though some of it (the endog and exog attributes) is
specific to our testing and example needs.
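For anyone unfamiliar with the convention: endog is the response
(endogenous) variable and exog holds the explanatory (exogenous) variables.
A minimal sketch of that split, assuming a hypothetical Dataset container and
split_columns helper (illustrative names only, not the actual statsmodels code):

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Dataset:
    """Hypothetical container; the real statsmodels class differs."""
    names: List[str]          # column names, response column first
    endog: List[float]        # response (endogenous) variable
    exog: List[List[float]]   # explanatory (exogenous) variables, one row per observation


def split_columns(names: Sequence[str], rows: Sequence[Sequence[float]],
                  endog_idx: int = 0) -> Dataset:
    """Split a rectangular table into endog (one column) and exog (the rest)."""
    endog = [row[endog_idx] for row in rows]
    exog = [[v for j, v in enumerate(row) if j != endog_idx] for row in rows]
    ordered = [names[endog_idx]] + [n for j, n in enumerate(names) if j != endog_idx]
    return Dataset(names=ordered, endog=endog, exog=exog)


# Made-up numbers, purely for illustration.
data = split_columns(["y", "x1", "x2"], [[1.0, 10.0, 100.0], [2.0, 20.0, 200.0]])
```

A general-purpose data package would probably want the plain table and leave
this kind of split to the caller.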
> The functionality oughta include
> listing: what's available, how big is it ?
> load / loadtxt to a single array
> splitting, sanitizing, summarizing: diverse, difficult
>
> BUT scipy-central-data may satisfy no one, in which case forget it.
> (Personally I'd like to spec it first, shoot later, but.)
>
What we've used:
http://statsmodels.sourceforge.net/devel/dataset_proposal.html#dataset-proposal
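The "listing" item from the quoted list is the easy part. A rough sketch using
only the standard library, assuming datasets are plain files in one directory
(list_datasets and the toy file are made up for illustration):

```python
import os
import tempfile


def list_datasets(datadir):
    """Return {filename: size in bytes} for regular files directly in datadir."""
    sizes = {}
    for name in sorted(os.listdir(datadir)):
        path = os.path.join(datadir, name)
        if os.path.isfile(path):
            sizes[name] = os.path.getsize(path)
    return sizes


# Demo on a throwaway directory holding one made-up file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "toy.csv"), "w", newline="") as f:
        f.write("a,b\n1,2\n")
    listing = list_datasets(tmp)
```

Splitting, sanitizing and summarizing are where the per-dataset differences
show up, so they are left out here.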
> What I use today is this, ~ 3 pages:
> def getdata( source, N=0, Ntest=0, classcol=-1, centre=0, verbose=0,
>         datadir=Datadir ):
>     """ X = getdata( slearn/xx uciml/yy ... classcol=None )
>             findfile, load or loadtxt
>         X, classes = getdata( ... classcol = 0 or -1 )
>             split off classes, astype(int)
>         X, y, Xtest, ytest = getdata( Ntest > 0 )
>             split first N / last Ntest
>         centre:
>             0 noop, 1 -= mean, 2 /= sd, 3 winsorise, 4 winsor + to_11
>     """
>
> def findfile( filename, datadir ):
>     """ try datadir + filename + .npy .csv .csv.gz .txt .txt.gz
>         expand $vars, glob  # cf openplus
>     """
>
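The findfile lookup described in that docstring could look roughly like the
following. This is a guess at the behaviour from the docstring alone, not the
original code; trying the bare filename first and returning None on failure
are assumptions of mine.

```python
import glob
import os
import tempfile

# Extensions tried in order, taken from the quoted docstring; the leading
# empty string (try the name as given first) is an added assumption.
EXTENSIONS = ("", ".npy", ".csv", ".csv.gz", ".txt", ".txt.gz")


def findfile(filename, datadir):
    """Return the first existing match for datadir/filename + extension, else None."""
    base = os.path.expandvars(os.path.join(datadir, filename))  # expand $vars
    for ext in EXTENSIONS:
        matches = sorted(glob.glob(base + ext))  # wildcards in filename are allowed
        if matches:
            return matches[0]
    return None


# Demo on a throwaway directory holding one empty, made-up file.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "iris.csv"), "w").close()
    found_name = os.path.basename(findfile("iris", tmp))
    missing = findfile("nosuchfile", tmp)
```

Raising an error instead of returning None may be friendlier when the caller
goes straight on to load the file.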
> Fwiw,
> http://stackoverflow.com/questions/6321476/python-api-to-load-various-machine-learning-datasets
> got no satisfactory answer
> but you might ask the scikits-learn guys again, see where they are
> today.
>
Our datasets module and theirs both evolved from David C.'s original
proposal, I believe.
Skipper