[SciPy-User] [ANN] scikit.statsmodels 0.2.0 release

Fri Feb 19 12:45:04 EST 2010

On Fri, Feb 19, 2010 at 12:19 PM, Gael Varoquaux
<gael.varoquaux at normalesup.org> wrote:
> On Fri, Feb 19, 2010 at 10:57:01AM -0600, Bruce Southey wrote:
>> Will it end up as cython?
>
> I am trying to convince the engineer who is doing the work to go down
> that way but he does like cython. I am hesitant to impose my point of
> view on a highly qualified engineer, but I don't like having this
> hand-written C binding, I must admit.
>
>> (I just used the supplied Python bindings of libsvm so this could be
>> interesting.)
>
> Well, we provide much more, like access to the weights, or vectorized
> predict :).
>
>> > Let's say that the focus of scikit.learn and statsmodels is most
>> > probably going to be slightly different.
>
>> Having done both (with papers), I find this type of comment presumptuous
>> because the same concepts underlie both. What I would like to avoid
>> is having different user syntax for the same basic model. For
>> example, with logistic regression in SAS you have to be careful of which
>> is the default event setting as it varies across procedures. At least
>> these SAS procedures use the same unmodified dataset unlike some of the
>> R packages that do lars/lasso.
>
> Indeed, I agree. We'll try to look very closely at statsmodels and not
> differ if we can. However, (rant ahead), we hear this story everywhere we
> go: match our API. So we are struggling between pymvpa, mdp and statsmodels
> (I am probably forgetting a few) that all differ slightly. We are willing
> to adapt as long as it is not damaging for our use cases, but it would be
> nice to have a common discussion.
>
> Also, there will be differences in the APIs, as far as I understand the
> statsmodels API. For instance, I believe that constructors of models
> should work without being passed the data (the data could be optional). The
> reason is that on-line estimators shouldn't be passed data at
> initialisation. As a consequence, maybe the 'fit' method should
> take the data... All this is quite open to me, and I don't want to draw
> any premature conclusions.
>

Just a quick comment (disclaimer: all my own thoughts and
misunderstandings...feel free to correct me).  Historically, the
statsmodels package accepted a design matrix at model instantiation,
and you then supplied your dependent variable to the fit method.  To my
mind, though, this didn't seem to make much sense for how I think of a
model (probably somewhat discipline specific?).  For the estimators
that we have we are usually fitting a parametric model in order to
test a given theory about the data generating process.  The model
doesn't make much sense to me without the data (my data is not
real-time and I am not data mining).  Again though I want to make the
package as useful to others as possible (without alienating those who
think of models as I do), so of course suggestions on how to improve
the API or make it more general are more than welcome.
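
To make the two conventions concrete, here is a minimal sketch.  The
class names `DataBoundOLS` and `DeferredOLS` are hypothetical and are
not the actual statsmodels or scikit.learn API; the sketch assumes
plain numpy arrays and ordinary least squares via `numpy.linalg.lstsq`.
The first class takes the data in the constructor, the second takes
only options in the constructor and receives the data in `fit`.

```python
import numpy as np


class DataBoundOLS:
    """Hypothetical: the data is passed when the model is constructed."""

    def __init__(self, endog, exog):
        self.endog = np.asarray(endog, dtype=float)
        self.exog = np.asarray(exog, dtype=float)

    def fit(self):
        # Least-squares solution of exog @ params = endog
        params, *_ = np.linalg.lstsq(self.exog, self.endog, rcond=None)
        return params


class DeferredOLS:
    """Hypothetical: the constructor takes only options; fit takes the data."""

    def __init__(self, fit_intercept=False):
        self.fit_intercept = fit_intercept
        self.params_ = None

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        if self.fit_intercept:
            # Prepend a column of ones for the intercept term
            X = np.column_stack([np.ones(len(X)), X])
        self.params_, *_ = np.linalg.lstsq(X, y, rcond=None)
        return self


# Toy data generated exactly from y = 1 + 2*x
x = np.arange(5.0)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

params_a = DataBoundOLS(y, X).fit()
params_b = DeferredOLS().fit(X, y).params_
```

Both objects compute the same estimate; the only difference is when
the data enters.  The second style allows a model object to exist
before any data is available, which is the on-line use case raised
above, while the first ties the model to one dataset from the start.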

> We have not done any API design so far, because we are trying to
> get a feel of what the existing APIs are, and because we want to have
> working code to throw use cases at. Also, we are extremely open to
> comments, just subscribe to the scikit.learn mailing list (not everybody
> involved with scikit learn follows this high-traffic mailing list).
>
>> >> What would be nice is agreement on input data types between learn
>> >> and statsmodels especially for things like logistic regression. While I
>> >> understand the need for duplicate functions, it may be desirable to share
>> >> at least some code since both code bases are still relatively 'new'.
>
>> > Well, as far as I am concerned, data types are numpy arrays. I am wary
>> > of implementing higher level abstractions. It's more the APIs that may
>> > differ, and that we will have to keep in sync.
>
>> I do agree especially now that I have learnt the 'array' approach of
>> doing things.
>
>> In some way my view of integration of things is Zelig -not that I have
>> really looked at it (as it is in R) :
>> http://gking.harvard.edu/zelig/
>
> Well, let us try not to have to build a common API and integration a
> posteriori, but build them right from the start. A bit of API work is well
> worth the effort, I believe. And please feel free to pitch in.
>

To my mind, the burden is probably more on statsmodels to provide an
interface to the learn code, as we would be more likely to take
advantage of your routines.

>> The seamless ability to link packages is rather appealing and both
>> scikits share at least numpy.
>
> And scipy, I believe.
>
> Cheers,
>
> Gaël