[scikit-learn] Can fit a model with a target array of probabilities?
Stuart Reynolds
stuart at stuartreynolds.net
Thu Oct 5 12:34:51 EDT 2017
Thanks Josef. That was very useful.
result.remove_data() reduces a 5-parameter Logit result object from
megabytes to 5KB (compared to a minimum uncompressed size of ~320 bytes
for the parameters themselves). That's a big improvement. I'll
experiment with what you suggest, since this is still >10x larger than
necessary; I suspect the difference is mostly attribute names.
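For anyone who wants to reproduce the measurement, a minimal sketch
(synthetic data; exact sizes will vary by statsmodels version):

import pickle
import numpy as np
import statsmodels.api as sm

# Synthetic data, just to produce a result object to measure.
X = sm.add_constant(np.random.rand(100000, 4))
y = (np.random.rand(100000) < 0.5).astype(float)

result = sm.Logit(y, X).fit()
print(len(pickle.dumps(result)))  # megabytes: holds references to y and X
result.remove_data()              # drop all data-sized arrays
print(len(pickle.dumps(result)))  # kilobytes: params and metadata only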
I don't mind the lack of multinomial support. I've often had better
results combining independent models for each class.
I'll experiment with the different solvers. When I tried the Logit
model in the past, its fit function only exposed a maxiter and not a
tolerance, meaning I had to set maxiter very high. The newer
statsmodels GLM module looks great and seems to solve this.
For others who come this way, I think the magic for ridge logistic
regression is:

from statsmodels.genmod.generalized_linear_model import GLM
from statsmodels.genmod import families
from statsmodels.genmod.families import links

model = GLM(y, Xtrain, family=families.Binomial(link=links.Logit()))
result = model.fit_regularized(method='elastic_net',
                               alpha=l2weight, L1_wt=0.0, cnvrg_tol=...)
result.remove_data()
result.predict(Xtest)
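(Note that Binomial's default link is already logit, so the link
argument can be omitted, and with L1_wt=0.0 the elastic net penalty
reduces to a pure L2/ridge penalty.)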
One last thing -- it's clear that it should be possible to do something
like scikit's LogisticRegressionCV in order to quickly optimize a
single regularization parameter by re-using past coefficients.
Are there any wrappers in statsmodels for doing this or should I roll my own?
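If rolling my own is the answer, here's a rough sketch of what I have
in mind -- a manual regularization path with warm starts (the alpha
grid and the ytrain/Xtrain names are placeholders):

import numpy as np
from statsmodels.genmod.generalized_linear_model import GLM
from statsmodels.genmod import families

alphas = np.logspace(-4, 1, 10)  # placeholder penalty grid
start = None
fits = []
for a in alphas:
    model = GLM(ytrain, Xtrain, family=families.Binomial())
    res = model.fit_regularized(method='elastic_net', alpha=a,
                                L1_wt=0.0, start_params=start)
    start = res.params  # re-use coefficients to warm-start the next fit
    fits.append((a, res))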
- Stu
On Wed, Oct 4, 2017 at 3:43 PM, <josef.pktd at gmail.com> wrote:
>
>
> On Wed, Oct 4, 2017 at 4:26 PM, Stuart Reynolds <stuart at stuartreynolds.net>
> wrote:
>>
>> Hi Andy,
>> Thanks -- I'll give statsmodels another go.
>> I remember I had some fitting speed issues with it in the past, and
>> also some issues related to their models keeping references to the
>> data (= disaster for serialization and multiprocessing) -- although
>> that was a long time ago.
>
>
> The second has not changed and will not change, but there is a remove_data
> method that deletes all references to full, data-sized arrays. However, once
> the data is removed, it is no longer possible to compute any new result
> statistics, since these are almost all computed lazily.
> The fitting speed depends a lot on the optimizer, the convergence criteria,
> the difficulty of the problem, and the availability of good starting
> parameters. Almost all nonlinear estimation problems use the scipy
> optimizers; any of the unconstrained optimizers can be used. There are no
> optimized special methods for cases with a very large number of features.
>
> Multinomial/multiclass models don't support a continuous response (yet);
> all other GLM and discrete models allow continuous data in the interval
> extension of the domain.
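>
> For example, a Binomial GLM accepts a fractional response directly; a
> minimal sketch with made-up data:
>
> import numpy as np
> import statsmodels.api as sm
>
> X = sm.add_constant(np.random.rand(100, 3))
> y = np.random.rand(100)  # continuous target in [0, 1]
> res = sm.GLM(y, X, family=sm.families.Binomial()).fit()
> print(res.predict(X)[:5])  # fitted means, also in (0, 1)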
>
> Josef
>
>
>>
>> - Stuart
>>
>> On Wed, Oct 4, 2017 at 1:09 PM, Andreas Mueller <t3kcit at gmail.com> wrote:
>> > Hi Stuart.
>> > There is no interface to do this in scikit-learn (and maybe we should add
>> > this to the FAQ).
>> > Yes, in principle this would be possible with several of the models.
>> >
>> > I think statsmodels can do that, and I think I saw another glm package
>> > for Python that does that?
>> >
>> > It's certainly a legitimate use-case but would require substantial
>> > changes to the code. I think so far we decided not to support
>> > this in scikit-learn. Basically we don't have a concept of a link
>> > function, and it's a concept that only applies to a subset of models.
>> > We try to have a consistent interface for all our estimators, and
>> > this doesn't really fit well within that interface.
>> >
>> > Hth,
>> > Andy
>> >
>> >
>> > On 10/04/2017 03:58 PM, Stuart Reynolds wrote:
>> >>
>> >> I'd like to fit a model that maps a matrix of continuous inputs to a
>> >> target that's between 0 and 1 (a probability).
>> >>
>> >> In principle, I'd expect logistic regression to work out of the
>> >> box with no modification (although it's often posed as being strictly
>> >> for classification, its loss function allows fitting targets anywhere
>> >> in the range 0 to 1, not strictly zero or one).
>> >>
>> >> However, scikit's LogisticRegression and LogisticRegressionCV reject
>> >> target arrays that are continuous. Other LR implementations allow a
>> >> matrix of probability estimates. Looking at:
>> >>
>> >>
>> >> http://scikit-learn-general.narkive.com/4dSCktaM/using-logistic-regression-on-a-continuous-target-variable
>> >> and the fix here:
>> >> https://github.com/scikit-learn/scikit-learn/pull/5084, which disallows
>> >> continuous targets, it looks like there was some reason for this. So
>> >> ... I'm looking for alternatives.
>> >>
>> >> SGDClassifier allows log loss and (if I understood the docs correctly)
>> >> adds a logistic link function, but it also rejects continuous targets.
>> >> Oddly, SGDRegressor only allows ‘squared_loss’, ‘huber’,
>> >> ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’, and doesn't
>> >> seem to offer a logistic link function.
>> >>
>> >> In principle, GLMs allow this, but scikit's docs say its GLM models
>> >> only allow strictly linear functions of their inputs and don't support
>> >> a logistic link function. The docs direct people to the
>> >> LogisticRegression class for this case.
>> >>
>> >> In R, there is:
>> >>
>> >> glm(Total_Service_Points_Won/Total_Service_Points_Played ~ ...,
>> >>     family = binomial(link = logit),
>> >>     weights = Total_Service_Points_Played)
>> >> which would be ideal.
>> >>
>> >> Is something similar available in scikit? (Or any continuous model
>> >> that takes a 0 to 1 target and outputs a 0 to 1 prediction?)
>> >>
>> >> I was surprised to see that the implementation of
>> >> CalibratedClassifierCV(method="sigmoid") uses an internal
>> >> implementation of logistic regression to do its logistic regression --
>> >> which I could use, although I'd prefer a user-facing library.
>> >>
>> >> Thanks,
>> >> - Stuart