[scikit-learn] Can fit a model with a target array of probabilities?

Wed Oct 4 16:26:58 EDT 2017

Hi Andy,
Thanks -- I'll give statsmodels another go.
I remember I had some fitting speed issues with it in the past, and
also some issues related to their models keeping references to the data
(=disaster for serialization and multiprocessing) -- although that was
a long time ago.
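
For reference, here is a minimal, dependency-free sketch of the kind of
fit I'm after: minimizing cross-entropy against fractional targets with
plain batch gradient descent. The helper names (fit_soft_logistic,
predict_proba) and the synthetic data are made up for illustration and
are not part of scikit-learn or statsmodels. The optional weights
argument is meant to play the role of the trial counts in the R glm
call quoted below.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_soft_logistic(X, p, weights=None, lr=1.0, n_iter=2000):
    """Minimise weighted cross-entropy for targets p in [0, 1].

    Hypothetical helper: plain batch gradient descent, no regularisation.
    """
    n, d = len(X), len(X[0])
    weights = [1.0] * n if weights is None else list(weights)
    total = sum(weights)
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, pi, wi in zip(X, p, weights):
            # The gradient of cross-entropy w.r.t. the linear score is (q - p),
            # which is well defined for any p in [0, 1], not just 0 or 1.
            q = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = wi * (q - pi)
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / total for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / total
    return w, b


def predict_proba(X, w, b):
    """Return predicted probabilities, one per row of X."""
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]


# Synthetic, noise-free check: targets are exact probabilities
# generated from known (assumed) coefficients.
rng = random.Random(0)
true_w, true_b = [1.5, -2.0], 0.3
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(100)]
p = predict_proba(X, true_w, true_b)

w_hat, b_hat = fit_soft_logistic(X, p)
print(w_hat, b_hat)
```

The step size and iteration count are arbitrary choices that happen to
suit this small, well-scaled example; a real solver would use something
like IRLS or L-BFGS instead.
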
- Stuart

On Wed, Oct 4, 2017 at 1:09 PM, Andreas Mueller <t3kcit at gmail.com> wrote:
> Hi Stuart.
> There is no interface to do this in scikit-learn (and maybe we should add
> this to the FAQ).
> Yes, in principle this would be possible with several of the models.
>
> I think statsmodels can do that, and I think I saw another glm package
> for Python that does that?
>
> It's certainly a legitimate use-case but would require substantial
> changes to the code. I think so far we decided not to support
> this in scikit-learn. Basically we don't have a concept of a link
> function, and it's a concept that only applies to a subset of models.
> We try to have a consistent interface for all our estimators, and
> this doesn't really fit well within that interface.
>
> Hth,
> Andy
>
>
> On 10/04/2017 03:58 PM, Stuart Reynolds wrote:
>>
>> I'd like to fit a model that maps a matrix of continuous inputs to a
>> target that's between 0 and 1 (a probability).
>>
>> In principle, I'd expect logistic regression should work out of the
>> box with no modification (although it's often posed as being strictly
>> for classification, its loss function allows for fitting targets
>> anywhere in the range 0 to 1, not strictly zero or one).
>>
>> However, scikit's LogisticRegression and LogisticRegressionCV reject
>> target arrays that are continuous. Other LR implementations allow a
>> matrix of probability estimates. Looking at:
>>
>> http://scikit-learn-general.narkive.com/4dSCktaM/using-logistic-regression-on-a-continuous-target-variable
>> and the fix here:
>> https://github.com/scikit-learn/scikit-learn/pull/5084, which disables
>> continuous targets, it looks like there was some reason for this. So
>> ... I'm looking for alternatives.
>>
>> SGDClassifier allows log loss and (if I understood the docs correctly)
>> adds a logistic link function, but also rejects continuous targets.
>> Oddly, SGDRegressor only allows ‘squared_loss’, ‘huber’,
>> ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’, and doesn't
>> seem to offer a logistic function.
>>
>> In principle, GLMs allow this, but scikit's docs say its GLM models
>> only allow strictly linear functions of their input and don't allow
>> a logistic link function. The docs direct people to the
>> LogisticRegression class for this case.
>>
>> In R, there is:
>>
>> glm(Total_Service_Points_Won/Total_Service_Points_Played ~ ... ,
>>      family = binomial(link=logit), weights = Total_Service_Points_Played)
>> which would be ideal.
>>
>> Is something similar available in scikit? (Or any continuous model
>> that takes a 0 to 1 target and outputs a 0 to 1 target?)
>>
>> I was surprised to see that the implementation of
>> CalibratedClassifierCV(method="sigmoid") uses an internal
>> implementation of logistic regression to do its logistic regressing --
>> which I can use, although I'd prefer to use a user-facing library.
>>
>> Thanks,
>> - Stuart
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn