[scikit-learn] Can fit a model with a target array of probabilities?

josef.pktd at gmail.com josef.pktd at gmail.com
Thu Oct 5 15:27:22 EDT 2017


On Thu, Oct 5, 2017 at 2:52 PM, Stuart Reynolds <stuart at stuartreynolds.net>
wrote:

> Turns out sm.Logit does allow setting the tolerance.
> Some quick and dirty time profiling of different methods on a 100k
> * 30 features dataset, with different solvers and losses:
>
> sklearn.LogisticRegression: l1 1.13864398003 (seconds)
> sklearn.LogisticRegression: l2 0.0538778305054
> sm.Logit l1 0.0922629833221  # Although didn't converge
> sm.Logit l1_cvxopt_cp 0.958268165588
> sm.Logit newton 0.133476018906
> sm.Logit nm 0.369864940643
> sm.Logit bfgs 0.105798006058
> sm.Logit lbfgs 0.06241106987
> sm.Logit powell 1.64219808578
> sm.Logit cg 0.2184278965
> sm.Logit ncg 0.216138124466
> sm.Logit basinhopping 8.82164621353
> sm.GLM.fit IRLS 0.544688940048
> sm.GLM L2: 1.29778695107
>
> I've been getting good results from sm.Logit.fit (although unregularized).
> statsmodels GLM seems a little slow. Not sure why.
>
> My benchmark may be a little apples-to-oranges, since the stopping
> criteria probably aren't comparable.
>

I think that's a problem with GLM IRLS.
AFAIK (though I never fully tested it), the objective function is
proportional to the number of observations, so the convergence
criterion effectively becomes tighter as nobs increases.
I can't find the issue or PR discussion anymore, but one of our
contributors fixed maxiter at 15 or something like that for IRLS with
around 4 to 5 million observations and mostly categorical explanatory
variables in his application.

Unfortunately there was no upfront design, and the decisions differ across models:
https://github.com/statsmodels/statsmodels/issues/2825
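
A rough workaround sketch (untested; just to illustrate capping IRLS the way
that contributor did) would be:

    import statsmodels.api as sm

    model = sm.GLM(y, X, family=sm.families.Binomial())
    # cap the IRLS iterations instead of relying on the
    # nobs-dependent convergence criterion
    res = model.fit(method="IRLS", maxiter=15)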

Josef



>
>
> For tiny models, which I'm also building: 100 samples, 5 features
>
> sklearn.LogisticRegression: l1 0.00137376785278
> sklearn.LogisticRegression: l2 0.00167894363403
> sm.Logit l1 0.0198900699615
> sm.Logit l1_cvxopt_cp 0.162448167801
> sm.Logit newton 0.00689911842346
> sm.Logit nm 0.0754928588867
> sm.Logit bfgs 0.0210938453674
> sm.Logit lbfgs 0.0156588554382
> sm.Logit powell 0.0161390304565
> sm.Logit cg 0.00759506225586
> sm.Logit ncg 0.00541186332703
> sm.Logit basinhopping 0.3076171875
> sm.GLM.fit IRLS 0.00902199745178
> sm.GLM L2: 0.0208361148834
>
> I couldn't get sm.GLM.fit to work with non-"IRLS" solvers (it hits a
> division by zero).
>
> ----
>
> import sklearn.datasets
> from sklearn.preprocessing import StandardScaler
> X, y = sklearn.datasets.make_classification(n_samples=10000,
> n_features=30, random_state=123)
> X = StandardScaler(copy=True, with_mean=True, with_std=True).fit_transform(
> X)
>
> import time
> tol = 0.0001
> maxiter = 100
> DISP = 0
>
>
> if 1: # sk.LogisticRegression
>     import sklearn
>     from sklearn.linear_model import LogisticRegression
>
>     for method in ["l1", "l2"]: # TODO, add solvers:
>         t = time.time()
>         model = LogisticRegression(C=1, tol=tol, max_iter=maxiter,
> penalty=method)
>         model.fit(X,y)
>         print "sklearn.LogisticRegression:", method, time.time() - t
>
>
>
>
> if 1: # sm.Logit.fit_regularized
>     from statsmodels.discrete.discrete_model import Logit
>     for method in ["l1", "l1_cvxopt_cp"]:
>         t = time.time()
>         model = Logit(y,X)
>         result = model.fit_regularized(method=method, maxiter=maxiter,
>                                        alpha=1.,
>                                        abstol=tol,
>                                        acc=tol,
>                                        tol=tol, gtol=tol, pgtol=tol,
>                                        disp=DISP)
>         print "sm.Logit", method, time.time() - t
>
> if 1: # sm.Logit.fit
>     from statsmodels.discrete.discrete_model import Logit
>
>     SOLVERS = ["newton", "nm",
> "bfgs","lbfgs","powell","cg","ncg","basinhopping",]
>     for method in SOLVERS:
>         t = time.time()
>         model = Logit(y,X)
>         result = model.fit(method=method, maxiter=maxiter,
>                            niter=maxiter,
>                            ftol=tol,
>                            tol=tol, gtol=tol, pgtol=tol,  # Hmmm.. needs to be reviewed.
>                            disp=DISP)
>         print "sm.Logit", method, time.time() - t
>
> if 1: # sm.GLM.fit
>     from statsmodels.genmod.generalized_linear_model import GLM
>     from statsmodels.genmod.generalized_linear_model import families
>     for method in ["IRLS"]:
>         t = time.time()
>         model = GLM(y, X, family=families.Binomial(link=
> families.links.logit))
>         result = model.fit(method=method, cnvrg_tol=tol,
> maxiter=maxiter, full_output=False, disp=DISP)
>         print "sm.GLM.fit", method, time.time() - t
>
>
> if 1: # GLM.fit_regularized
>     from statsmodels.genmod.generalized_linear_model import GLM
>     from statsmodels.genmod.generalized_linear_model import families
>     t = time.time()
>     model = GLM(y, X, family=families.Binomial(link=families.links.logit))
>     result = model.fit_regularized(method='elastic_net', alpha=1.0,
> L1_wt=0.0, cnvrg_tol=tol, maxiter=maxiter)
>     print "sm.GLM L2:", time.time() - t
>
>
>
> if 0: # GLM.fit
>     # Hits division by zero.
>     SOLVERS = ["bfgs","lbfgs", "newton", "nm",
> "powell","cg","ncg","basinhopping",]
>     from statsmodels.genmod.generalized_linear_model import GLM
>     from statsmodels.genmod.generalized_linear_model import families
>     for method in SOLVERS:
>         t = time.time()
>         model = GLM(y, X, family=families.Binomial(link=
> families.links.logit))
>         result = model.fit(method=method,
> #                           scale="X2",
> #                            alpha=1.,
> #                            abstol=tol,
> #                            acc=tol,
> #                            tol=tol, gtol=tol, pgtol=tol,
> #                            maxiter=maxiter,
> #                            #full_output=False,
>                            disp=DISP)
>         print "sm.GLM.fit", method, time.time() - t
>
>
> On Thu, Oct 5, 2017 at 10:32 AM, Sean Violante <sean.violante at gmail.com>
> wrote:
> > Stuart
> > have you tried glmnet ( in R) there is a python version
> > https://web.stanford.edu/~hastie/glmnet_python/ ....
> >
> >
> >
> >
> > On Thu, Oct 5, 2017 at 6:34 PM, Stuart Reynolds <
> stuart at stuartreynolds.net>
> > wrote:
> >>
> >> Thanks Josef. Was very useful.
> >>
> >> result.remove_data() reduces a 5-parameter Logit result object from
> >> megabytes to 5 KB (compared to a minimum uncompressed size of ~320
> >> bytes for the parameters). That's a big improvement. I'll experiment
> >> with what you suggest, since this is still >10x larger than necessary.
> >> I think the difference is mostly attribute names.
> >> I don't mind the lack of multinomial support. I've often had better
> >> results mixing independent models for each class.
> >>
> >> I'll experiment with the different solvers.  I tried the Logit model
> >> in the past -- its fit function only exposed a maxiter, and not a
> >> tolerance -- meaning I had to set maxiter very high. The newer
> >> statsmodels GLM module looks great and seems to solve this.
> >>
> >> For others who come this way, I think the magic for ridge regression is:
> >>
> >>         from statsmodels.genmod.generalized_linear_model import GLM
> >>         from statsmodels.genmod.generalized_linear_model import families
> >>         from statsmodels.genmod.families import links
> >>
> >>         model = GLM(y, Xtrain, family=families.Binomial(link=links.Logit))
> >>         result = model.fit_regularized(method='elastic_net',
> >>                                        alpha=l2weight, L1_wt=0.0, tol=...)
> >>         result.remove_data()
> >>         result.predict(Xtest)
> >>
> >> One last thing -- it's clear that it should be possible to do something
> >> like scikit's LogisticRegressionCV in order to quickly optimize a
> >> single parameter by re-using past coefficients.
> >> Are there any wrappers in statsmodels for doing this or should I roll my
> >> own?
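> >>
> >> If I end up rolling my own, a rough sketch (assuming, though I haven't
> >> verified it, that fit_regularized accepts start_params) would warm-start
> >> each fit from the previous solution, reusing the imports above:
> >>
> >>         alphas = [10.0, 3.0, 1.0, 0.3, 0.1]
> >>         start = None
> >>         path = []
> >>         for a in alphas:
> >>             model = GLM(y, Xtrain, family=families.Binomial(link=links.Logit))
> >>             result = model.fit_regularized(method='elastic_net', alpha=a,
> >>                                            L1_wt=0.0, start_params=start)
> >>             start = result.params  # reuse coefficients for the next alpha
> >>             path.append(result)
> >>
> >> Picking the best alpha would then be done on a held-out set.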
> >>
> >>
> >> - Stu
> >>
> >>
> >> On Wed, Oct 4, 2017 at 3:43 PM,  <josef.pktd at gmail.com> wrote:
> >> >
> >> >
> >> > On Wed, Oct 4, 2017 at 4:26 PM, Stuart Reynolds
> >> > <stuart at stuartreynolds.net>
> >> > wrote:
> >> >>
> >> >> Hi Andy,
> >> >> Thanks -- I'll give statsmodels another go.
> >> >> I remember I had some fitting speed issues with it in the past, and
> >> >> also some issues related to their models keeping references to the
> >> >> data (a disaster for serialization and multiprocessing) -- although
> >> >> that was a long time ago.
> >> >
> >> >
> >> > The second has not changed and will not change, but there is a
> >> > remove_data method that deletes all references to full, data-sized
> >> > arrays. However, once the data is removed, it is no longer possible
> >> > to compute any new result statistics, which are almost all computed
> >> > lazily.
> >> > The fitting speed depends a lot on the optimizer, the convergence
> >> > criteria, the difficulty of the problem, and the availability of good
> >> > starting parameters. Almost all nonlinear estimation problems use the
> >> > scipy optimizers; all unconstrained optimizers can be used. There are
> >> > no specially optimized methods for cases with a very large number of
> >> > features.
> >> >
> >> > Multinomial/multiclass models don't support a continuous response
> >> > (yet), but all other GLM and discrete models allow continuous data in
> >> > the interval extension of the domain.
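> >> >
> >> > For example, a minimal sketch with a fractional target (proportions
> >> > strictly between 0 and 1) should run without modification:
> >> >
> >> >     import numpy as np
> >> >     import statsmodels.api as sm
> >> >
> >> >     X = sm.add_constant(np.random.randn(1000, 3))
> >> >     p = 1 / (1 + np.exp(-X[:, 1]))  # target in (0, 1)
> >> >     res = sm.GLM(p, X, family=sm.families.Binomial()).fit()
> >> >     print(res.params)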
> >> >
> >> > Josef
> >> >
> >> >
> >> >>
> >> >> - Stuart
> >> >>
> >> >> On Wed, Oct 4, 2017 at 1:09 PM, Andreas Mueller <t3kcit at gmail.com>
> >> >> wrote:
> >> >> > Hi Stuart.
> >> >> > There is no interface to do this in scikit-learn (and maybe we
> >> >> > should add this to the FAQ).
> >> >> > Yes, in principle this would be possible with several of the models.
> >> >> >
> >> >> > I think statsmodels can do that, and I think I saw another glm
> >> >> > package
> >> >> > for Python that does that?
> >> >> >
> >> >> > It's certainly a legitimate use-case but would require substantial
> >> >> > changes to the code. I think so far we decided not to support
> >> >> > this in scikit-learn. Basically we don't have a concept of a link
> >> >> > function, and it's a concept that only applies to a subset of models.
> >> >> > We try to have a consistent interface for all our estimators, and
> >> >> > this doesn't really fit well within that interface.
> >> >> >
> >> >> > Hth,
> >> >> > Andy
> >> >> >
> >> >> >
> >> >> > On 10/04/2017 03:58 PM, Stuart Reynolds wrote:
> >> >> >>
> >> >> >> I'd like to fit a model that maps a matrix of continuous inputs to
> >> >> >> a target that's between 0 and 1 (a probability).
> >> >> >>
> >> >> >> In principle, I'd expect logistic regression to work out of the
> >> >> >> box with no modification (although it's often posed as strictly a
> >> >> >> classification method, its loss function allows fitting targets in
> >> >> >> the range 0 to 1, not just strictly zero or one).
> >> >> >>
> >> >> >> However, scikit's LogisticRegression and LogisticRegressionCV
> >> >> >> reject target arrays that are continuous. Other LR implementations
> >> >> >> allow a matrix of probability estimates. Looking at:
> >> >> >>
> >> >> >> http://scikit-learn-general.narkive.com/4dSCktaM/using-logistic-regression-on-a-continuous-target-variable
> >> >> >> and the fix here:
> >> >> >> https://github.com/scikit-learn/scikit-learn/pull/5084, which
> >> >> >> disables continuous inputs, it looks like there was some reason for
> >> >> >> this. So ... I'm looking for alternatives.
> >> >> >>
> >> >> >> SGDClassifier allows log loss and (if I understood the docs
> >> >> >> correctly) adds a logistic link function, but it also rejects
> >> >> >> continuous targets. Oddly, SGDRegressor only allows ‘squared_loss’,
> >> >> >> ‘huber’, ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’,
> >> >> >> and doesn't seem to offer a logistic loss.
> >> >> >>
> >> >> >> In principle, GLMs allow this, but scikit's docs say the GLM models
> >> >> >> only allow strictly linear functions of their inputs and don't
> >> >> >> allow a logistic link function. The docs direct people to the
> >> >> >> LogisticRegression class for this case.
> >> >> >>
> >> >> >> In R, there is:
> >> >> >>
> >> >> >> glm(Total_Service_Points_Won/Total_Service_Points_Played ~ ... ,
> >> >> >>     family = binomial(link = logit),
> >> >> >>     weights = Total_Service_Points_Played)
> >> >> >> which would be ideal.
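> >> >> >>
> >> >> >> (I believe the closest statsmodels analogue is a binomial GLM with
> >> >> >> a two-column (successes, failures) response -- a sketch only, with
> >> >> >> made-up variable names:
> >> >> >>
> >> >> >>     import numpy as np
> >> >> >>     import statsmodels.api as sm
> >> >> >>
> >> >> >>     endog = np.column_stack([points_won, points_played - points_won])
> >> >> >>     res = sm.GLM(endog, X, family=sm.families.Binomial()).fit()
> >> >> >>
> >> >> >> but I'd rather stay within scikit.)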
> >> >> >>
> >> >> >> Is something similar available in scikit? (Or any continuous model
> >> >> >> that takes a 0 to 1 target and outputs values in 0 to 1?)
> >> >> >>
> >> >> >> I was surprised to see that the implementation of
> >> >> >> CalibratedClassifierCV(method="sigmoid") uses an internal
> >> >> >> implementation of logistic regression to do its logistic
> >> >> >> regression -- which I can use, although I'd prefer a user-facing
> >> >> >> library.
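> >> >> >>
> >> >> >> If nothing else turns up, a bare-bones do-it-yourself sketch would
> >> >> >> be to minimize the log loss directly with scipy (it is well defined
> >> >> >> for fractional targets) -- illustrative only, and unregularized:
> >> >> >>
> >> >> >>     import numpy as np
> >> >> >>     from scipy.optimize import minimize
> >> >> >>     from scipy.special import expit
> >> >> >>
> >> >> >>     def nll(w, X, y):
> >> >> >>         p = np.clip(expit(X.dot(w)), 1e-12, 1 - 1e-12)
> >> >> >>         return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
> >> >> >>
> >> >> >>     w = minimize(nll, np.zeros(X.shape[1]), args=(X, y)).x
> >> >> >>     p_hat = expit(X.dot(w))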
> >> >> >>
> >> >> >> Thanks,
> >> >> >> - Stuart