[scikit-learn] Can fit a model with a target array of probabilities?

josef.pktd at gmail.com josef.pktd at gmail.com
Fri Oct 6 08:25:25 EDT 2017


On Thu, Oct 5, 2017 at 3:27 PM, <josef.pktd at gmail.com> wrote:

>
>
> On Thu, Oct 5, 2017 at 2:52 PM, Stuart Reynolds <stuart at stuartreynolds.net
> > wrote:
>
>> Turns out sm.Logit does allow setting the tolerance.
>> Some quick and dirty time profiling of different methods on a 100k
>> * 30 features dataset, with different solvers and losses:
>>
>> sklearn.LogisticRegression: l1 1.13864398003 (seconds)
>> sklearn.LogisticRegression: l2 0.0538778305054
>> sm.Logit l1 0.0922629833221  # Although didn't converge
>> sm.Logit l1_cvxopt_cp 0.958268165588
>> sm.Logit newton 0.133476018906
>> sm.Logit nm 0.369864940643
>> sm.Logit bfgs 0.105798006058
>> sm.Logit lbfgs 0.06241106987
>> sm.Logit powell 1.64219808578
>> sm.Logit cg 0.2184278965
>> sm.Logit ncg 0.216138124466
>> sm.Logit basinhopping 8.82164621353
>> sm.GLM.fit IRLS 0.544688940048
>> sm.GLM L2: 1.29778695107
>>
>> I've been getting good results from sm.Logit.fit (although unregularized).
>> statsmodels GLM seems a little slow. Not sure why.
>>
>> My benchmark may be a little apples-to-oranges, since the stopping
>> criteria probably aren't comparable.
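(Editorial sketch, not part of the original message: one quick comparability check is to compare the fitted solutions themselves. It assumes sk_model is the fitted sklearn LogisticRegression and sm_result the fitted sm.Logit result from the script further down; the penalties and intercept handling differ, so only gross disagreement would indicate a solver stopped early.)

    # Sketch: compare total log-likelihoods so the timings refer to solutions
    # of roughly similar quality.
    from sklearn.metrics import log_loss

    # sk_model: fitted sklearn LogisticRegression; sm_result: fitted sm.Logit result
    sk_loglike = -log_loss(y, sk_model.predict_proba(X)[:, 1], normalize=False)
    print("sklearn total log-likelihood: %g" % sk_loglike)
    print("sm.Logit log-likelihood:      %g" % sm_result.llf)
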
>>
>
> I think that's a problem with GLM IRLS.
> AFAIK (though I never fully tested it), the objective function is
> proportional to the number of observations, so the convergence
> criterion effectively becomes tighter as nobs increases.
> I can't find the issue or PR discussion anymore, but one of our
> contributors fixed maxiter at 15 or so for IRLS, with around
> 4 to 5 million observations and mostly categorical explanatory
> variables in his application.
>
> Unfortunately there was no upfront design, and decisions differ across models:
> https://github.com/statsmodels/statsmodels/issues/2825
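(Editorial sketch illustrating the quoted point, not from the original thread: if the IRLS stopping rule is an absolute change in the deviance, and the deviance is a sum over observations, then scaling the tolerance by nobs keeps the criterion roughly comparable across sample sizes. y and X are the benchmark arrays from the script below; the per-observation tolerance is an arbitrary choice, not a documented recommendation.)

    # Sketch: scale the absolute deviance tolerance with the number of observations
    # so IRLS stops at roughly the same per-observation accuracy for any nobs.
    from statsmodels.genmod.generalized_linear_model import GLM, families

    tol_per_obs = 1e-8
    model = GLM(y, X, family=families.Binomial())
    result = model.fit(method="IRLS", cnvrg_tol=tol_per_obs * len(y), maxiter=100)
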
>


Interesting timing exercise; I tried a bit more yesterday.

GLM IRLS is not slow because of the convergence criterion; it seems to take
much longer when the design matrix is not well conditioned.
The random dataset generated by sklearn has singular values on the order of
1e-14 or 1e-15. This doesn't affect the other estimators much, and lbfgs is
almost always the fastest, with bfgs close behind.

When I add some noise to the feature matrix so it's not so close to
singular, IRLS is roughly in the same neighborhood as the faster scipy
optimizers. With n_samples=1000000, n_features=50, Logit takes around 5 or 6
seconds (for lbfgs, bfgs and newton), slightly faster than sklearn's
regularized LogisticRegression, while GLM takes about 4 times longer, 17 to
20 seconds. GLM L2 is much slower in this case because of the current
non-optimized implementation of coordinate descent.
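(Editorial sketch of the conditioning check described above; the noise scale and looking at the three smallest singular values are arbitrary illustration choices.)

    # Sketch: inspect the conditioning of the make_classification design matrix
    # and add a little noise so it is no longer numerically near-singular.
    import numpy as np
    import sklearn.datasets

    X, y = sklearn.datasets.make_classification(n_samples=100000, n_features=30,
                                                 random_state=123)
    sv = np.linalg.svd(X, compute_uv=False)
    print("smallest singular values: %s" % sv[-3:])      # near zero for this data

    X_noisy = X + 1e-3 * np.random.randn(*X.shape)       # break the near-singularity
    print("after noise: %s" % np.linalg.svd(X_noisy, compute_uv=False)[-3:])
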

Aside: in master and the next release of statsmodels there is an interface to
scipy.minimize, which makes all the newer optimizers available; e.g. dogleg
and the other new trust-region Newton methods will be better optimizers for
many cases.
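(A hedged sketch of what that interface could look like, based on my reading of statsmodels master; the keyword names method="minimize" and min_method are assumptions and may differ in the released version. y and X are the benchmark arrays from the script below.)

    # Sketch: route Logit.fit through the generic scipy.optimize.minimize wrapper
    # so newer solvers such as trust-region Newton methods can be used.
    from statsmodels.discrete.discrete_model import Logit

    model = Logit(y, X)
    # trust-ncg uses the analytic gradient and Hessian the Logit model provides
    result = model.fit(method="minimize", min_method="trust-ncg",
                       maxiter=100, disp=0)
    print(result.params)
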

Josef



>
> Josef
>
>
>
>>
>>
>> For tiny models, which I'm also building: 100 samples, 5 features
>>
>> sklearn.LogisticRegression: l1 0.00137376785278
>> sklearn.LogisticRegression: l2 0.00167894363403
>> sm.Logit l1 0.0198900699615
>> sm.Logit l1_cvxopt_cp 0.162448167801
>> sm.Logit newton 0.00689911842346
>> sm.Logit nm 0.0754928588867
>> sm.Logit bfgs 0.0210938453674
>> sm.Logit lbfgs 0.0156588554382
>> sm.Logit powell 0.0161390304565
>> sm.Logit cg 0.00759506225586
>> sm.Logit ncg 0.00541186332703
>> sm.Logit basinhopping 0.3076171875
>> sm.GLM.fit IRLS 0.00902199745178
>> sm.GLM L2: 0.0208361148834
>>
>> I couldn't get sm.GLM.fit to work with non-"IRLS" solvers (it hits a
>> division by zero).
>
>
>> ----
>>
>> import sklearn.datasets
>> from sklearn.preprocessing import StandardScaler
>> X, y = sklearn.datasets.make_classification(n_samples=10000,
>> n_features=30, random_state=123)
>> X = StandardScaler(copy=True, with_mean=True,
>> with_std=True).fit_transform(X)
>>
>> import time
>> tol = 0.0001
>> maxiter = 100
>> DISP = 0
>>
>>
>> if 1: # sk.LogisticRegression
>>     import sklearn
>>     from sklearn.linear_model import LogisticRegression
>>
>>     for method in ["l1", "l2"]: # TODO, add solvers:
>>         t = time.time()
>>         model = LogisticRegression(C=1, tol=tol, max_iter=maxiter,
>> penalty=method)
>>         model.fit(X,y)
>>         print "sklearn.LogisticRegression:", method, time.time() - t
>>
>>
>>
>>
>> if 1: # sm.Logit.fit_regularized
>>     from statsmodels.discrete.discrete_model import Logit
>>     for method in ["l1", "l1_cvxopt_cp"]:
>>         t = time.time()
>>         model = Logit(y,X)
>>         result = model.fit_regularized(method=method, maxiter=maxiter,
>>                                        alpha=1.,
>>                                        abstol=tol,
>>                                        acc=tol,
>>                                        tol=tol, gtol=tol, pgtol=tol,
>>                                        disp=DISP)
>>         print "sm.Logit", method, time.time() - t
>>
>> if 1: # sm.Logit.fit
>>     from statsmodels.discrete.discrete_model import Logit
>>
>>     SOLVERS = ["newton", "nm",
>> "bfgs","lbfgs","powell","cg","ncg","basinhopping",]
>>     for method in SOLVERS:
>>         t = time.time()
>>         model = Logit(y,X)
>>         result = model.fit(method=method, maxiter=maxiter,
>>                            niter=maxiter,
>>                            ftol=tol,
>>                            tol=tol, gtol=tol, pgtol=tol,  # Hmmm.. needs to be reviewed
>>                            disp=DISP)
>>         print "sm.Logit", method, time.time() - t
>>
>> if 1: # sm.GLM.fit
>>     from statsmodels.genmod.generalized_linear_model import GLM
>>     from statsmodels.genmod.generalized_linear_model import families
>>     for method in ["IRLS"]:
>>         t = time.time()
>>         model = GLM(y, X, family=families.Binomial(link=
>> families.links.logit))
>>         result = model.fit(method=method, cnvrg_tol=tol,
>> maxiter=maxiter, full_output=False, disp=DISP)
>>         print "sm.GLM.fit", method, time.time() - t
>>
>>
>> if 1: # GLM.fit_regularized
>>     from statsmodels.genmod.generalized_linear_model import GLM
>>     from statsmodels.genmod.generalized_linear_model import families
>>     t = time.time()
>>     model = GLM(y, X, family=families.Binomial(link=
>> families.links.logit))
>>     result = model.fit_regularized(method='elastic_net', alpha=1.0,
>> L1_wt=0.0, cnvrg_tol=tol, maxiter=maxiter)
>>     print "sm.GLM L2:", time.time() - t
>>
>>
>>
>> if 0: # GLM.fit
>>     # Hits division by zero.
>>     SOLVERS = ["bfgs","lbfgs", "netwon", "nm",
>> "powell","cg","ncg","basinhopping",]
>>     from statsmodels.genmod.generalized_linear_model import GLM
>>     from statsmodels.genmod.generalized_linear_model import families
>>     for method in SOLVERS:
>>         t = time.time()
>>         model = GLM(y, X, family=families.Binomial(link=
>> families.links.logit))
>>         result = model.fit(method=method,
>> #                           scale="X2",
>> #                            alpha=1.,
>> #                            abstol=tol,
>> #                            acc=tol,
>> #                            tol=tol, gtol=tol, pgtol=tol,
>> #                            maxiter=maxiter,
>> #                            #full_output=False,
>>                            disp=DISP)
>>         print "sm.GLM.fit", method, time.time() - t
>>
>>
>> On Thu, Oct 5, 2017 at 10:32 AM, Sean Violante <sean.violante at gmail.com>
>> wrote:
>> > Stuart,
>> > have you tried glmnet (in R)? There is a python version:
>> > https://web.stanford.edu/~hastie/glmnet_python/ ....
>> >
>> >
>> >
>> >
>> > On Thu, Oct 5, 2017 at 6:34 PM, Stuart Reynolds <
>> stuart at stuartreynolds.net>
>> > wrote:
>> >>
>> >> Thanks Josef. Was very useful.
>> >>
>> >> result.remove_data() reduces a 5-parameter Logit result object from
>> >> megabytes to 5Kb (compared to a minimum uncompressed size of the
>> >> parameters of ~320 bytes). That's a big improvement. I'll experiment with
>> >> what you suggest -- since this is still >10x larger than necessary. I
>> >> think the difference is mostly attribute names.
>> >> I don't mind the lack of multinomial support. I've often had better
>> >> results mixing independent models for each class.
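(Editorial sketch of the bare-minimum alternative hinted at above, not from the original message: persist only the coefficient array and rebuild the prediction from the logistic link by hand. The file name is made up, and this drops all of the results statistics; result and Xtest are the objects from the snippet below.)

    # Sketch: store only the fitted parameters and predict via the logistic link.
    import numpy as np

    np.save("logit_params.npy", result.params)       # a few hundred bytes

    params = np.load("logit_params.npy")
    # matches result.predict(Xtest) for a plain logit link with no offset
    proba = 1.0 / (1.0 + np.exp(-Xtest.dot(params)))
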
>> >>
>> >> I'll experiment with the different solvers.  I tried the Logit model
>> >> in the past -- its fit function only exposed a maxiter, and not a
>> >> tolerance -- meaning I had to set maxiter very high. The newer
>> >> statsmodels GLM module looks great and seems to solve this.
>> >>
>> >> For others who come this way, I think the magic for ridge regression is:
>> >>
>> >>         from statsmodels.genmod.generalized_linear_model import GLM
>> >>         from statsmodels.genmod.generalized_linear_model import families
>> >>         from statsmodels.genmod.families import links
>> >>
>> >>         model = GLM(y, Xtrain,
>> >>                     family=families.Binomial(link=links.logit))
>> >>         result = model.fit_regularized(method='elastic_net',
>> >>                                        alpha=l2weight, L1_wt=0.0,
>> >>                                        cnvrg_tol=...)
>> >>         result.remove_data()
>> >>         result.predict(Xtest)
>> >>
>> >> One last thing -- it's clear that it should be possible to do something
>> >> like scikit's LogisticRegressionCV in order to quickly optimize a
>> >> single regularization parameter by re-using past coefficients.
>> >> Are there any wrappers in statsmodels for doing this or should I roll
>> >> my own?
>> >>
>> >>
>> >> - Stu
>> >>
>> >>
>> >> On Wed, Oct 4, 2017 at 3:43 PM,  <josef.pktd at gmail.com> wrote:
>> >> >
>> >> >
>> >> > On Wed, Oct 4, 2017 at 4:26 PM, Stuart Reynolds
>> >> > <stuart at stuartreynolds.net>
>> >> > wrote:
>> >> >>
>> >> >> Hi Andy,
>> >> >> Thanks -- I'll give another statsmodels another go.
>> >> >> I remember I had some fitting speed issues with it in the past, and
>> >> >> also some issues related their models keeping references to the data
>> >> >> (=disaster for serialization and multiprocessing) -- although that
>> was
>> >> >> a long time ago.
>> >> >
>> >> >
>> >> > The second has not changed and will not change, but there is a
>> >> > remove_data method that deletes all references to full, data-sized
>> >> > arrays. However, once the data is removed, it is no longer possible
>> >> > to compute any new result statistics, which are almost all lazily
>> >> > computed.
>> >> > The fitting speed depends a lot on the optimizer, the convergence
>> >> > criteria, the difficulty of the problem, and the availability of good
>> >> > starting parameters. Almost all nonlinear estimation problems use the
>> >> > scipy optimizers; all unconstrained optimizers can be used. There are
>> >> > no optimized special methods for cases with a very large number of
>> >> > features.
>> >> >
>> >> > Multinomial/multiclass models don't support a continuous response
>> >> > (yet); all other GLM and discrete models allow for continuous data
>> >> > in the interval extension of the domain.
>> >> >
>> >> > Josef
>> >> >
>> >> >
>> >> >>
>> >> >> - Stuart
>> >> >>
>> >> >> On Wed, Oct 4, 2017 at 1:09 PM, Andreas Mueller <t3kcit at gmail.com>
>> >> >> wrote:
>> >> >> > Hi Stuart.
>> >> >> > There is no interface to do this in scikit-learn (and maybe we
>> >> >> > should add this to the FAQ).
>> >> >> > Yes, in principle this would be possible with several of the
>> models.
>> >> >> >
>> >> >> > I think statsmodels can do that, and I think I saw another glm
>> >> >> > package
>> >> >> > for Python that does that?
>> >> >> >
>> >> >> > It's certainly a legitimate use-case but would require substantial
>> >> >> > changes to the code. I think so far we decided not to support
>> >> >> > this in scikit-learn. Basically we don't have a concept of a link
>> >> >> > function, and it's a concept that only applies to a subset of
>> models.
>> >> >> > We try to have a consistent interface for all our estimators, and
>> >> >> > this doesn't really fit well within that interface.
>> >> >> >
>> >> >> > Hth,
>> >> >> > Andy
>> >> >> >
>> >> >> >
>> >> >> > On 10/04/2017 03:58 PM, Stuart Reynolds wrote:
>> >> >> >>
>> >> >> >> I'd like to fit a model that maps a matrix of continuous inputs
>> to a
>> >> >> >> target that's between 0 and 1 (a probability).
>> >> >> >>
>> >> >> >> In principle, I'd expect logistic regression should work out of
>> the
>> >> >> >> box with no modification (although it's often posed as being strictly
>> >> >> >> for classification, its loss function allows for fitting targets
>> in
>> >> >> >> the range 0 to 1, and not strictly zero or one.)
>> >> >> >>
>> >> >> >> However, scikit's LogisticRegression and LogisticRegressionCV
>> reject
>> >> >> >> target arrays that are continuous. Other LR implementations
>> allow a
>> >> >> >> matrix of probability estimates. Looking at:
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> http://scikit-learn-general.narkive.com/4dSCktaM/using-logistic-regression-on-a-continuous-target-variable
>> >> >> >> and the fix here:
>> >> >> >> https://github.com/scikit-learn/scikit-learn/pull/5084, which
>> >> >> >> disables
>> >> >> >> continuous inputs, it looks like there was some reason for this.
>> So
>> >> >> >> ... I'm looking for alternatives.
>> >> >> >>
>> >> >> >> SGDClassifier allows log loss and (if I understood the docs
>> >> >> >> correctly)
>> >> >> >> adds a logistic link function, but also rejects continuous
>> targets.
>> >> >> >> Oddly, SGDRegressor only allows ‘squared_loss’, ‘huber’,
>> >> >> >> ‘epsilon_insensitive’, or ‘squared_epsilon_insensitive’, and
>> >> >> >> doesn't seem to offer a logistic function.
>> >> >> >>
>> >> >> >> In principle, GLMs allow this, but scikit's docs say its GLM models
>> >> >> >> only allow strictly linear functions of their input, and don't allow
>> >> >> >> a logistic link function. The docs direct people to the
>> >> >> >> LogisticRegression class for this case.
>> >> >> >>
>> >> >> >> In R, there is:
>> >> >> >>
>> >> >> >> glm(Total_Service_Points_Won/Total_Service_Points_Played ~ ... ,
>> >> >> >>     family = binomial(link=logit),
>> >> >> >>     weights = Total_Service_Points_Played)
>> >> >> >>
>> >> >> >> which would be ideal.
>> >> >> >>
>> >> >> >> Is something similar available in scikit? (Or any continuous model
>> >> >> >> that takes a 0 to 1 target and outputs a 0 to 1 target?)
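(Editorial note, not part of the original question: the closest analogue I know of to that R call is statsmodels' binomial GLM with a two-column (successes, failures) response; a sketch with made-up variable names follows.)

    # Sketch: statsmodels analogue of glm(won/played ~ ..., family=binomial,
    # weights=played) in R, using a (successes, failures) response.
    import numpy as np
    import statsmodels.api as sm

    # won, played: arrays of counts per row of the design matrix X (placeholders)
    endog = np.column_stack([won, played - won])
    res = sm.GLM(endog, sm.add_constant(X), family=sm.families.Binomial()).fit()
    print(res.params)
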
>> >> >> >>
>> >> >> >> I was surprised to see that the implementation of
>> >> >> >> CalibratedClassifierCV(method="sigmoid") uses an internal
>> >> >> >> implementation of logistic regression to do its logistic regression --
>> >> >> >> which I can use, although I'd prefer to use a user-facing library.
>> >> >> >>
>> >> >> >> Thanks,
>> >> >> >> - Stuart
>> >> >> >> _______________________________________________
>> >> >> >> scikit-learn mailing list
>> >> >> >> scikit-learn at python.org
>> >> >> >> https://mail.python.org/mailman/listinfo/scikit-learn
>> >> >> >
>> >> >> >
>> >> >> > _______________________________________________
>> >> >> > scikit-learn mailing list
>> >> >> > scikit-learn at python.org
>> >> >> > https://mail.python.org/mailman/listinfo/scikit-learn
>> >> >> _______________________________________________
>> >> >> scikit-learn mailing list
>> >> >> scikit-learn at python.org
>> >> >> https://mail.python.org/mailman/listinfo/scikit-learn
>> >> >
>> >> >
>> >> >
>> >> > _______________________________________________
>> >> > scikit-learn mailing list
>> >> > scikit-learn at python.org
>> >> > https://mail.python.org/mailman/listinfo/scikit-learn
>> >> >
>> >> _______________________________________________
>> >> scikit-learn mailing list
>> >> scikit-learn at python.org
>> >> https://mail.python.org/mailman/listinfo/scikit-learn
>> >
>> >
>> >
>> > _______________________________________________
>> > scikit-learn mailing list
>> > scikit-learn at python.org
>> > https://mail.python.org/mailman/listinfo/scikit-learn
>> >
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>
>

