[scikit-learn] Can fit a model with a target array of probabilities?
Stuart Reynolds
stuart at stuartreynolds.net
Wed Oct 18 14:37:51 EDT 2017
Good to know -- thank you.
On Fri, Oct 6, 2017 at 5:25 AM, <josef.pktd at gmail.com> wrote:
>
>
> On Thu, Oct 5, 2017 at 3:27 PM, <josef.pktd at gmail.com> wrote:
>>
>>
>>
>> On Thu, Oct 5, 2017 at 2:52 PM, Stuart Reynolds
>> <stuart at stuartreynolds.net> wrote:
>>>
>>> It turns out sm.Logit does allow setting the tolerance.
>>> Some quick-and-dirty time profiling of different methods on a 100k x 30
>>> feature dataset, with different solvers and losses:
>>>
>>> sklearn.LogisticRegression: l1 1.13864398003 (seconds)
>>> sklearn.LogisticRegression: l2 0.0538778305054
>>> sm.Logit l1 0.0922629833221 # Although didn't converge
>>> sm.Logit l1_cvxopt_cp 0.958268165588
>>> sm.Logit newton 0.133476018906
>>> sm.Logit nm 0.369864940643
>>> sm.Logit bfgs 0.105798006058
>>> sm.Logit lbfgs 0.06241106987
>>> sm.Logit powell 1.64219808578
>>> sm.Logit cg 0.2184278965
>>> sm.Logit ncg 0.216138124466
>>> sm.Logit basinhopping 8.82164621353
>>> sm.GLM.fit IRLS 0.544688940048
>>> sm.GLM L2: 1.29778695107
>>>
>>> I've been getting good results from sm.Logit.fit (although
>>> unregularized).
>>> statsmodels GLM seems a little slow; I'm not sure why.
>>>
>>> My benchmark may be a little apples-to-oranges, since the stopping
>>> criteria probably aren't comparable.
>>
>>
>> I think that's a problem with GLM IRLS.
>> AFAIK (though I never fully tested it), the objective function is
>> proportional to the number of observations, so the convergence
>> criterion effectively becomes tighter as nobs increases.
>> I can't find the issue or PR discussion anymore, but one of our
>> contributors fixed maxiter at 15 or so for IRLS with around 4 to 5
>> million observations and mostly categorical explanatory variables in
>> his application.
>>
>> Unfortunately there was no upfront design, and decisions differ across
>> models:
>> https://github.com/statsmodels/statsmodels/issues/2825
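>>
>> (The workaround was roughly along these lines -- a sketch, not the actual
>> code from that application:)
>>
>> from statsmodels.genmod.generalized_linear_model import GLM, families
>>
>> # Cap IRLS iterations instead of relying on the convergence criterion.
>> result = GLM(y, X, family=families.Binomial()).fit(maxiter=15)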
>
>
>
> Interesting timing exercise; I tried a bit more yesterday.
>
> GLM IRLS is not slow because of the convergence criterion; rather, it seems
> to take much longer when the design matrix is not well conditioned.
> The random dataset generated by sklearn has singular values in the range of
> 1e-14 or 1e-15.
> This doesn't affect the other estimators much, and lbfgs is almost always
> the fastest, with bfgs close behind.
>
> When I add some noise to the feature matrix so it's not so close to
> singular, IRLS is roughly in the same neighborhood as the faster scipy
> optimizers.
> With n_samples=1000000, n_features=50, Logit takes around 5 or 6 seconds
> (for lbfgs, bfgs and newton), slightly faster than sklearn's regularized
> LogisticRegression, but GLM is about 4 times slower at 17 to 20 seconds.
> GLM L2 is much slower in this case because of the current non-optimized
> implementation of coordinate descent.
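>
> (For reference, a quick way to see the conditioning and to de-condition the
> sklearn data -- the noise scale is arbitrary:)
>
> import numpy as np
>
> # smallest singular value of the design matrix
> print(np.linalg.svd(X, compute_uv=False).min())
> # add a little noise so the matrix is no longer nearly singular
> X_noisy = X + 0.1 * np.random.randn(*X.shape)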
>
> Aside: in master and the next release of statsmodels there is an interface
> to scipy.optimize.minimize, which allows all of the newer optimizers to be
> used; e.g. dogleg and the other new trust-region Newton methods will be
> better optimizers for many cases.
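>
> (A minimal sketch of how that might be called -- the 'minimize'/'min_method'
> keyword names here are my assumption about the new interface, not checked
> against a release:)
>
> from statsmodels.discrete.discrete_model import Logit
>
> # X, y as in the benchmark script above
> model = Logit(y, X)
> # 'method' and 'min_method' names assumed; passes through to
> # scipy.optimize.minimize
> result = model.fit(method='minimize', min_method='trust-ncg',
>                    maxiter=100, disp=0)
> print(result.params)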
>
> Josef
>
>
>>
>>
>> Josef
>>
>>
>>>
>>>
>>>
>>> For tiny models, which I'm also building: 100 samples, 5 features
>>>
>>> sklearn.LogisticRegression: l1 0.00137376785278
>>> sklearn.LogisticRegression: l2 0.00167894363403
>>> sm.Logit l1 0.0198900699615
>>> sm.Logit l1_cvxopt_cp 0.162448167801
>>> sm.Logit newton 0.00689911842346
>>> sm.Logit nm 0.0754928588867
>>> sm.Logit bfgs 0.0210938453674
>>> sm.Logit lbfgs 0.0156588554382
>>> sm.Logit powell 0.0161390304565
>>> sm.Logit cg 0.00759506225586
>>> sm.Logit ncg 0.00541186332703
>>> sm.Logit basinhopping 0.3076171875
>>> sm.GLM.fit IRLS 0.00902199745178
>>> sm.GLM L2: 0.0208361148834
>>>
>>> I couldn't get sm.GLM.fit to work with the non-"IRLS" solvers (it hits a
>>> division by zero).
>>>
>>>
>>> ----
>>>
>>> import time
>>>
>>> import sklearn.datasets
>>> from sklearn.preprocessing import StandardScaler
>>>
>>> X, y = sklearn.datasets.make_classification(n_samples=10000, n_features=30,
>>>                                             random_state=123)
>>> X = StandardScaler(copy=True, with_mean=True, with_std=True).fit_transform(X)
>>>
>>> tol = 0.0001
>>> maxiter = 100
>>> DISP = 0
>>>
>>>
>>> if 1:  # sklearn.LogisticRegression
>>>     from sklearn.linear_model import LogisticRegression
>>>
>>>     for method in ["l1", "l2"]:  # TODO: add solvers
>>>         t = time.time()
>>>         model = LogisticRegression(C=1, tol=tol, max_iter=maxiter,
>>>                                    penalty=method)
>>>         model.fit(X, y)
>>>         print "sklearn.LogisticRegression:", method, time.time() - t
>>>
>>>
>>> if 1:  # sm.Logit.fit_regularized
>>>     from statsmodels.discrete.discrete_model import Logit
>>>
>>>     for method in ["l1", "l1_cvxopt_cp"]:
>>>         t = time.time()
>>>         model = Logit(y, X)
>>>         result = model.fit_regularized(method=method, maxiter=maxiter,
>>>                                        alpha=1.,
>>>                                        abstol=tol,
>>>                                        acc=tol,
>>>                                        tol=tol, gtol=tol, pgtol=tol,
>>>                                        disp=DISP)
>>>         print "sm.Logit", method, time.time() - t
>>>
>>>
>>> if 1:  # sm.Logit.fit
>>>     from statsmodels.discrete.discrete_model import Logit
>>>
>>>     SOLVERS = ["newton", "nm", "bfgs", "lbfgs", "powell", "cg", "ncg",
>>>                "basinhopping"]
>>>     for method in SOLVERS:
>>>         t = time.time()
>>>         model = Logit(y, X)
>>>         result = model.fit(method=method, maxiter=maxiter,
>>>                            niter=maxiter,
>>>                            ftol=tol,
>>>                            # Hmmm.. needs to be reviewed:
>>>                            tol=tol, gtol=tol, pgtol=tol,
>>>                            disp=DISP)
>>>         print "sm.Logit", method, time.time() - t
>>>
>>>
>>> if 1:  # sm.GLM.fit
>>>     from statsmodels.genmod.generalized_linear_model import GLM
>>>     from statsmodels.genmod.generalized_linear_model import families
>>>
>>>     for method in ["IRLS"]:
>>>         t = time.time()
>>>         model = GLM(y, X,
>>>                     family=families.Binomial(link=families.links.logit))
>>>         result = model.fit(method=method, cnvrg_tol=tol,
>>>                            maxiter=maxiter, full_output=False, disp=DISP)
>>>         print "sm.GLM.fit", method, time.time() - t
>>>
>>>
>>> if 1:  # sm.GLM.fit_regularized (L2 via elastic_net with L1_wt=0)
>>>     from statsmodels.genmod.generalized_linear_model import GLM
>>>     from statsmodels.genmod.generalized_linear_model import families
>>>
>>>     t = time.time()
>>>     model = GLM(y, X,
>>>                 family=families.Binomial(link=families.links.logit))
>>>     result = model.fit_regularized(method='elastic_net', alpha=1.0,
>>>                                    L1_wt=0.0, cnvrg_tol=tol, maxiter=maxiter)
>>>     print "sm.GLM L2:", time.time() - t
>>>
>>>
>>> if 0:  # sm.GLM.fit with the scipy solvers -- hits a division by zero.
>>>     from statsmodels.genmod.generalized_linear_model import GLM
>>>     from statsmodels.genmod.generalized_linear_model import families
>>>
>>>     SOLVERS = ["bfgs", "lbfgs", "newton", "nm", "powell", "cg", "ncg",
>>>                "basinhopping"]
>>>     for method in SOLVERS:
>>>         t = time.time()
>>>         model = GLM(y, X,
>>>                     family=families.Binomial(link=families.links.logit))
>>>         result = model.fit(method=method,
>>>                            # scale="X2",
>>>                            # alpha=1.,
>>>                            # abstol=tol,
>>>                            # acc=tol,
>>>                            # tol=tol, gtol=tol, pgtol=tol,
>>>                            # maxiter=maxiter,
>>>                            # full_output=False,
>>>                            disp=DISP)
>>>         print "sm.GLM.fit", method, time.time() - t
>>>
>>>
>>> On Thu, Oct 5, 2017 at 10:32 AM, Sean Violante <sean.violante at gmail.com>
>>> wrote:
>>> > Stuart,
>>> > have you tried glmnet (in R)? There is a Python version:
>>> > https://web.stanford.edu/~hastie/glmnet_python/ ....
>>> >
>>> >
>>> >
>>> >
>>> > On Thu, Oct 5, 2017 at 6:34 PM, Stuart Reynolds
>>> > <stuart at stuartreynolds.net>
>>> > wrote:
>>> >>
>>> >> Thanks Josef. That was very useful.
>>> >>
>>> >> result.remove_data() reduces a 5-parameter Logit result object from
>>> >> megabytes to 5 KB (compared to a minimum uncompressed size of ~320
>>> >> bytes for the parameters). That's a big improvement. I'll experiment
>>> >> with what you suggest, since this is still >10x larger than possible;
>>> >> I think the difference is mostly attribute names.
>>> >> I don't mind the lack of multinomial support. I've often had better
>>> >> results mixing independent models for each class.
>>> >>
>>> >> I'll experiment with the different solvers. I tried the Logit model
>>> >> in the past -- its fit function only exposed a maxiter and not a
>>> >> tolerance, meaning I had to set maxiter very high. The newer
>>> >> statsmodels GLM module looks great and seems to solve this.
>>> >>
>>> >> For others who come this way, I think the magic for ridge regression
>>> >> is:
>>> >>
>>> >> from statsmodels.genmod.generalized_linear_model import GLM
>>> >> from statsmodels.genmod.generalized_linear_model import families
>>> >> from statsmodels.genmod.families import links
>>> >>
>>> >> model = GLM(y, Xtrain,
>>> >>             family=families.Binomial(link=links.Logit))
>>> >> result = model.fit_regularized(method='elastic_net',
>>> >>                                alpha=l2weight, L1_wt=0.0, tol=...)
>>> >> result.remove_data()
>>> >> result.predict(Xtest)
>>> >>
>>> >> One last thing -- it's clear that it should be possible to do
>>> >> something like scikit's LogisticRegressionCV to quickly optimize a
>>> >> single regularization parameter by re-using past coefficients.
>>> >> Are there any wrappers in statsmodels for doing this, or should I
>>> >> roll my own?
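>>> >>
>>> >> (Rolling my own would presumably look something like this, using
>>> >> scikit-learn's warm_start to re-use coefficients along a regularization
>>> >> path -- just a sketch, and the C grid is only illustrative:)
>>> >>
>>> >> from sklearn.linear_model import LogisticRegression
>>> >>
>>> >> # Each fit starts from the previous coefficients (warm_start=True).
>>> >> path_model = LogisticRegression(penalty="l2", solver="lbfgs",
>>> >>                                 warm_start=True, tol=0.0001,
>>> >>                                 max_iter=100)
>>> >> for C in [0.01, 0.1, 1.0, 10.0]:
>>> >>     path_model.C = C
>>> >>     path_model.fit(Xtrain, y)
>>> >>     print("C=%g, train score=%g" % (C, path_model.score(Xtrain, y)))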
>>> >>
>>> >>
>>> >> - Stu
>>> >>
>>> >>
>>> >> On Wed, Oct 4, 2017 at 3:43 PM, <josef.pktd at gmail.com> wrote:
>>> >> >
>>> >> >
>>> >> > On Wed, Oct 4, 2017 at 4:26 PM, Stuart Reynolds
>>> >> > <stuart at stuartreynolds.net>
>>> >> > wrote:
>>> >> >>
>>> >> >> Hi Andy,
>>> >> >> Thanks -- I'll give statsmodels another go.
>>> >> >> I remember I had some fitting speed issues with it in the past, and
>>> >> >> also some issues related to their models keeping references to the
>>> >> >> data (a disaster for serialization and multiprocessing) -- although
>>> >> >> that was a long time ago.
>>> >> >
>>> >> >
>>> >> > The second has not changed and will not change, but there is a
>>> >> > remove_data method that deletes all references to full, data-sized
>>> >> > arrays. However, once the data is removed, it is no longer possible
>>> >> > to compute any new result statistics, which are almost all lazily
>>> >> > computed.
>>> >> > The fitting speed depends a lot on the optimizer, the convergence
>>> >> > criteria, the difficulty of the problem, and the availability of
>>> >> > good starting parameters. Almost all nonlinear estimation problems
>>> >> > use the scipy optimizers; all of the unconstrained ones can be used.
>>> >> > There are no optimized special methods for cases with a very large
>>> >> > number of features.
>>> >> >
>>> >> > Multinomial/multiclass models don't support a continuous response
>>> >> > (yet); all other GLM and discrete models allow continuous data in
>>> >> > the interval extension of the domain.
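>>> >> >
>>> >> > (A minimal sketch of that with a continuous 0-1 target and the
>>> >> > Binomial family, using the usual statsmodels.api entry points; the
>>> >> > random data is only for illustration:)
>>> >> >
>>> >> > import numpy as np
>>> >> > import statsmodels.api as sm
>>> >> >
>>> >> > # target is a probability in (0, 1), not a 0/1 label
>>> >> > y_prob = np.clip(np.random.rand(200), 0.01, 0.99)
>>> >> > X = sm.add_constant(np.random.randn(200, 3))
>>> >> > result = sm.GLM(y_prob, X, family=sm.families.Binomial()).fit()
>>> >> > print(result.params)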
>>> >> >
>>> >> > Josef
>>> >> >
>>> >> >
>>> >> >>
>>> >> >> - Stuart
>>> >> >>
>>> >> >> On Wed, Oct 4, 2017 at 1:09 PM, Andreas Mueller <t3kcit at gmail.com>
>>> >> >> wrote:
>>> >> >> > Hi Stuart,
>>> >> >> > There is no interface to do this in scikit-learn (and maybe we
>>> >> >> > should add this to the FAQ).
>>> >> >> > Yes, in principle this would be possible with several of the
>>> >> >> > models.
>>> >> >> >
>>> >> >> > I think statsmodels can do that, and I think I saw another glm
>>> >> >> > package
>>> >> >> > for Python that does that?
>>> >> >> >
>>> >> >> > It's certainly a legitimate use-case but would require
>>> >> >> > substantial
>>> >> >> > changes to the code. I think so far we decided not to support
>>> >> >> > this in scikit-learn. Basically we don't have a concept of a link
>>> >> >> > function, and it's a concept that only applies to a subset of
>>> >> >> > models.
>>> >> >> > We try to have a consistent interface for all our estimators, and
>>> >> >> > this doesn't really fit well within that interface.
>>> >> >> >
>>> >> >> > Hth,
>>> >> >> > Andy
>>> >> >> >
>>> >> >> >
>>> >> >> > On 10/04/2017 03:58 PM, Stuart Reynolds wrote:
>>> >> >> >>
>>> >> >> >> I'd like to fit a model that maps a matrix of continuous inputs
>>> >> >> >> to a target that's between 0 and 1 (a probability).
>>> >> >> >>
>>> >> >> >> In principle, I'd expect logistic regression to work out of the
>>> >> >> >> box with no modification (although it's often posed as being
>>> >> >> >> strictly for classification, its loss function allows fitting
>>> >> >> >> targets anywhere in the range 0 to 1, not just strictly zero or
>>> >> >> >> one).
>>> >> >> >>
>>> >> >> >> However, scikit's LogisticRegression and LogisticRegressionCV
>>> >> >> >> reject target arrays that are continuous. Other LR
>>> >> >> >> implementations allow a matrix of probability estimates.
>>> >> >> >> Looking at:
>>> >> >> >> http://scikit-learn-general.narkive.com/4dSCktaM/using-logistic-regression-on-a-continuous-target-variable
>>> >> >> >> and the fix here:
>>> >> >> >> https://github.com/scikit-learn/scikit-learn/pull/5084, which
>>> >> >> >> disables continuous inputs, it looks like there was some reason
>>> >> >> >> for this. So ... I'm looking for alternatives.
>>> >> >> >>
>>> >> >> >> SGDClassifier allows log loss and (if I understood the docs
>>> >> >> >> correctly) adds a logistic link function, but it also rejects
>>> >> >> >> continuous targets. Oddly, SGDRegressor only allows
>>> >> >> >> 'squared_loss', 'huber', 'epsilon_insensitive', or
>>> >> >> >> 'squared_epsilon_insensitive', and doesn't seem to offer a
>>> >> >> >> logistic loss.
>>> >> >> >>
>>> >> >> >> In principle, a GLM allows this, but scikit's docs say its GLM
>>> >> >> >> models only allow strictly linear functions of their input and
>>> >> >> >> don't allow a logistic link function. The docs direct people to
>>> >> >> >> the LogisticRegression class for this case.
>>> >> >> >>
>>> >> >> >> In R, there is:
>>> >> >> >>
>>> >> >> >> glm(Total_Service_Points_Won/Total_Service_Points_Played ~ ...,
>>> >> >> >>     family = binomial(link = logit),
>>> >> >> >>     weights = Total_Service_Points_Played)
>>> >> >> >>
>>> >> >> >> which would be ideal.
>>> >> >> >>
>>> >> >> >> Is something similar available in scikit? (Or any continuous
>>> >> >> >> model that takes a 0 to 1 target and outputs a 0 to 1
>>> >> >> >> prediction?)
>>> >> >> >>
>>> >> >> >> I was surprised to see that the implementation of
>>> >> >> >> CalibratedClassifierCV(method="sigmoid") uses an internal
>>> >> >> >> implementation of logistic regression to do its logistic
>>> >> >> >> regression -- which I could use, although I'd prefer a
>>> >> >> >> user-facing library.
>>> >> >> >>
>>> >> >> >> Thanks,
>>> >> >> >> - Stuart