From pahome.chen at mirlab.org  Thu Jan 3 22:44:44 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Fri, 4 Jan 2019 11:44:44 +0800
Subject: [scikit-learn] How GridSearchCV to get best_params?

As the title says.

In the doc it says:

best_params_ : dict
    Parameter setting that gave the best results on the hold out data.

My question is: what is the hold out data?
Is it the score on the training data, on the test data, or the mean of the
train and test scores?

thx

From mail at sebastianraschka.com  Thu Jan 3 22:50:16 2019
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Thu, 3 Jan 2019 21:50:16 -0600
Subject: [scikit-learn] How GridSearchCV to get best_params?
Message-ID: <21200DB3-457F-445B-B00F-12EF55F02908@sebastianraschka.com>

I think it refers to the test folds of the k-fold cross-validation that is
used internally via the `cv` parameter of GridSearchCV (or the test folds of
an alternative cross-validation scheme that you may pass as an iterable to
`cv`).

Best,
Sebastian

> On Jan 3, 2019, at 9:44 PM, lampahome wrote:
>
> My question is: what is the hold out data?
> Is it the score on the training data, on the test data, or the mean of the
> train and test scores?

From joel.nothman at gmail.com  Sat Jan 5 05:32:28 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Sat, 5 Jan 2019 21:32:28 +1100
Subject: [scikit-learn] How GridSearchCV to get best_params?
In-Reply-To: <21200DB3-457F-445B-B00F-12EF55F02908@sebastianraschka.com>

See cv_results_['mean_test_score'] (or 'mean_test_x' where 'x' is the scorer
named in the refit parameter).
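For the single-metric case, a rough sketch of how these attributes relate
(illustrative data and parameter grid, not from the original thread):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
# cv=5: every parameter candidate is scored on 5 held-out validation folds
search = GridSearchCV(LogisticRegression(solver='liblinear'),
                      param_grid={'C': [0.1, 1, 10]}, cv=5)
search.fit(X, y)

print(search.cv_results_['mean_test_score'])  # mean held-out-fold score per candidate
# best_params_ is the candidate with the highest mean held-out-fold score
print(search.best_params_)
print(search.best_index_ == np.argmax(search.cv_results_['mean_test_score']))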
From gael.varoquaux at normalesup.org  Mon Jan 7 16:38:44 2019
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Mon, 7 Jan 2019 22:38:44 +0100
Subject: [scikit-learn] Next Sprint
Message-ID: <20190107213844.mjn3mas743cbrsrs@phare.normalesup.org>

Hi everybody and happy new year,

We let this thread about the sprint die. I hope that this didn't change
people's plans.

So, it seems that the week of Feb 25th is a good week. I'll assume that
it's good for most and start planning from there (if it's not the case,
let me know).

I've started our classic sprint-planning wiki page:
https://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events
It's not rocket science, but it's better than an email thread to keep
information together.

It would be great if people could add their name, and whether they need
funding. We need to evaluate if we need to find funding.

Also, it's quite soon, so maybe it would be good to start planning
accommodation and travel :$.

Cheers,

Gaël

On Sat, Dec 22, 2018 at 05:27:39PM +0100, Guillaume Lemaître wrote:
> Works for me as well.
> Sent from my phone - sorry to be brief and potential misspell.

> -------- Original Message --------
> From: rth.yurchak at pm.me
> Subject: Re: [scikit-learn] Next Sprint

> That works for me as well.

> On 21/12/2018 16:00, Olivier Grisel wrote:
> > Ok for me. The last 3 weeks of February are fine for me.
> >
> > Le jeu. 20 déc. 2018 à 21:21, Alexandre Gramfort a écrit :
> >     ok for me
> >     Alex
> >
> >     On Thu, Dec 20, 2018 at 8:35 PM Adrin wrote:
> >     > It'll be the least favourable week of February for me, but I can
> >     > make do.
> >
> >     > On Thu, 20 Dec 2018 at 18:45 Andreas Mueller wrote:
> >     >> Works for me!
> >
> >     >> On 12/19/18 5:33 PM, Gael Varoquaux wrote:
> >     >> > I would propose the week of Feb 25th, as I heard people say
> >     >> > that they might be available at this time. It is good for many
> >     >> > people, or should we organize a doodle?
> >     >> > G
> >
> >     >> > On Wed, Dec 19, 2018 at 05:27:21PM -0500, Andreas Mueller wrote:
> >     >> >> Can we please nail down dates for a sprint?
> >
> >     >> >> On 11/20/18 2:25 PM, Gael Varoquaux wrote:
> >     >> >>> On Tue, Nov 20, 2018 at 08:15:07PM +0100, Olivier Grisel wrote:
> >     >> >>>> We can also do Paris in April / May or June if that's ok
> >     >> >>>> with Joel and better for Andreas.
> >     >> >>> Absolutely.
> >     >> >>> My thoughts here are that I want to minimize transportation,
> >     >> >>> partly because flying has a large carbon footprint. Also, for
> >     >> >>> personal reasons, I am not sure that I will be able to make it
> >     >> >>> to Austin in July, but I realize that this is a pretty bad
> >     >> >>> argument.
> >     >> >>> We're happy to try to host in Paris whenever it's most
> >     >> >>> convenient and to try to help with travel for those not in
> >     >> >>> Paris.
> >     >> >>> Gaël

--
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info            http://twitter.com/GaelVaroquaux
From pisymbol at gmail.com  Mon Jan 7 23:50:49 2019
From: pisymbol at gmail.com (pisymbol)
Date: Mon, 7 Jan 2019 23:50:49 -0500
Subject: [scikit-learn] LogisticRegression coef_ greater than n_features?

According to the doc (0.20.2), the coef_ attribute is supposed to be of
shape (1, n_features) for binary classification. Well, I created a Pipeline
and performed a GridSearchCV to create a LogisticRegression model that does
fairly well. However, when I went to rank feature importance, I noticed that
the coef_ of my best_estimator_ has 24 entries while my training data has 22
features.

What am I missing? How could coef_ > n_features?

-aps

From pisymbol at gmail.com  Tue Jan 8 00:02:17 2019
From: pisymbol at gmail.com (pisymbol)
Date: Tue, 8 Jan 2019 00:02:17 -0500
Subject: [scikit-learn] LogisticRegression coef_ greater than n_features?

Just a follow-up: I am using a OneHotEncoder to encode two categoricals as
part of my pipeline (I am also using an imputer and a standard scaler, but I
don't see how those could add features).

Could my pipeline actually add two more features during fitting?

-aps

From mail at sebastianraschka.com  Mon Jan 7 23:54:50 2019
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Mon, 7 Jan 2019 22:54:50 -0600
Subject: [scikit-learn] LogisticRegression coef_ greater than n_features?
Message-ID: <2A93A0B0-359D-4C30-9ED7-2A166926E0F6@sebastianraschka.com>

Maybe check

a) if the actual labels of the training examples don't start at 0
b) if you have gaps, e.g., if your unique training labels are 0, 1, 4, ..., 23

Best,
Sebastian

> On Jan 7, 2019, at 10:50 PM, pisymbol wrote:
>
> What am I missing? How could coef_ > n_features?
From mail at sebastianraschka.com  Tue Jan 8 00:32:22 2019
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Mon, 7 Jan 2019 23:32:22 -0600
Subject: [scikit-learn] LogisticRegression coef_ greater than n_features?
Message-ID: <1061B4E0-615B-4658-B8F6-946D7CBAAD94@sebastianraschka.com>

E.g., if you have a feature with values 'a', 'b', 'c', then applying the
one-hot encoder will transform it into 3 features.

Best,
Sebastian

> On Jan 7, 2019, at 11:02 PM, pisymbol wrote:
>
> Just a follow-up: I am using a OneHotEncoder to encode two categoricals as
> part of my pipeline.
>
> Could my pipeline actually add two more features during fitting?
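A quick way to see where the two extra entries could come from (illustrative
data; the exact counts depend on your categories): if each of the two
categoricals has exactly two levels, one-hot encoding turns those 2 columns
into 4, so 20 numeric + 4 one-hot = 24 coefficients.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# two categorical columns, each with two distinct values
X_cat = np.array([['a', 'x'],
                  ['b', 'y'],
                  ['a', 'y']])
ohe = OneHotEncoder(sparse=False)
print(ohe.fit_transform(X_cat).shape)  # (3, 4): each 2-level column becomes 2 columns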
From qinhanmin2005 at sina.com  Tue Jan 8 08:13:39 2019
From: qinhanmin2005 at sina.com (Hanmin Qin)
Date: Tue, 08 Jan 2019 21:13:39 +0800
Subject: [scikit-learn] Next Sprint
Message-ID: <20190108131339.3B9965D0009B@webmail.sinamail.sina.com.cn>

Apologies, I won't be available because of school work. Thanks to the whole
community for your great help. I'll continue to contribute and stay online
during the sprint.

Hanmin Qin

----- Original Message -----
From: Gael Varoquaux
Subject: Re: [scikit-learn] Next Sprint
Date: 2019-01-08 05:40

> So, it seems that the week of Feb 25th is a good week. I'll assume that
> it's good for most and start planning from there (if it's not the case,
> let me know).

From astha31agarwal at gmail.com  Tue Jan 8 09:26:25 2019
From: astha31agarwal at gmail.com (Astha Agarwal)
Date: Tue, 8 Jan 2019 09:26:25 -0500
Subject: [scikit-learn] Using sklearn-crfsuite on Production Systems

Hi,

I'm wondering if anyone is using sklearn-crfsuite on production systems. Is
this library suitable for use on production systems in industry (rather than
academia), for non-big-data problems?

Thanks,
Astha
From pisymbol at gmail.com  Tue Jan 8 09:51:20 2019
From: pisymbol at gmail.com (pisymbol)
Date: Tue, 8 Jan 2019 09:51:20 -0500
Subject: [scikit-learn] LogisticRegression coef_ greater than n_features?

If that is the case, what order are the coefficients in, then?

-aps

> On Tue, Jan 8, 2019 at 12:48 AM Sebastian Raschka wrote:
>
> E.g., if you have a feature with values 'a', 'b', 'c', then applying the
> one-hot encoder will transform it into 3 features.

From pisymbol at gmail.com  Tue Jan 8 10:33:04 2019
From: pisymbol at gmail.com (pisymbol)
Date: Tue, 8 Jan 2019 10:33:04 -0500
Subject: [scikit-learn] LogisticRegression coef_ greater than n_features?

Also, Sebastian, I have binary classes, but they are strings:

clf.classes_:
array(['American', 'Southwest'], dtype=object)

> On Tue, Jan 8, 2019 at 9:51 AM pisymbol wrote:
>
> If that is the case, what order are the coefficients in, then?
From mail at sebastianraschka.com  Tue Jan 8 20:07:03 2019
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Tue, 8 Jan 2019 19:07:03 -0600
Subject: [scikit-learn] LogisticRegression coef_ greater than n_features?
Message-ID: <4A92B9C5-9E20-48CE-A42A-261ABE720505@sebastianraschka.com>

It seems to be determined by the sorted order of the unique values in the
training set (note that in the first example below, 'a' gets the first
column even though 'b' occurs first). E.g.,

from sklearn.preprocessing import OneHotEncoder
import numpy as np

x = np.array([['b'], ['a'], ['b']])
ohe = OneHotEncoder()
xt = ohe.fit_transform(x)
xt.todense()

matrix([[0., 1.],
        [1., 0.],
        [0., 1.]])

and

x = np.array([['a'], ['b'], ['a']])
ohe = OneHotEncoder()
xt = ohe.fit_transform(x)
xt.todense()

matrix([[1., 0.],
        [0., 1.],
        [1., 0.]])

Not sure how you used the OHE, but you also want to make sure that you only
use it on those columns that are indeed categorical; e.g., note the
following behavior, where the float column is treated as categorical too
and yields 5 columns in total:

x = np.array([['a', 1.1],
              ['b', 1.2],
              ['a', 1.3]])
ohe = OneHotEncoder()
xt = ohe.fit_transform(x)
xt.todense()

matrix([[1., 0., 1., 0., 0.],
        [0., 1., 0., 1., 0.],
        [1., 0., 0., 0., 1.]])

Best,
Sebastian

> On Jan 8, 2019, at 9:33 AM, pisymbol wrote:
>
> Also, Sebastian, I have binary classes, but they are strings:
>
> clf.classes_:
> array(['American', 'Southwest'], dtype=object)
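Following up on the ordering question: the fitted encoder exposes the
learned category order and, in 0.20+, can generate column names. A sketch,
where 'carrier' is just a made-up input-feature name:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

X_cat = np.array([['American'], ['Southwest'], ['American']])
ohe = OneHotEncoder().fit(X_cat)
print(ohe.categories_)                     # sorted categories = column order of the dummies
print(ohe.get_feature_names(['carrier']))  # e.g. ['carrier_American' 'carrier_Southwest']

Since a FeatureUnion concatenates its transformers' outputs in the order
they are listed, stacking these names in front of the numeric column names
(if your categorical block comes first) should label the entries of coef_.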
From pahome.chen at mirlab.org  Tue Jan 8 20:23:32 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Wed, 9 Jan 2019 09:23:32 +0800
Subject: [scikit-learn] Does sklearn contain xgboost?

As the title says: does sklearn contain xgboost to use?

thanks

From niourf at gmail.com  Tue Jan 8 21:03:01 2019
From: niourf at gmail.com (Nicolas Hug)
Date: Tue, 8 Jan 2019 21:03:01 -0500
Subject: [scikit-learn] Does sklearn contain xgboost?
Message-ID: <1f0c4259-6e73-61ab-f6d1-a16ef7b5811f@gmail.com>

XGBoost is a specific implementation of gradient boosting trees, so
strictly speaking scikit-learn cannot "contain" XGBoost. That being said:

- XGBoost has a scikit-learn compatible API:
https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn.
So does LightGBM, another fast implementation of gradient boosting trees.

- scikit-learn implements "vanilla" gradient boosting:
https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting

- There's an open PR in scikit-learn (still very WIP) that implements the
same kind of optimization that XGBoost and LightGBM use, which will make
GBDT faster: https://github.com/scikit-learn/scikit-learn/pull/12807

Nicolas
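For instance, with the xgboost package installed separately (pip install
xgboost), its wrapper drops into the usual scikit-learn workflow. A minimal
sketch (illustrative dataset and parameters):

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier  # not part of scikit-learn itself

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = XGBClassifier(n_estimators=100, max_depth=3)  # scikit-learn style estimator
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))

Because the API is compatible, the estimator can also be dropped into
GridSearchCV or a Pipeline.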
From t3kcit at gmail.com  Wed Jan 9 14:09:58 2019
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 9 Jan 2019 14:09:58 -0500
Subject: [scikit-learn] Next Sprint
Message-ID: <6b1a85d7-029e-6024-d29c-75dbb0828735@gmail.com>

Great, thanks for finalizing!

It would be good to get some vague estimate of funding. I can probably
provide some, though I'm in the process of hiring Thomas Fan, which might
tie up some of my funds.

Gaël, does the foundation have funds and do you want to use them?
And/or do you/Inria have funds you want to use?

On 1/7/19 4:38 PM, Gael Varoquaux wrote:
> So, it seems that the week of Feb 25th is a good week. I'll assume that
> it's good for most and start planning from there.
>
> It would be great if people could add their name, and whether they need
> funding. We need to evaluate if we need to find funding.
From pahome.chen at mirlab.org  Thu Jan 10 03:47:14 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Thu, 10 Jan 2019 16:47:14 +0800
Subject: [scikit-learn] Any clustering algo to cluster by the ratio of series data?

Clustering algorithms group samples by calculating e.g. the Euclidean
distance. I wonder if any clustering algo can cluster series data by its
shape rather than its magnitude.

Ex: every item has its sold numbers for every day:

Item,Day1,Day2,Day3,Day4,Day5
A,1,5,1,5,1
B,10,50,10,50,10
C,4,70,30,10,50

The day-to-day change ratios of A and B are identical (500%, 20%, 500%,
20%), so I want A & B in the same cluster and C in another one.

If I don't want to compute the change ratio of each sample myself, is there
any way to cluster by the change ratio of the samples?

thx
From gael.varoquaux at normalesup.org  Thu Jan 10 10:34:08 2019
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Thu, 10 Jan 2019 16:34:08 +0100
Subject: [scikit-learn] Next Sprint
Message-ID: <20190110153408.hrxjcuy2zbj3t22o@phare.normalesup.org>

On Wed, Jan 09, 2019 at 02:09:58PM -0500, Andreas Mueller wrote:
> Gaël, does the foundation have funds and do you want to use them?
> And/or do you/Inria have funds you want to use?

Neither myself nor Inria has funds to use outside the foundation. The
foundation can commit money if needed. We tend to prefer spending it on
paying senior people to work on the project, as that is the bottleneck (we
are still recruiting, by the way), but such a sprint is important.

We will also apply for sprint-specific funding sources. If we can lighten
up your budget, so that you can pay awesome people to work on the project,
it is a good thing.

Gaël
From t3kcit at gmail.com  Thu Jan 10 12:32:17 2019
From: t3kcit at gmail.com (Andreas Mueller)
Date: Thu, 10 Jan 2019 12:32:17 -0500
Subject: [scikit-learn] Next Sprint
Message-ID: <4e421d64-2ede-80ba-932a-e366b515133d@gmail.com>

Ok, good to know. And I totally agree about using foundation money to pay
senior people. Though discussion time between senior people is also a
serious bottleneck imho ;)

Any sprint-specific funding you're thinking of? Google gave in the past,
right? I could cold-email some people (Two Sigma, Bloomberg?) but I'm not
sure that's very promising.

On 1/10/19 10:34 AM, Gael Varoquaux wrote:
> Neither myself nor Inria has funds to use outside the foundation. The
> foundation can commit money if needed. We tend to prefer spending it on
> paying senior people to work on the project.

From gael.varoquaux at normalesup.org  Thu Jan 10 12:36:22 2019
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Thu, 10 Jan 2019 18:36:22 +0100
Subject: [scikit-learn] Next Sprint
Message-ID: <20190110173622.f54rtctpvftlh2lx@phare.normalesup.org>

On Thu, Jan 10, 2019 at 12:32:17PM -0500, Andreas Mueller wrote:
> Any sprint-specific funding you're thinking of? Google gave in the past,
> right?

I was thinking of the PSF.
Gaël

From t3kcit at gmail.com  Thu Jan 10 12:54:09 2019
From: t3kcit at gmail.com (Andreas Mueller)
Date: Thu, 10 Jan 2019 12:54:09 -0500
Subject: [scikit-learn] Next Sprint
Message-ID: <0eb889b1-10f2-4649-f035-602c589a8c6c@gmail.com>

Do you or anyone on your team have cycles to do that?
I certainly don't, but I could try to delegate (to the single person I
delegate everything to ;)

On 1/10/19 12:36 PM, Gael Varoquaux wrote:
> I was thinking of the PSF.

From gael.varoquaux at normalesup.org  Thu Jan 10 13:19:05 2019
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Thu, 10 Jan 2019 19:19:05 +0100
Subject: [scikit-learn] Next Sprint
Message-ID: <20190110181905.6pyuuaj4vl4vdznz@phare.normalesup.org>

On Thu, Jan 10, 2019 at 12:54:09PM -0500, Andreas Mueller wrote:
> Do you or anyone on your team have cycles to do that?

I asked Guillaume Lemaître to do it. He has started.

Gaël

From rohanlekhwani at gmail.com  Fri Jan 11 05:32:41 2019
From: rohanlekhwani at gmail.com (Rohan Lekhwani)
Date: Fri, 11 Jan 2019 16:02:41 +0530
Subject: [scikit-learn] GSoC 2019

Hello,

I'm an undergraduate interested in participating in GSoC 2019. I wanted to
inquire whether scikit-learn will be participating under the umbrella of
the Python Software Foundation as a sub-org this year. Thanks.

Rohan
From gael.varoquaux at normalesup.org  Wed Jan 16 05:49:48 2019
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Wed, 16 Jan 2019 11:49:48 +0100
Subject: [scikit-learn] Non-core developers at the sprint
Message-ID: <20190116104948.d3hytjd3zvvcpuxl@phare.normalesup.org>

Dear users and developers,

We have a sprint coming up in Paris Feb 25th to March 1st:
https://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events

Looking at the list of people who are coming, I am noticing that we have
mostly core developers. While the priority of the sprint is to work on the
big picture rather than onboarding, I am worried that there might be some
self-selection happening. I am sure that some excellent people who are
contributors, yet not core contributors, could come.

I would like to invite people who have already contributed and want to get
more involved in the project to contact us about joining the sprint.
Specifically, we are willing to fund accommodation and travel for one or
two participants. Please send a short message to Guillaume Lemaître and
myself presenting what you have contributed and what you would like to
contribute, as well as your funding needs. We will curate this list and
core contributors will settle on whom we can accommodate.

Cheers,

Gaël

From pahome.chen at mirlab.org  Wed Jan 16 23:29:04 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Thu, 17 Jan 2019 12:29:04 +0800
Subject: [scikit-learn] Any clustering algo to cluster multiple timing series data?

Clustering algorithms group samples by calculating the Euclidean distance.
I wonder if any clustering algo can cluster timing series data by its shape
rather than its magnitude.

Ex: every item has its sold numbers for every day:

Item,Day1,Day2,Day3,Day4,Day5
A,1,5,1,5,1
B,10,50,10,50,10
C,4,70,30,10,50

The day-to-day change ratios of A and B are identical (500%, 20%, 500%,
20%), so I want A & B in the same cluster and C in another one.

If I don't want to compute the change ratio of each sample myself, is there
any way to cluster by the change ratio of the samples?

thx

From mbrynildsen at grundfos.com  Thu Jan 17 02:05:25 2019
From: mbrynildsen at grundfos.com (Mikkel Haggren Brynildsen)
Date: Thu, 17 Jan 2019 07:05:25 +0000
Subject: [scikit-learn] Any clustering algo to cluster multiple timing series data?

What about dynamic time warping?

Sent from my iPhone

> On Jan 17, 2019, at 05:31, lampahome wrote:
>
> I wonder if any clustering algo can cluster timing series data by its
> shape rather than its magnitude. The day-to-day change ratios of A and B
> are identical, so I want A & B in the same cluster and C in another one.
From pahome.chen at mirlab.org  Thu Jan 17 02:45:11 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Thu, 17 Jan 2019 15:45:11 +0800
Subject: [scikit-learn] Any clustering algo to cluster multiple timing series data?

Mikkel Haggren Brynildsen wrote on Thu, Jan 17, 2019 at 3:07 PM:
> What about dynamic time warping?

I thought DTW is used for two series of different lengths, but all my
series have the same length. Maybe it doesn't apply?

From mbrynildsen at grundfos.com  Thu Jan 17 02:58:39 2019
From: mbrynildsen at grundfos.com (Mikkel Haggren Brynildsen)
Date: Thu, 17 Jan 2019 07:58:39 +0000
Subject: [scikit-learn] Any clustering algo to cluster multiple timing series data?

You can use it to get a single similarity / closeness number between two
time series and then feed that into a clustering algorithm.

For instance, look at
https://github.com/markdregan/K-Nearest-Neighbors-with-Dynamic-Time-Warping
as a first idea: if you expand the distance function

d = lambda x, y: abs(x - y)

to a multivariate local distance

d2 = lambda a, b: np.sqrt(float((a[0] - b[0])**2 + (a[1] - b[1])**2))

(or any other n-dimensional metric), then you have an algorithm that can
cluster the time series.

It also works when the time series are of equal length.

Best
Mikkel Brynildsen

From: lampahome
Sent: 17 January 2019 08:45

> I thought DTW is used for two series of different lengths, but all my
> series have the same length. Maybe it doesn't apply?

From alexandre.gramfort at inria.fr  Thu Jan 17 03:53:35 2019
From: alexandre.gramfort at inria.fr (Alexandre Gramfort)
Date: Thu, 17 Jan 2019 09:53:35 +0100
Subject: [scikit-learn] Any clustering algo to cluster multiple timing series data?

you can have a look at: https://tslearn.readthedocs.io/en/latest/

Alex

> On Thu, Jan 17, 2019 at 9:01 AM Mikkel Haggren Brynildsen wrote:
>
> You can use it to get a single similarity / closeness number between two
> time series and then feed that into a clustering algorithm.
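To make the precomputed-distance route concrete, a rough sketch (a naive
DTW, quadratic per pair, fine for short series; tslearn ships optimized
versions). Note that DTW on the raw values will not put A and B together
here; for the ratio-based grouping you would rescale each series first, as
discussed further down the thread:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def dtw(a, b):
    # naive dynamic-time-warping distance between two 1-D series
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

series = np.array([[1, 5, 1, 5, 1],
                   [10, 50, 10, 50, 10],
                   [4, 70, 30, 10, 50]], dtype=float)
dist = np.array([[dtw(s, t) for t in series] for s in series])

# 'precomputed' lets any pairwise distance matrix drive the clustering
labels = AgglomerativeClustering(n_clusters=2, affinity='precomputed',
                                 linkage='average').fit_predict(dist)
print(labels)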
From t3kcit at gmail.com  Fri Jan 18 12:18:52 2019
From: t3kcit at gmail.com (Andreas Mueller)
Date: Fri, 18 Jan 2019 12:18:52 -0500
Subject: [scikit-learn] Scipy 2019 Tutorial

Hey Folks.

The scipy tutorial chairs just pinged me about submitting a tutorial.
I'm planning to, and wanted to ask if anyone is interested in co-teaching
with me. I might transition from the "scipy tutorial" materials (evolved
over maybe 5 years) to my own materials, but not sure yet.
Nicolas said he'd potentially be interested but I wanted to ask around who
else is coming and might be interested.

Cheers,
Andy

From stefanv at berkeley.edu  Fri Jan 18 12:56:09 2019
From: stefanv at berkeley.edu (Stefan van der Walt)
Date: Fri, 18 Jan 2019 09:56:09 -0800
Subject: [scikit-learn] ANN: scikit-image 0.14.2
Message-ID: <20190118175609.yiiis7w4v6gjpo3n@carbo>

Announcement: scikit-image 0.14.2
=================================

This release handles an incompatibility between scikit-image and NumPy
1.16.0, released on January 13th 2019.

It contains the following changes from 0.14.1:

API changes
-----------
- ``skimage.measure.regionprops`` no longer removes singleton dimensions
  from label images (#3284). To recover the old behavior, replace
  ``regionprops(label_image)`` calls with
  ``regionprops(np.squeeze(label_image))``

Bug fixes
---------
- Address deprecation of NumPy ``_validate_lengths`` (backport of #3556)
- Correctly handle the maximum number of lines in Hough transforms
  (backport of #3514)
- Correctly implement early stopping criterion for rank kernel noise
  filter (backport of #3503)
- Fix ``skimage.measure.regionprops`` for 1x1 inputs (backport of #3284)

Enhancements
------------
- Rewrite of ``local_maxima`` with flood-fill (backport of #3022, #3447)

Build Process & Testing
-----------------------
- Dedicate a ``--pre`` build in appveyor (backport of #3222)
- Avoid Travis-CI failure regarding ``skimage.lookfor`` (backport of #3477)
- Stop using the ``pytest.fixtures`` decorator (#3558)
- Filter out DeprecationPendingWarning for matrix subclass (#3637)
- Fix matplotlib test warnings and circular import (#3632)

Contributors & Reviewers
------------------------
- François Boulogne
- Emmanuelle Gouillart
- Lars Grüter
- Mark Harfouche
- Juan Nunez-Iglesias
- Egor Panfilov
- Stefan van der Walt
From hamidizade.s at gmail.com  Sun Jan 20 12:01:21 2019
From: hamidizade.s at gmail.com (S Hamidizade)
Date: Sun, 20 Jan 2019 20:31:21 +0330
Subject: [scikit-learn] Imblearn: SMOTENC

Dear scikit-learners,

Hi. I would greatly appreciate it if you could let me know how to use
SMOTENC. I wrote:

num_indices1 = list(X.iloc[:, np.r_[0:94, 95, 97, 100:123]].columns.values)
cat_indices1 = list(X.iloc[:, np.r_[94, 96, 98, 99, 123:160]].columns.values)
print(len(num_indices1))
print(len(cat_indices1))

pipeline = Pipeline(steps=[
    # categorical features
    ('feature_processing', FeatureUnion(transformer_list=[
        ('categorical', MultiColumn(cat_indices1)),
        # numeric
        ('numeric', Pipeline(steps=[
            ('select', MultiColumn(num_indices1)),
            ('scale', StandardScaler())
        ]))
    ])),
    ('clf', rg)
])

As indicated, I have 5 categorical features. Actually, indices 123 to 160
are related to one categorical feature with 37 possible values, which was
converted into 37 columns using get_dummies.

I think SMOTENC should be inserted before the classifier ('clf', rg), but I
don't know how to define "categorical_features" in SMOTENC. Besides, could
you please let me know where to use imblearn.pipeline?

Thanks in advance.
Best regards,

From g.lemaitre58 at gmail.com  Mon Jan 21 05:54:01 2019
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Mon, 21 Jan 2019 11:54:01 +0100
Subject: [scikit-learn] Imblearn: SMOTENC

SMOTENC will internally one-hot encode the categorical features, generate
new samples, and finally decode. So you need to do something like:

from imblearn.pipeline import make_pipeline, Pipeline

num_indices1 = list(X.iloc[:, np.r_[0:94, 95, 97, 100:123]].columns.values)
cat_indices1 = list(X.iloc[:, np.r_[94, 96, 98, 99, 123:160]].columns.values)

pipeline = Pipeline(steps=[...])  # the same pipeline as above

pipeline_with_resampling = make_pipeline(
    SMOTENC(categorical_features=cat_indices1), pipeline)

From pahome.chen at mirlab.org  Mon Jan 21 05:56:36 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Mon, 21 Jan 2019 18:56:36 +0800
Subject: [scikit-learn] Any clustering algo to cluster multiple timing series data?

How about scaling the data first with MinMaxScaler and then clustering?

What I thought is that scaling maps each series into the 0~1 range, so the
absolute quantity of each series is ignored. After scaling, what remains is
the increase/decrease pattern between the points. Then clustering by
Euclidean distance should work?
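On the toy data from this thread, that idea checks out; one caveat is that
MinMaxScaler scales per column (feature), so to rescale each series
independently you transform the transposed array. A sketch:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

series = np.array([[1, 5, 1, 5, 1],
                   [10, 50, 10, 50, 10],
                   [4, 70, 30, 10, 50]], dtype=float)

# MinMaxScaler works column-wise, so transpose to map each *series* into [0, 1]
scaled = MinMaxScaler().fit_transform(series.T).T
print(scaled[0], scaled[1])  # A and B become identical: [0. 1. 0. 1. 0.]

labels = KMeans(n_clusters=2, random_state=0).fit_predict(scaled)
print(labels)                # A and B share a label; C gets the other one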
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From lope at usal.es  Tue Jan 22 04:55:44 2019
From: lope at usal.es (Daniel López-Sánchez)
Date: Tue, 22 Jan 2019 10:55:44 +0100
Subject: [scikit-learn] PR #13003: [MRG] Add Tensor Sketch algorithm to
 Kernel Approximation module
Message-ID: 

Dear all,

I recently posted a PR which adds the Tensor Sketch algorithm [1] to the
Kernel Approximation module of Scikit-learn.

I believe this new feature makes the Kernel Approximation module more
complete by providing a data-independent method for polynomial kernel
approximation, as the currently included methods either require access to
training data (Nystroem) or do not work with polynomial kernels. The
implementation has been tested to provide the same results as the original
Matlab implementation provided by the author of [1].

I would appreciate any feedback you can provide,

Regards,

[1] Pham, N., & Pagh, R. (2013, August). Fast and scalable polynomial
kernels via explicit feature maps. In Proceedings of the 19th ACM SIGKDD
international conference on Knowledge discovery and data mining (pp.
239-247). ACM.

Daniel López Sánchez
lope at usal.es / (+34) 687174328

BISITE Research Group (http://bisite.usal.es)
Edificio I+D+i Universidad de Salamanca, C/ Espejo S/N, 37007
Salamanca, Spain
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From adrin.jalali at gmail.com  Tue Jan 22 05:02:06 2019
From: adrin.jalali at gmail.com (Adrin)
Date: Tue, 22 Jan 2019 11:02:06 +0100
Subject: [scikit-learn] PR #13003: [MRG] Add Tensor Sketch algorithm to
 Kernel Approximation module
In-Reply-To: 
References: 
Message-ID: 

Hi Daniel,

Thanks for the note, but sometimes there can be quite some delay in us
reviewing a PR, and discussion about a PR should best happen on the PR
itself.

Best,
Adrin.

On Tue, 22 Jan 2019 at 10:57 Daniel López-Sánchez <lope at usal.es> wrote:

> Dear all,
>
> I recently posted a PR which adds the Tensor Sketch algorithm [1] to the
> Kernel Approximation module of Scikit-learn.
>
> I believe this new feature makes the Kernel Approximation module more
> complete by providing a data-independent method for polynomial kernel
> approximation, as the currently included methods either require access to
> training data (Nystroem) or do not work with polynomial kernels. The
> implementation has been tested to provide the same results as the original
> Matlab implementation provided by the author of [1].
>
> I would appreciate any feedback you can provide,
>
> Regards,
>
> [1] Pham, N., & Pagh, R. (2013, August). Fast and scalable polynomial
> kernels via explicit feature maps. In Proceedings of the 19th ACM SIGKDD
> international conference on Knowledge discovery and data mining (pp.
> 239-247). ACM.
>
> Daniel López Sánchez
> lope at usal.es / (+34) 687174328
>
> BISITE Research Group (http://bisite.usal.es)
> Edificio I+D+i Universidad de Salamanca, C/ Espejo S/N, 37007
> Salamanca, Spain
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From pahome.chen at mirlab.org  Wed Jan 23 05:35:57 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Wed, 23 Jan 2019 18:35:57 +0800
Subject: [scikit-learn] Affinity Propagation is the best algo for without
 choosing the number of cluster?
Message-ID: 

I'm searching for a clustering algo which groups the data without having to
choose the number of groups. I found affinity propagation (AP), which
doesn't need the number of clusters.

In my experiments, AP clusters well without choosing any parameters. But
I'm not sure whether there are corner cases which would make the clustering
worse.

Has anyone tried AP and found some side effect, or a way to tune its
parameters?
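For reference, this is the kind of experiment I ran (toy data and made-up
values), where `preference` and `damping` seem to be the knobs that control
how many clusters come out:

import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# a lower (more negative) preference yields fewer exemplars/clusters
for pref in (None, -50, -500):
    ap = AffinityPropagation(preference=pref, damping=0.9).fit(X)
    print(pref, len(ap.cluster_centers_indices_))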
thx
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ndbecker2 at gmail.com  Wed Jan 23 13:26:44 2019
From: ndbecker2 at gmail.com (Neal Becker)
Date: Wed, 23 Jan 2019 13:26:44 -0500
Subject: [scikit-learn] affinity propagation not giving desired answer
Message-ID: 

I am not too familiar with affinity propagation, but am just trying it out.
The problem is to cluster using a distance metric that is Euclidean
distance, but with a limit. When the distance is greater than some
threshold, then the metric is -Inf. In other words, a point can be accepted
into a cluster only if the distance from the point to the cluster center is
less than some threshold.

It seems my test with affinity propagation will sometimes produce a correct
result, but other times the result seems to violate the condition. In the
example code, a couple of outlier points seem to be in clusters that are
not close at all. I've tried playing with parameters (such as preference)
without eliminating the problem. Any suggestions?

---------
import numpy as np
from sklearn.cluster import AffinityPropagation
# from randomgen import RandomGenerator, Xoroshiro128
# rs = RandomGenerator (Xoroshiro128 (0))
from numpy.random import RandomState
rs = RandomState(3)
pts = rs.uniform (-5, 5, (50,2))
import seaborn as sns
import matplotlib.pyplot as plt

def distance (ax, ay, bx, by):
    d = (ax - bx)**2 + (ay - by)**2
    if d > 1:
        return -1e6
    else:
        return -d

d = np.empty ((pts.shape[0], pts.shape[0]))
for i in range(pts.shape[0]):
    for j in range(pts.shape[0]):
        d[i,j] = distance(pts[i,0], pts[i,1], pts[j,0], pts[j,1])

preference = -20 #np.mean (d[d > -1e6])
print ('preference:', preference)

clustering = AffinityPropagation(affinity='precomputed', verbose=True,
                                 preference=preference)
res = clustering.fit(d)
c = clustering

colors = np.array(sns.color_palette("hls", np.max(c.labels_)+1))
print('n_clusters:', np.max(c.labels_)+1)
centers = pts[c.cluster_centers_indices_]
plt.scatter (pts[:,0], pts[:,1], c=colors[c.labels_])
plt.scatter (centers[:,0], centers[:,1], marker='X', s=100, c=colors)
plt.show()

From ndbecker2 at gmail.com  Wed Jan 23 15:01:13 2019
From: ndbecker2 at gmail.com (Neal Becker)
Date: Wed, 23 Jan 2019 15:01:13 -0500
Subject: [scikit-learn] cluster.affinity_propagation doesn't accept sparse?
Message-ID: 

It would appear that affinity_propagation accepts sparse similarity input:

X = check_array(X, accept_sparse='csr')

But if I try it, I get:

~/.local/lib/python3.7/site-packages/sklearn/cluster/affinity_propagation_.py in affinity_propagation(S, preference, convergence_iter, max_iter, damping, copy, verbose, return_n_iter)
    137
    138     # Place preference on the diagonal of S
--> 139     S.flat[::(n_samples + 1)] = preference
    140
    141     A = np.zeros((n_samples, n_samples))

~/.local/lib/python3.7/site-packages/scipy/sparse/base.py in __getattr__(self, attr)
    687             return self.getnnz()
    688         else:
--> 689             raise AttributeError(attr + " not found")
    690
    691     def transpose(self, axes=None, copy=False):

AttributeError: flat not found

From hamidizade.s at gmail.com  Thu Jan 24 01:09:55 2019
From: hamidizade.s at gmail.com (S Hamidizade)
Date: Thu, 24 Jan 2019 09:39:55 +0330
Subject: [scikit-learn] Imblearn: SMOTENC
In-Reply-To: 
References: 
Message-ID: 

Dear Mr. Lemaitre

Thanks a lot for sharing your time and knowledge. Unfortunately, it throws
the following error:

Traceback (most recent call last):
  File "D:/mifs-master_2/MU/learning-from-imbalanced-classes-master/learning-from-imbalanced-classes-master/continuous/Final Logit/SMOTENC/logit-final - Copy.py", line 419, in <module>
    pipeline_with_resampling = make_pipeline(SMOTENC(categorical_features=cat_indices1), pipeline)
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 594, in make_pipeline
    return Pipeline(_name_estimators(steps), memory=memory)
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 119, in __init__
    self._validate_steps()
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 167, in _validate_steps
    " '%s' (type %s) doesn't" % (t, type(t)))
TypeError: All intermediate steps should be transformers and implement fit
and transform. 'SMOTENC(categorical_features=['x95', 'x97', 'x99', 'x100',
'x121_1', 'x121_2', 'x121_3', 'x121_4', 'x121_5', 'x121_6', 'x121_7',
'x121_8', 'x121_9', 'x121_10', 'x121_11', 'x121_12', 'x121_13', 'x121_14',
'x121_15', 'x121_16', 'x121_17', 'x121_18', 'x121_19', 'x121_20',
'x121_21', 'x121_22', 'x121_23', 'x121_24', 'x121_25', 'x121_26',
'x121_27', 'x121_28', 'x121_29', 'x121_30', 'x121_31', 'x121_32',
'x121_33', 'x121_34', 'x121_35', 'x121_36', 'x121_37'], k_neighbors=5,
n_jobs=1, random_state=None, sampling_strategy='auto')' (type ) doesn't

Thanks in advance.
Best regards,

On Mon, Jan 21, 2019 at 2:26 PM Guillaume Lemaître <g.lemaitre58 at gmail.com> wrote:

> SMOTENC will internally one hot encode the features, generate new
> features, and finally decode. So you need to do something like:
>
> from imblearn.pipeline import make_pipeline, Pipeline
>
> num_indices1 = list(X.iloc[:,np.r_[0:94,95,97,100:123]].columns.values)
> cat_indices1 = list(X.iloc[:,np.r_[94,96,98,99,123:160]].columns.values)
> print(len(num_indices1))
> print(len(cat_indices1))
>
> pipeline=Pipeline(steps= [
>     # Categorical features
>     ('feature_processing', FeatureUnion(transformer_list = [
>             ('categorical', MultiColumn(cat_indices1)),
>
>             #numeric
>             ('numeric', Pipeline(steps = [
>                 ('select', MultiColumn(num_indices1)),
>                 ('scale', StandardScaler())
>             ]))
>         ])),
>     ('clf', rg)
>     ]
> )
>
> pipeline_with_resampling = make_pipeline(SMOTENC(categorical_features=cat_indices1), pipeline)
>
> On Sun, 20 Jan 2019 at 18:05, S Hamidizade <hamidizade.s at gmail.com> wrote:
>
>> Dear Scikit-learners
>> Hi.
>>
>> I would greatly appreciate it if you could let me know how to use
>> SMOTENC. I wrote:
>>
>> num_indices1 = list(X.iloc[:,np.r_[0:94,95,97,100:123]].columns.values)
>> cat_indices1 = list(X.iloc[:,np.r_[94,96,98,99,123:160]].columns.values)
>> print(len(num_indices1))
>> print(len(cat_indices1))
>>
>> pipeline=Pipeline(steps= [
>>     # Categorical features
>>     ('feature_processing', FeatureUnion(transformer_list = [
>>             ('categorical', MultiColumn(cat_indices1)),
>>
>>             #numeric
>>             ('numeric', Pipeline(steps = [
>>                 ('select', MultiColumn(num_indices1)),
>>                 ('scale', StandardScaler())
>>             ]))
>>         ])),
>>     ('clf', rg)
>>     ]
>> )
>>
>> Therefore, as indicated, I have 5 categorical features. Really, indices
>> 123 to 160 are related to one categorical feature with 37 possible values
>> which is converted into 37 columns using get_dummies.
>> Sorry, I think SMOTENC should be inserted before the classifier ('clf',
>> rg) but I don't know how to define "categorical_features" in SMOTENC.
>> Besides, could you please let me know where to use imblearn.pipeline?
>>
>> Thanks in advance.
>> Best regards,
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From g.lemaitre58 at gmail.com  Thu Jan 24 02:04:33 2019
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Thu, 24 Jan 2019 08:04:33 +0100
Subject: [scikit-learn] Imblearn: SMOTENC
In-Reply-To: 
Message-ID: <8lp16dn7dcdhmc9ec970igje.1548313473132@gmail.com>

As stated in the doc, categorical_features takes the indices of the
categorical columns, not the column names. This is similar to the one hot
encoder API.
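If it helps, the difference in a couple of lines (reusing the index layout
from the earlier posts -- illustrative only):

import numpy as np
from imblearn.over_sampling import SMOTENC

# positional column indices (ints), not column names (strings)
cat_indices1 = [int(i) for i in np.r_[94, 96, 98, 99, 123:160]]
smote = SMOTENC(categorical_features=cat_indices1)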
Sent from my phone - sorry to be brief and potentially misspell.

From pahome.chen at mirlab.org  Thu Jan 24 04:13:18 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Thu, 24 Jan 2019 17:13:18 +0800
Subject: [scikit-learn] How to determine suitable cluster algo
Message-ID: 

I want to build a customized clustering procedure for my datasets, because
I don't want to try every algo and its hyperparameters by hand. I thought I
would just define default ranges for the important hyperparameters, e.g.
the number of clusters in K-means.

I want to iterate over some possible clustering algos like K-means, DBSCAN,
AP... etc., and choose the most suitable algo for the clustering.

I'm not sure whether that is possible, but does GridSearchCV work for me?
Or is there any other way to determine that?

thx
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From matti.v.viljamaa at gmail.com  Thu Jan 24 04:42:12 2019
From: matti.v.viljamaa at gmail.com (Matti Viljamaa)
Date: Thu, 24 Jan 2019 11:42:12 +0200
Subject: [scikit-learn] How to determine suitable cluster algo
In-Reply-To: 
References: 
Message-ID: <5c498874.1c69fb81.a65c3.68df@mx.google.com>

GridSearchCV is meant for tuning the hyperparameters of a model over some
ranges of configurations and parameter values. Like the documentation
explains:

https://scikit-learn.org/stable/modules/grid_search.html

(and it also has some examples)
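For the "iterate over several algos" part: since clustering has no ground
truth to score against, a plain loop with an unsupervised metric such as
the silhouette coefficient may be simpler than GridSearchCV. A rough sketch
(toy data; the candidate list and parameter values are made up):

import numpy as np
from sklearn.cluster import KMeans, DBSCAN, AffinityPropagation
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

candidates = {
    'kmeans': KMeans(n_clusters=4, random_state=0),
    'dbscan': DBSCAN(eps=0.8),
    'ap': AffinityPropagation(),
}
for name, algo in candidates.items():
    labels = algo.fit_predict(X)
    if len(np.unique(labels)) > 1:  # silhouette needs >= 2 clusters
        print(name, silhouette_score(X, labels))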
The (e.g. 10-fold) cross-validation as a measure of accuracy (how
accurately the different folds attain the value of the statistic) and of
generalization (that the accuracy remains similar between folds) is at
least what I'm taught at uni.

A greater problem is how one can decide what parameters, or e.g. parameter
ranges, to look for. Some float-valued parameters might have ranges that
are "more often used", while others may not work most of the time.
Additionally, e.g. the kernels and such have some variants with more
general robustness, while others may become computationally very expensive
when combined with certain other parameters (such as in MLPClassifier,
where some activation functions and hidden_layer_sizes may correlate with
increased computation cost while not necessarily increasing accuracy).

The best I've figured out so far is to start with a few of the most often
used / major parameters and try to get them to produce results that are as
accurate as possible within still-affordable computation time. Only after
that, consider adding more params.

However, I've not found much info regarding how the parameters of different
methods are ordered in terms of "significance". One could assume that the
preceding ones are more major than the following ones. However, some of the
parameters also clearly "correlate" with each other, so they have
cross-effects on accuracy etc. Best is probably to just start trying, and
then perhaps write it down if you notice general patterns as to what works.

There's also:
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
for designing "pipelines", or a sort of "Design of Experiments" on sklearn
algos.

Also found this:
https://towardsdatascience.com/design-your-engineering-experiment-plan-with-a-simple-python-command-35a6ba52fa35
but I have not tried it, nor do I know whether it's necessary.

BR,
Matti

Sent from Mail for Windows 10

From: lampahome
Sent: Thursday, 24 January 2019 11.14
To: Scikit-learn mailing list
Subject: [scikit-learn] How to determine suitable cluster algo

I want to build a customized clustering procedure for my datasets, because
I don't want to try every algo and its hyperparameters by hand. I thought I
would just define default ranges for the important hyperparameters, e.g.
the number of clusters in K-means.

I want to iterate over some possible clustering algos like K-means, DBSCAN,
AP... etc., and choose the most suitable algo for the clustering.

I'm not sure whether that is possible, but does GridSearchCV work for me?
Or is there any other way to determine that?

thx
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From hamidizade.s at gmail.com  Thu Jan 24 10:17:46 2019
From: hamidizade.s at gmail.com (S Hamidizade)
Date: Thu, 24 Jan 2019 18:47:46 +0330
Subject: [scikit-learn] Imblearn: SMOTENC
In-Reply-To: 
References: 
Message-ID: 

Thanks. Unfortunately, now the error is:

ValueError: Some of the categorical indices are out of range. Indices
should be between 0 and 160.

Best regards,

On Sun, Jan 20, 2019 at 8:31 PM S Hamidizade <hamidizade.s at gmail.com> wrote:

> Dear Scikit-learners
> Hi.
>
> I would greatly appreciate it if you could let me know how to use SMOTENC.
> I wrote:
>
> num_indices1 = list(X.iloc[:,np.r_[0:94,95,97,100:123]].columns.values)
> cat_indices1 = list(X.iloc[:,np.r_[94,96,98,99,123:160]].columns.values)
> print(len(num_indices1))
> print(len(cat_indices1))
>
> pipeline=Pipeline(steps= [
>     # Categorical features
>     ('feature_processing', FeatureUnion(transformer_list = [
>             ('categorical', MultiColumn(cat_indices1)),
>
>             #numeric
>             ('numeric', Pipeline(steps = [
>                 ('select', MultiColumn(num_indices1)),
>                 ('scale', StandardScaler())
>             ]))
>         ])),
>     ('clf', rg)
>     ]
> )
>
> Therefore, as indicated, I have 5 categorical features. Really, indices
> 123 to 160 are related to one categorical feature with 37 possible values
> which is converted into 37 columns using get_dummies.
> Sorry, I think SMOTENC should be inserted before the classifier ('clf',
> rg) but I don't know how to define "categorical_features" in SMOTENC.
> Besides, could you please let me know where to use imblearn.pipeline?
>
> Thanks in advance.
> Best regards,
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From g.lemaitre58 at gmail.com  Thu Jan 24 10:43:04 2019
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Thu, 24 Jan 2019 16:43:04 +0100
Subject: [scikit-learn] Imblearn: SMOTENC
In-Reply-To: 
References: 
Message-ID: 

You should open a ticket on the imbalanced-learn GitHub issue tracker. That
makes it easier to post a reproducible example and for us to test it.
From the error message, I understand that you have 161 features and are
requiring a feature above the index 160.

On Thu, 24 Jan 2019 at 16:19, S Hamidizade <hamidizade.s at gmail.com> wrote:

> Thanks. Unfortunately, now the error is:
> ValueError: Some of the categorical indices are out of range. Indices
> should be between 0 and 160.
> Best regards,
>
> On Sun, Jan 20, 2019 at 8:31 PM S Hamidizade <hamidizade.s at gmail.com>
> wrote:
>
>> Dear Scikit-learners
>> Hi.
>>
>> I would greatly appreciate it if you could let me know how to use
>> SMOTENC. I wrote:
>>
>> num_indices1 = list(X.iloc[:,np.r_[0:94,95,97,100:123]].columns.values)
>> cat_indices1 = list(X.iloc[:,np.r_[94,96,98,99,123:160]].columns.values)
>> print(len(num_indices1))
>> print(len(cat_indices1))
>>
>> pipeline=Pipeline(steps= [
>>     # Categorical features
>>     ('feature_processing', FeatureUnion(transformer_list = [
>>             ('categorical', MultiColumn(cat_indices1)),
>>
>>             #numeric
>>             ('numeric', Pipeline(steps = [
>>                 ('select', MultiColumn(num_indices1)),
>>                 ('scale', StandardScaler())
>>             ]))
>>         ])),
>>     ('clf', rg)
>>     ]
>> )
>>
>> Therefore, as indicated, I have 5 categorical features. Really, indices
>> 123 to 160 are related to one categorical feature with 37 possible values
>> which is converted into 37 columns using get_dummies.
>> Sorry, I think SMOTENC should be inserted before the classifier ('clf',
>> rg) but I don't know how to define "categorical_features" in SMOTENC.
>> Besides, could you please let me know where to use imblearn.pipeline?
>>
>> Thanks in advance.
>> Best regards,
>>

-- 
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From pahome.chen at mirlab.org  Thu Jan 24 20:40:41 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Fri, 25 Jan 2019 09:40:41 +0800
Subject: [scikit-learn] How to determine suitable cluster algo
In-Reply-To: <5c498874.1c69fb81.a65c3.68df@mx.google.com>
References: <5c498874.1c69fb81.a65c3.68df@mx.google.com>
Message-ID: 

Maybe the suitable way is trial-and-error?

What concerns me is that my dataset is very large, and I can't try numbers
of clusters from 1 to N if I have N samples. That costs too much time.

Maybe I should choose the initial number of clusters based on execution
time, and then analyze whether the next step is to increase or decrease
the number of clusters?

thx
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From matti.v.viljamaa at gmail.com  Fri Jan 25 06:43:35 2019
From: matti.v.viljamaa at gmail.com (Matti Viljamaa)
Date: Fri, 25 Jan 2019 13:43:35 +0200
Subject: [scikit-learn] How to determine suitable cluster algo
In-Reply-To: 
References: <5c498874.1c69fb81.a65c3.68df@mx.google.com>
Message-ID: <5c4af668.1c69fb81.ee649.a884@mx.google.com>

For determining what one can afford computationally, see e.g.:

https://stackoverflow.com/questions/22443041/predicting-how-long-an-scikit-learn-classification-will-take-to-run
https://www.reddit.com/r/scikit_learn/comments/a746h0/is_there_any_way_to_estimate_how_long_a_given/

Sent from Mail for Windows 10

From: lampahome
Sent: Friday, 25 January 2019 3.42
To: Scikit-learn mailing list
Subject: Re: [scikit-learn] How to determine suitable cluster algo

Maybe the suitable way is trial-and-error?

What concerns me is that my dataset is very large, and I can't try numbers
of clusters from 1 to N if I have N samples. That costs too much time.

Maybe I should choose the initial number of clusters based on execution
time, and then analyze whether the next step is to increase or decrease
the number of clusters?

thx
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From liam at chatdesk.com  Fri Jan 25 12:26:37 2019
From: liam at chatdesk.com (Liam Geron)
Date: Fri, 25 Jan 2019 12:26:37 -0500
Subject: [scikit-learn] Google Cloud ML Error
Message-ID: 

Hi scikit learn contributors,

I am currently attempting to transfer our preexisting models into cloud ML
for scalability; however, I am encountering bugs while running through some
tutorial code found here
(https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/sklearn/notebooks/Online%20Prediction%20with%20scikit-learn.ipynb).

On both my local machine in a virtual environment and on the cloud shell,
I'm encountering errors when it comes to version creation and online
prediction. For version creation on my local machine and on the cloud shell
I'm encountering this error:

"ERROR: (gcloud.ml-engine.versions.create) Bad model detected with error:
"Failed to load model: Could not load the model:
/tmp/model/0001/model.joblib. 32. (Error code: 0)""

with Python 3.6.4 (local) and Python 3.5.6 (cloud shell) when running the
command:

gcloud ml-engine versions create $VERSION_NAME \
    --model $MODEL_NAME \
    --config config.yaml

Any help would be greatly appreciated.
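For reference, the export step that produced that file looks essentially
like this on our end (the toy model and bucket path here are placeholders):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import joblib

X, y = load_iris(return_X_y=True)
model = LogisticRegression().fit(X, y)

# ML Engine expects the artifact to be named exactly 'model.joblib'
joblib.dump(model, 'model.joblib')
# afterwards: gsutil cp model.joblib gs://<your-bucket>/model/model.joblib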
Thank you,
Liam Geron
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ross at cgl.ucsf.edu  Fri Jan 25 13:24:03 2019
From: ross at cgl.ucsf.edu (Bill Ross)
Date: Fri, 25 Jan 2019 10:24:03 -0800
Subject: [scikit-learn] Google Cloud ML Error
In-Reply-To: 
References: 
Message-ID: <196fae0d-33dd-4f98-4355-7dfaae383971@cgl.ucsf.edu>

Dumb generic cross-check from supporting compchem code in the day: what do
these give? Might yield a clue, e.g. all model files seeing this got
corrupted somehow.

$ file /tmp/model/0001/model.joblib
$ ls -l /tmp/model/0001/model.joblib

On 1/25/19 9:26 AM, Liam Geron wrote:
> Hi scikit learn contributors,
>
> I am currently attempting to transfer our preexisting models into
> cloud ML for scalability; however, I am encountering bugs while running
> through some tutorial code found here
> (https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/sklearn/notebooks/Online%20Prediction%20with%20scikit-learn.ipynb).
>
> On both my local machine in a virtual environment and on the cloud
> shell, I'm encountering errors when it comes to version creation and
> online prediction. For version creation on my local machine and on the
> cloud shell I'm encountering this error:
>
> "ERROR: (gcloud.ml-engine.versions.create) Bad model detected with
> error: "Failed to load model: Could not load the model:
> /tmp/model/0001/model.joblib. 32. (Error code: 0)""
>
> with Python 3.6.4 (local) and Python 3.5.6 (cloud shell) when running
> the command:
>
> gcloud ml-engine versions create $VERSION_NAME \
>     --model $MODEL_NAME \
>     --config config.yaml
>
> Any help would be greatly appreciated.
>
> Thank you,
> Liam Geron
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From liam at chatdesk.com  Fri Jan 25 13:54:21 2019
From: liam at chatdesk.com (Liam Geron)
Date: Fri, 25 Jan 2019 13:54:21 -0500
Subject: [scikit-learn] Google Cloud ML Error
In-Reply-To: <196fae0d-33dd-4f98-4355-7dfaae383971@cgl.ucsf.edu>
References: <196fae0d-33dd-4f98-4355-7dfaae383971@cgl.ucsf.edu>
Message-ID: 

No such luck, the file doesn't seem to exist. Here's the output on my
local:

"ls: /tmp/model/0001/model.joblib: No such file or directory"

and

"/tmp/model/0001/model.joblib: cannot open `/tmp/model/0001/model.joblib'
(No such file or directory)"

and on the cloud shell:

"ls: cannot access '/tmp/model/0001/model.joblib': No such file or
directory"

and

"/bin/sh: 1: file: not found".

On Fri, Jan 25, 2019 at 1:29 PM Bill Ross <ross at cgl.ucsf.edu> wrote:

> Dumb generic cross-check from supporting compchem code in the day: what
> do these give? Might yield a clue, e.g. all model files seeing this got
> corrupted somehow.
>
> $ file /tmp/model/0001/model.joblib
> $ ls -l /tmp/model/0001/model.joblib
>
> On 1/25/19 9:26 AM, Liam Geron wrote:
>> Hi scikit learn contributors,
>>
>> I am currently attempting to transfer our preexisting models into cloud
>> ML for scalability; however, I am encountering bugs while running
>> through some tutorial code found here
>> (https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/sklearn/notebooks/Online%20Prediction%20with%20scikit-learn.ipynb).
>>
>> On both my local machine in a virtual environment and on the cloud
>> shell, I'm encountering errors when it comes to version creation and
>> online prediction. For version creation on my local machine and on the
>> cloud shell I'm encountering this error:
>>
>> "ERROR: (gcloud.ml-engine.versions.create) Bad model detected with
>> error: "Failed to load model: Could not load the model:
>> /tmp/model/0001/model.joblib. 32. (Error code: 0)""
>>
>> with Python 3.6.4 (local) and Python 3.5.6 (cloud shell) when running
>> the command:
>>
>> gcloud ml-engine versions create $VERSION_NAME \
>>     --model $MODEL_NAME \
>>     --config config.yaml
>>
>> Any help would be greatly appreciated.
>>
>> Thank you,
>> Liam Geron
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ross at cgl.ucsf.edu  Fri Jan 25 14:33:01 2019
From: ross at cgl.ucsf.edu (Bill Ross)
Date: Fri, 25 Jan 2019 11:33:01 -0800
Subject: [scikit-learn] Google Cloud ML Error
In-Reply-To: 
References: 
Message-ID: 

Have you updated the project since this:

Since joblib is involved here as well, I'd look at that checkin. Joblib
expects there to be a model, maybe it is just configured to look in the
wrong place.

On 1/25/19 10:54 AM, Liam Geron wrote:
> No such luck, the file doesn't seem to exist. Here's the output on my
> local:
>
> "ls: /tmp/model/0001/model.joblib: No such file or directory"
>
> and
>
> "/tmp/model/0001/model.joblib: cannot open `/tmp/model/0001/model.joblib'
> (No such file or directory)"
>
> and on the cloud shell:
>
> "ls: cannot access '/tmp/model/0001/model.joblib': No such file or
> directory"
>
> and
>
> "/bin/sh: 1: file: not found".
>
> On Fri, Jan 25, 2019 at 1:29 PM Bill Ross <ross at cgl.ucsf.edu> wrote:
>
>> Dumb generic cross-check from supporting compchem code in the day: what
>> do these give? Might yield a clue, e.g. all model files seeing this got
>> corrupted somehow.
>>
>> $ file /tmp/model/0001/model.joblib
>> $ ls -l /tmp/model/0001/model.joblib
>>
>> On 1/25/19 9:26 AM, Liam Geron wrote:
>>> Hi scikit learn contributors,
>>>
>>> I am currently attempting to transfer our preexisting models into
>>> cloud ML for scalability; however, I am encountering bugs while
>>> running through some tutorial code found here
>>> (https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/sklearn/notebooks/Online%20Prediction%20with%20scikit-learn.ipynb).
>>>
>>> On both my local machine in a virtual environment and on the cloud
>>> shell, I'm encountering errors when it comes to version creation and
>>> online prediction. For version creation on my local machine and on
>>> the cloud shell I'm encountering this error:
>>>
>>> "ERROR: (gcloud.ml-engine.versions.create) Bad model detected with
>>> error: "Failed to load model: Could not load the model:
>>> /tmp/model/0001/model.joblib. 32. (Error code: 0)""
>>>
>>> with Python 3.6.4 (local) and Python 3.5.6 (cloud shell) when running
>>> the command:
>>>
>>> gcloud ml-engine versions create $VERSION_NAME \
>>>     --model $MODEL_NAME \
>>>     --config config.yaml
>>>
>>> Any help would be greatly appreciated.
>>>
>>> Thank you,
>>> Liam Geron
>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bjpobekjinilbgej.png
Type: image/png
Size: 19872 bytes
Desc: not available
URL: 

From liam at chatdesk.com  Fri Jan 25 15:16:49 2019
From: liam at chatdesk.com (Liam Geron)
Date: Fri, 25 Jan 2019 15:16:49 -0500
Subject: [scikit-learn] Google Cloud ML Error
In-Reply-To: 
References: <196fae0d-33dd-4f98-4355-7dfaae383971@cgl.ucsf.edu>
Message-ID: 

As in updated the sklearn module or the joblib module? I'm currently
running sklearn on 0.19.1 and joblib on 0.13.1. Do I need to be running
them on a specific version?

On Fri, Jan 25, 2019 at 2:35 PM Bill Ross <ross at cgl.ucsf.edu> wrote:

> Have you updated the project since this:
>
> Since joblib is involved here as well, I'd look at that checkin. Joblib
> expects there to be a model, maybe it is just configured to look in the
> wrong place.
>
> On 1/25/19 10:54 AM, Liam Geron wrote:
>> No such luck, the file doesn't seem to exist. Here's the output on my
>> local:
>>
>> "ls: /tmp/model/0001/model.joblib: No such file or directory"
>>
>> and
>>
>> "/tmp/model/0001/model.joblib: cannot open
>> `/tmp/model/0001/model.joblib' (No such file or directory)"
>>
>> and on the cloud shell:
>>
>> "ls: cannot access '/tmp/model/0001/model.joblib': No such file or
>> directory"
>>
>> and
>>
>> "/bin/sh: 1: file: not found".
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bjpobekjinilbgej.png
Type: image/png
Size: 19872 bytes
Desc: not available
URL: 

From matti.v.viljamaa at gmail.com  Fri Jan 25 15:31:20 2019
From: matti.v.viljamaa at gmail.com (Matti Viljamaa)
Date: Fri, 25 Jan 2019 22:31:20 +0200
Subject: [scikit-learn] How to determine suitable cluster algo
In-Reply-To: <5c4af668.1c69fb81.ee649.a884@mx.google.com>
References: <5c498874.1c69fb81.a65c3.68df@mx.google.com>
 <5c4af668.1c69fb81.ee649.a884@mx.google.com>
Message-ID: <5c4b7219.1c69fb81.72c03.c685@mx.google.com>

Also, remember that some algos may exhibit "sweet spots" w.r.t. computation
time and gained accuracy.

So you might want to keep measuring "explained variance" while you add
complexity to your models, and then do plots of model complexity vs
explained variance.

E.g. in MLPClassifier you'd plot e.g. hidden layers against explained
variance to figure out where adding hidden layers starts to exhibit lesser
gain in explained variance.

Sent from Mail for Windows 10

From: Matti Viljamaa
Sent: Friday, 25 January 2019 13.43
To: Scikit-learn mailing list
Subject: Re: [scikit-learn] How to determine suitable cluster algo

For determining what one can afford computationally, see e.g.:

https://stackoverflow.com/questions/22443041/predicting-how-long-an-scikit-learn-classification-will-take-to-run
https://www.reddit.com/r/scikit_learn/comments/a746h0/is_there_any_way_to_estimate_how_long_a_given/

Sent from Mail for Windows 10

From: lampahome
Sent: Friday, 25 January 2019 3.42
To: Scikit-learn mailing list
Subject: Re: [scikit-learn] How to determine suitable cluster algo

Maybe the suitable way is trial-and-error?

What concerns me is that my dataset is very large, and I can't try numbers
of clusters from 1 to N if I have N samples. That costs too much time.

Maybe I should choose the initial number of clusters based on execution
time, and then analyze whether the next step is to increase or decrease
the number of clusters?

thx
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ross at cgl.ucsf.edu  Fri Jan 25 18:05:57 2019
From: ross at cgl.ucsf.edu (Bill Ross)
Date: Fri, 25 Jan 2019 15:05:57 -0800
Subject: [scikit-learn] Google Cloud ML Error
Message-ID: <8g2jw1kfh6uo8fntxyywcyn3.1548457557070@email.android.com>

I'm a kibitzer who never ran it myself, just a compulsive debugger looking
at a basic possibility.

Bill
-------- Original message --------
From: Liam Geron
Date: 01/25/2019 12:16 PM (GMT-08:00)
To: Scikit-learn mailing list
Subject: Re: [scikit-learn] Google Cloud ML Error

As in updated the sklearn module? I'm currently running sklearn on 0.19.1
and joblib on 0.13.1. Do I need to be running them on a specific version?

On Fri, Jan 25, 2019 at 2:35 PM Bill Ross <ross at cgl.ucsf.edu> wrote:

Have you updated the project since this:

Since joblib is involved here as well, I'd look at that checkin. Joblib
expects there to be a model, maybe it is just configured to look in the
wrong place.

On 1/25/19 10:54 AM, Liam Geron wrote:

No such luck, the file doesn't seem to exist. Here's the output on my
local:

"ls: /tmp/model/0001/model.joblib: No such file or directory"

and

"/tmp/model/0001/model.joblib: cannot open `/tmp/model/0001/model.joblib'
(No such file or directory)"

and on the cloud shell:

"ls: cannot access '/tmp/model/0001/model.joblib': No such file or
directory"

and

"/bin/sh: 1: file: not found".

On Fri, Jan 25, 2019 at 1:29 PM Bill Ross <ross at cgl.ucsf.edu> wrote:

Dumb generic cross-check from supporting compchem code in the day: what do
these give? Might yield a clue, e.g. all model files seeing this got
corrupted somehow.

$ file /tmp/model/0001/model.joblib
$ ls -l /tmp/model/0001/model.joblib

On 1/25/19 9:26 AM, Liam Geron wrote:

Hi scikit learn contributors,

I am currently attempting to transfer our preexisting models into cloud ML
for scalability; however, I am encountering bugs while running through some
tutorial code found here
(https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/sklearn/notebooks/Online%20Prediction%20with%20scikit-learn.ipynb).

On both my local machine in a virtual environment and on the cloud shell,
I'm encountering errors when it comes to version creation and online
prediction. For version creation on my local machine and on the cloud shell
I'm encountering this error:

"ERROR: (gcloud.ml-engine.versions.create) Bad model detected with error:
"Failed to load model: Could not load the model:
/tmp/model/0001/model.joblib. 32. (Error code: 0)""

with Python 3.6.4 (local) and Python 3.5.6 (cloud shell) when running the
command:

gcloud ml-engine versions create $VERSION_NAME \
    --model $MODEL_NAME \
    --config config.yaml

Any help would be greatly appreciated.

Thank you,
Liam Geron

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bjpobekjinilbgej.png
Type: image/png
Size: 19872 bytes
Desc: not available
URL: 

From avigross at verizon.net  Fri Jan 25 21:34:09 2019
From: avigross at verizon.net (Avi Gross)
Date: Fri, 25 Jan 2019 21:34:09 -0500
Subject: [scikit-learn] How to determine suitable cluster algo
In-Reply-To: <5c4b7219.1c69fb81.72c03.c685@mx.google.com>
References: <5c498874.1c69fb81.a65c3.68df@mx.google.com>
 <5c4af668.1c69fb81.ee649.a884@mx.google.com>
 <5c4b7219.1c69fb81.72c03.c685@mx.google.com>
Message-ID: <005701d4b51f$9e270d00$da752700$@verizon.net>

My comments are at the end as some people do not like top posts.
From: scikit-learn On Behalf Of Matti Viljamaa
Sent: Friday, January 25, 2019 3:31 PM
To: Scikit-learn mailing list
Subject: Re: [scikit-learn] How to determine suitable cluster algo

Also, remember that some algos may exhibit "sweet spots" w.r.t. computation
time and gained accuracy. So you might want to keep measuring "explained
variance" while you add complexity to your models, and then do plots of
model complexity vs explained variance. E.g. in MLPClassifier you'd plot
e.g. hidden layers against explained variance to figure out where adding
hidden layers starts to exhibit lesser gain in explained variance.

Sent from Mail for Windows 10

From: Matti Viljamaa
Sent: Friday, 25 January 2019 13.43
To: Scikit-learn mailing list
Subject: Re: [scikit-learn] How to determine suitable cluster algo

For determining what one can afford computationally, see e.g.:

https://stackoverflow.com/questions/22443041/predicting-how-long-an-scikit-learn-classification-will-take-to-run
https://www.reddit.com/r/scikit_learn/comments/a746h0/is_there_any_way_to_estimate_how_long_a_given/

Sent from Mail for Windows 10

From: lampahome
Sent: Friday, 25 January 2019 3.42
To: Scikit-learn mailing list
Subject: Re: [scikit-learn] How to determine suitable cluster algo

Maybe the suitable way is trial-and-error?

What concerns me is that my dataset is very large, and I can't try numbers
of clusters from 1 to N if I have N samples. That costs too much time.

Maybe I should choose the initial number of clusters based on execution
time, and then analyze whether the next step is to increase or decrease
the number of clusters?

thx

__COMMENT__

This is a question, not a suggestion. The poster suggested they have such a
large amount of data that looking for larger numbers of clusters to find a
"sweet" spot may take too much time.

Is there any value in taking a much smaller random sample of the data, one
that remains big enough, and trying that on a reasonable range of cluster
counts? The results would not be definitive but might supply a clue as to
what range to try again with the full data.

As I see mentioned, the run time may not be going up if the data is
constant and only the number of clusters varies. I am not sure what
clustering algorithms you want to use, but for something like K-means with
reasonable data, the number of clusters that show meaningful results is
usually much smaller than the number of items in the data. The algorithms
often terminate when successive runs show little change; this is likely a
tunable parameter. So if you ask it to make N+1 clusters, it may even
terminate sooner than for N, if that number of clusters more closely
resembles the variation in the data.

And, again, if you are using a K-means variant, it may be better to use
some human intervention to see if a particular level of clustering fits
some model you can make that explains what each cluster has in common. If
you overfit, the number of clusters can effectively become the number of
unique items in your data and probably has no meaningful purpose.

Again, just a question. There are algorithms out there that deal better
with large data than others.
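To make the sampling question concrete, here is the kind of pilot run I
mean (all sizes and the k range are invented; silhouette is just one
possible yardstick):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=100000, centers=5, random_state=0)

# pilot on a small random sample to narrow the k range cheaply
rng = np.random.RandomState(0)
sample = X[rng.choice(len(X), size=2000, replace=False)]

for k in range(2, 9):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(sample)
    print(k, silhouette_score(sample, labels))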
Avi
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From hamidizade.s at gmail.com  Sat Jan 26 12:24:02 2019
From: hamidizade.s at gmail.com (S Hamidizade)
Date: Sat, 26 Jan 2019 20:54:02 +0330
Subject: [scikit-learn] Imblearn: SMOTENC
In-Reply-To: 
References: 
Message-ID: 

Thanks. The code is provided here:
https://github.com/scikit-learn-contrib/imbalanced-learn/issues/537

Best regards,

On Thu, Jan 24, 2019 at 7:15 PM Guillaume Lemaître <g.lemaitre58 at gmail.com> wrote:

> You should open a ticket on the imbalanced-learn GitHub issue tracker.
> That makes it easier to post a reproducible example and for us to test it.
> From the error message, I understand that you have 161 features and are
> requiring a feature above the index 160.
>
> On Thu, 24 Jan 2019 at 16:19, S Hamidizade <hamidizade.s at gmail.com> wrote:
>
>> Thanks. Unfortunately, now the error is:
>> ValueError: Some of the categorical indices are out of range. Indices
>> should be between 0 and 160.
>> Best regards,
>>
>> On Sun, Jan 20, 2019 at 8:31 PM S Hamidizade <hamidizade.s at gmail.com>
>> wrote:
>>
>>> Dear Scikit-learners
>>> Hi.
>>>
>>> I would greatly appreciate it if you could let me know how to use
>>> SMOTENC. I wrote:
>>>
>>> num_indices1 = list(X.iloc[:,np.r_[0:94,95,97,100:123]].columns.values)
>>> cat_indices1 = list(X.iloc[:,np.r_[94,96,98,99,123:160]].columns.values)
>>> print(len(num_indices1))
>>> print(len(cat_indices1))
>>>
>>> pipeline=Pipeline(steps= [
>>>     # Categorical features
>>>     ('feature_processing', FeatureUnion(transformer_list = [
>>>             ('categorical', MultiColumn(cat_indices1)),
>>>
>>>             #numeric
>>>             ('numeric', Pipeline(steps = [
>>>                 ('select', MultiColumn(num_indices1)),
>>>                 ('scale', StandardScaler())
>>>             ]))
>>>         ])),
>>>     ('clf', rg)
>>>     ]
>>> )
>>>
>>> Therefore, as indicated, I have 5 categorical features. Really, indices
>>> 123 to 160 are related to one categorical feature with 37 possible
>>> values which is converted into 37 columns using get_dummies.
>>> Sorry, I think SMOTENC should be inserted before the classifier ('clf',
>>> rg) but I don't know how to define "categorical_features" in SMOTENC.
>>> Besides, could you please let me know where to use imblearn.pipeline?
>>>
>>> Thanks in advance.
>>> Best regards,
>>>
> -- 
> Guillaume Lemaitre
> INRIA Saclay - Parietal team
> Center for Data Science Paris-Saclay
> https://glemaitre.github.io/
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From suryodaybasak at gmail.com  Sun Jan 27 01:25:18 2019
From: suryodaybasak at gmail.com (Suryoday Basak)
Date: Sun, 27 Jan 2019 00:25:18 -0600
Subject: [scikit-learn] Regarding GSOC and open source contributions
Message-ID: 

Dear Team,

Could you let me know if scikit-learn might be a GSOC organization this
year? I have a few proposal ideas in mind and have been working to
implement certain methods over the existing project, and was wondering if
I could talk to someone about how to go about things.

Thank you.

Regards,
Suryoday Basak
Graduate Student, Department of Computer Science and Engineering,
The University of Texas at Arlington
Website: suryodaybasak.info
Follow me on Medium: https://medium.com/@suryodaybasak
Astroinformatics Research Group: http://astrirg.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From liam at chatdesk.com  Mon Jan 28 10:28:40 2019
From: liam at chatdesk.com (Liam Geron)
Date: Mon, 28 Jan 2019 10:28:40 -0500
Subject: [scikit-learn] Google Cloud ML Engine Error with Sklearn
Message-ID: 

Hi scikit learn contributors,

I am currently attempting to transfer our preexisting models into cloud ML
for scalability; however, I am encountering bugs while running through some
tutorial code found here
(https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/sklearn/notebooks/Online%20Prediction%20with%20scikit-learn.ipynb).

On both my local machine in a virtual environment and on the cloud shell,
I'm encountering errors when it comes to version creation and online
prediction. For version creation on my local machine and on the cloud shell
I'm encountering this error:

"ERROR: (gcloud.ml-engine.versions.create) Bad model detected with error:
"Failed to load model: Could not load the model:
/tmp/model/0001/model.joblib. 32. (Error code: 0)""

with Python 3.6.4 (local) and Python 3.5.6 (cloud shell) when running the
command:

gcloud ml-engine versions create $VERSION_NAME \
    --model $MODEL_NAME \
    --config config.yaml

This is running with joblib version "0.13.1" and sklearn version "0.19.1".

Any help would be greatly appreciated.

Thank you,
Liam Geron
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From pahome.chen at mirlab.org  Tue Jan 29 05:35:50 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Tue, 29 Jan 2019 18:35:50 +0800
Subject: [scikit-learn] Is there rule to determine X and y when train
 regression?
Message-ID: 

I found many examples that predict stock prices, house prices, taxi
fares... etc. The y column is almost always like below:

y: the price of the day

And X may be the day, parameters which can affect the price... etc.

Now I want to predict the sales of multiple items in multiple stores. Is it
suitable to let the decrease/increase ratio of sales be y?

The reason I'm asking is that I don't know how to explain to other people
why price as y is the normal choice. So other people may ask: can we let y
be the increase/decrease ratio instead?

thx
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From mohit.srivastava at med.unideb.hu  Tue Jan 29 08:10:30 2019
From: mohit.srivastava at med.unideb.hu (Mohit Srivastava)
Date: Tue, 29 Jan 2019 14:10:30 +0100 (CET)
Subject: [scikit-learn] sklearn.cluster.OPTICS
Message-ID: <1288774727.8422712.1548767430428.JavaMail.zimbra@zimbra.unideb.hu>

Dear all,

I want to use your clustering algorithm "sklearn.cluster.OPTICS".
But it is not working, and I found (on the internet) that it's not
available at the moment.
Could you please help me with the issue?
When would it be possible to use it?
Please reply as soon as possible.
thanks
regards
Mohit Srivastava

From adrin.jalali at gmail.com  Tue Jan 29 08:39:52 2019
From: adrin.jalali at gmail.com (Adrin)
Date: Tue, 29 Jan 2019 14:39:52 +0100
Subject: [scikit-learn] sklearn.cluster.OPTICS
In-Reply-To: <1288774727.8422712.1548767430428.JavaMail.zimbra@zimbra.unideb.hu>
References: <1288774727.8422712.1548767430428.JavaMail.zimbra@zimbra.unideb.hu>
Message-ID: 

Hi,

OPTICS is still under development and there are quite a few open issues
and PRs regarding the method. It's available on master, but not in any of
the releases yet. We will hopefully have it out for the next release.
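If you need it right away, it can be used from a development install;
usage looks roughly like this (a sketch against master, so the API may
still change before the release):

# requires a dev install, e.g.
# pip install git+https://github.com/scikit-learn/scikit-learn
import numpy as np
from sklearn.cluster import OPTICS

X = np.random.RandomState(0).randn(100, 2)
clust = OPTICS(min_samples=10).fit(X)
print(clust.labels_[:10])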
Best,
Adrin.

On Tue, 29 Jan 2019 at 14:31 Mohit Srivastava <mohit.srivastava at med.unideb.hu> wrote:

> Dear all,
>
> I want to use your clustering algorithm "sklearn.cluster.OPTICS".
> But it is not working, and I found (on the internet) that it's not
> available at the moment.
> Could you please help me with the issue?
> When would it be possible to use it?
> Please reply as soon as possible.
> thanks
> regards
> Mohit Srivastava
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From pahome.chen at mirlab.org  Wed Jan 30 05:42:41 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Wed, 30 Jan 2019 18:42:41 +0800
Subject: [scikit-learn] Can y of datasets be increasing/decreasing ratio
 when train regression model?
Message-ID: 

I found many cases on kaggle that predict a quantity or a trend. They all
set the real quantity as y.

But my question is: does anyone set the changing ratio as y?

Like:

X       y
Day1    0.2
Day2    0.1
Day3    0.15
Day4   -0.1

where y is the changing ratio compared with the previous day.

Why does everybody set the real quantity (e.g. sales, car numbers... etc.)
as y rather than the changing ratio?

I want to know whether that is based on experience or on other reasons.

thx
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From charles.y.zheng at gmail.com  Wed Jan 30 12:09:45 2019
From: charles.y.zheng at gmail.com (Charles Zheng)
Date: Wed, 30 Jan 2019 12:09:45 -0500
Subject: [scikit-learn] Can y of datasets be increasing/decreasing ratio
 when train regression model?
In-Reply-To: 
References: 
Message-ID: 

Hi lampahome,

It is a common practice in financial modeling
(https://en.wikipedia.org/wiki/Capital_asset_pricing_model).

[image: formula defining the return R_t in terms of the prices P_t]

P_t is price at time t, R_t is "return", which is the variable they are
trying to predict.

Best,

Charles

On Wed, Jan 30, 2019 at 5:43 AM lampahome <pahome.chen at mirlab.org> wrote:

> I found many cases on kaggle that predict a quantity or a trend. They all
> set the real quantity as y.
>
> But my question is: does anyone set the changing ratio as y?
>
> y is the changing ratio compared with the previous day.
>
> Why does everybody set the real quantity (e.g. sales, car numbers... etc.)
> as y rather than the changing ratio?
>
> I want to know whether that is based on experience or on other reasons.
>
> thx
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 2928 bytes
Desc: not available
URL: 

From joel.nothman at gmail.com  Wed Jan 30 19:46:42 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 31 Jan 2019 11:46:42 +1100
Subject: [scikit-learn] Can y of datasets be increasing/decreasing ratio
 when train regression model?
In-Reply-To: 
References: 
Message-ID: 

Particular regressors may make assumptions about the distribution of y, or
its relationship with the features X. You should be aware of those
assumptions and reason about whether they hold well enough.

A TransformedTargetRegressor may be used to make your target better match
those assumptions, e.g. by trying to predict the logarithm or power
transform of the original targets, but again you need to look at the
distribution of y and the assumptions of the regressor.
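For concreteness, a minimal sketch of that idea (synthetic data; the
log/exp pair is chosen just for illustration):

import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.uniform(0, 5, size=(200, 1))
y = np.exp(1.5 * X.ravel() + rng.normal(scale=0.2, size=200))  # skewed y

# fit a linear model to log(y), predict back on the original scale
reg = TransformedTargetRegressor(regressor=LinearRegression(),
                                 func=np.log, inverse_func=np.exp)
reg.fit(X, y)
print(reg.predict(X[:3]), y[:3])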
On Wed, 30 Jan 2019 at 21:44, lampahome <pahome.chen at mirlab.org> wrote:

> I found many cases on kaggle that predict a quantity or a trend. They all
> set the real quantity as y.
>
> But my question is: does anyone set the changing ratio as y?
>
> Like:
>
> X       y
> Day1    0.2
> Day2    0.1
> Day3    0.15
> Day4   -0.1
>
> where y is the changing ratio compared with the previous day.
>
> Why does everybody set the real quantity (e.g. sales, car numbers... etc.)
> as y rather than the changing ratio?
>
> I want to know whether that is based on experience or on other reasons.
>
> thx
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From pahome.chen at mirlab.org  Wed Jan 30 20:45:52 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Thu, 31 Jan 2019 09:45:52 +0800
Subject: [scikit-learn] Can y of datasets be increasing/decreasing ratio
 when train regression model?
In-Reply-To: 
References: 
Message-ID: 

> but again you need to look at the distribution of y and the assumptions
> of the regressor.
>
So first, should I plot a graph to check the distribution of y as X
changes? I'm just wondering how to determine its distribution.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jaapvankampen at gmail.com  Thu Jan 31 04:51:36 2019
From: jaapvankampen at gmail.com (Jaap van Kampen)
Date: Thu, 31 Jan 2019 10:51:36 +0100
Subject: [scikit-learn] Bounded logistical regression in Python
Message-ID: 

Hi there!

The standard logistical regression solver in scikit-learn assumes the
regression equation:

P(X) = 1 / (1 + exp(b0 + b1*X1 + ... + bn*Xn))

.. and solves for the b's using various solver routines.

For a specific project, I'd like to bound the regression equation between
0-a (instead of 0-1) and add a variable c to center an independent variable
Xk, e.g.

P(X) = a / (1 + exp(b0 + b1*X1 + .. + bn*Xn) * (Xk - c)))
... and solve for a, b's and c.

Any thoughts/ideas on how to modify logistic.py to achieve this? I thought
of modifying the expit function to reflect the changed equation. But how do
I let the solvers know to also include the new variables a and c? Any
scripts available that are able to handle my modified logistic regression
equation?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From joel.nothman at gmail.com  Thu Jan 31 05:48:43 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 31 Jan 2019 21:48:43 +1100
Subject: [scikit-learn] Bounded logistical regression in Python
In-Reply-To: 
References: 
Message-ID: 

I don't quite get your terminology, to "add a variable c to center an
independent variable Xk", and you've got an extra ) in your equation, so
I'm not sure exactly where you want it...

If you mean

P(X) = a / (1 + exp(b0 + b1*X1 + .. + bn*Xn) * (Xk - c))

then, since exp(z) * (Xk - c) = exp(z + log(Xk - c)), that's the same as

P(X) = a / (1 + exp(b0 + b1*X1 + .. + bn*Xn + log(Xk - c)))

so for any given a and c you've got the same old logistic regression, just
with the extra feature log(Xk - c) in the exponent, haven't you?
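And if you really do want a and c as free parameters, one pragmatic route
that avoids touching logistic.py is to fit the curve directly, e.g. with
scipy.optimize.curve_fit. A rough sketch on synthetic data (this is
least-squares rather than a proper likelihood fit, and it assumes the
reading of the equation above):

import numpy as np
from scipy.optimize import curve_fit

def bounded_logistic(X, a, b0, b1, c):
    # P = a / (1 + exp(b0 + b1*x1) * (xk - c)); one plain feature x1 plus
    # the "centered" feature xk, for brevity
    x1, xk = X
    return a / (1 + np.exp(b0 + b1 * x1) * (xk - c))

rng = np.random.RandomState(0)
x1 = rng.uniform(-2, 2, 300)
xk = rng.uniform(2, 4, 300)  # kept > c so the denominator stays positive
p_true = bounded_logistic((x1, xk), 0.8, 0.5, -1.2, 1.0)
y = rng.binomial(1, np.clip(p_true, 0, 1))

params, _ = curve_fit(bounded_logistic, (x1, xk), y, p0=[0.5, 0.0, 0.0, 1.0])
print(params)  # least-squares estimates of a, b0, b1, c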
On Thu, 31 Jan 2019 at 20:53, Jaap van Kampen <jaapvankampen at gmail.com> wrote:

> Hi there!
>
> The standard logistical regression solver in scikit-learn assumes the
> regression equation:
>
> P(X) = 1 / (1 + exp(b0 + b1*X1 + ... + bn*Xn))
>
> .. and solves for the b's using various solver routines.
>
> For a specific project, I'd like to bound the regression equation between
> 0-a (instead of 0-1) and add a variable c to center an independent
> variable Xk, e.g.
>
> P(X) = a / (1 + exp(b0 + b1*X1 + .. + bn*Xn) * (Xk - c)))
>
> ... and solve for a, b's and c.
>
> Any thoughts/ideas on how to modify logistic.py to achieve this? I thought
> of modifying the expit function to reflect the changed equation. But how
> do I let the solvers know to also include the new variables a and c? Any
> scripts available that are able to handle my modified logistic regression
> equation?
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 