From pahome.chen at mirlab.org  Thu Jan 3 22:44:44 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Fri, 4 Jan 2019 11:44:44 +0800
Subject: [scikit-learn] How GridSearchCV to get best_params?

As the title says.

In the doc it says:

best_params_ : dict
    Parameter setting that gave the best results on the hold out data.

My question is: what is the hold out data?
Is it the score on the training data, on the test data, or the mean of the
train and test scores?

thx

From mail at sebastianraschka.com  Thu Jan 3 22:50:16 2019
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Thu, 3 Jan 2019 21:50:16 -0600
Subject: [scikit-learn] How GridSearchCV to get best_params?
Message-ID: <21200DB3-457F-445B-B00F-12EF55F02908@sebastianraschka.com>

I think it refers to the test folds of the k-fold cross-validation that is
used internally via the `cv` parameter of GridSearchCV (or the test folds of
an alternative cross-validation scheme that you may pass as an iterable to
`cv`).

Best,
Sebastian

> On Jan 3, 2019, at 9:44 PM, lampahome wrote:
>
> My question is: what is the hold out data?
> Is it the score on the training data, on the test data, or the mean of the
> train and test scores?

From joel.nothman at gmail.com  Sat Jan 5 05:32:28 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Sat, 5 Jan 2019 21:32:28 +1100
Subject: [scikit-learn] How GridSearchCV to get best_params?
In-Reply-To: <21200DB3-457F-445B-B00F-12EF55F02908@sebastianraschka.com>

See cv_results_['mean_test_score'] (or 'mean_test_x' where 'x' is the scorer
named in the refit parameter).
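For the single-metric case, a rough sketch of how these attributes relate
(illustrative data and parameter grid, not from the original thread):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
# cv=5: every parameter candidate is scored on 5 held-out validation folds
search = GridSearchCV(LogisticRegression(solver='liblinear'),
                      param_grid={'C': [0.1, 1, 10]}, cv=5)
search.fit(X, y)

print(search.cv_results_['mean_test_score'])  # mean held-out-fold score per candidate
# best_params_ is the candidate with the highest mean held-out-fold score
print(search.best_params_)
print(search.best_index_ == np.argmax(search.cv_results_['mean_test_score']))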
From gael.varoquaux at normalesup.org  Mon Jan 7 16:38:44 2019
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Mon, 7 Jan 2019 22:38:44 +0100
Subject: [scikit-learn] Next Sprint
Message-ID: <20190107213844.mjn3mas743cbrsrs@phare.normalesup.org>

Hi everybody and happy new year,

We let this thread about the sprint die. I hope that this didn't change
people's plans.

So, it seems that the week of Feb 25th is a good week. I'll assume that
it's good for most and start planning from there (if it's not the case,
let me know).

I've started our classic sprint-planning wiki page:
https://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events
It's not rocket science, but it's better than an email thread to keep
information together.

It would be great if people could add their name, and whether they need
funding. We need to evaluate if we need to find funding.

Also, it's quite soon, so maybe it would be good to start planning
accommodation and travel :$.

Cheers,

Gaël

On Sat, Dec 22, 2018 at 05:27:39PM +0100, Guillaume Lemaître wrote:
> Works for me as well.
> Sent from my phone - sorry to be brief and potential misspell.

> -------- Original Message --------
> From: rth.yurchak at pm.me
> Subject: Re: [scikit-learn] Next Sprint

> That works for me as well.

> On 21/12/2018 16:00, Olivier Grisel wrote:
> > Ok for me. The last 3 weeks of February are fine for me.
> >
> > Le jeu. 20 déc. 2018 à 21:21, Alexandre Gramfort a écrit :
> >     ok for me
> >     Alex
> >
> >     On Thu, Dec 20, 2018 at 8:35 PM Adrin wrote:
> >     > It'll be the least favourable week of February for me, but I can
> >     > make do.
> >
> >     > On Thu, 20 Dec 2018 at 18:45 Andreas Mueller wrote:
> >     >> Works for me!
> >
> >     >> On 12/19/18 5:33 PM, Gael Varoquaux wrote:
> >     >> > I would propose the week of Feb 25th, as I heard people say
> >     >> > that they might be available at this time. It is good for many
> >     >> > people, or should we organize a doodle?
> >     >> > G
> >
> >     >> > On Wed, Dec 19, 2018 at 05:27:21PM -0500, Andreas Mueller wrote:
> >     >> >> Can we please nail down dates for a sprint?
> >
> >     >> >> On 11/20/18 2:25 PM, Gael Varoquaux wrote:
> >     >> >>> On Tue, Nov 20, 2018 at 08:15:07PM +0100, Olivier Grisel wrote:
> >     >> >>>> We can also do Paris in April / May or June if that's ok
> >     >> >>>> with Joel and better for Andreas.
> >     >> >>> Absolutely.
> >     >> >>> My thoughts here are that I want to minimize transportation,
> >     >> >>> partly because flying has a large carbon footprint. Also, for
> >     >> >>> personal reasons, I am not sure that I will be able to make it
> >     >> >>> to Austin in July, but I realize that this is a pretty bad
> >     >> >>> argument.
> >     >> >>> We're happy to try to host in Paris whenever it's most
> >     >> >>> convenient and to try to help with travel for those not in
> >     >> >>> Paris.
> >     >> >>> Gaël

--
Gael Varoquaux
Senior Researcher, INRIA Parietal
NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info            http://twitter.com/GaelVaroquaux
From pisymbol at gmail.com  Mon Jan 7 23:50:49 2019
From: pisymbol at gmail.com (pisymbol)
Date: Mon, 7 Jan 2019 23:50:49 -0500
Subject: [scikit-learn] LogisticRegression coef_ greater than n_features?

According to the doc (0.20.2), the coef_ attribute is supposed to be of
shape (1, n_features) for binary classification. Well, I created a Pipeline
and performed a GridSearchCV to create a LogisticRegression model that does
fairly well. However, when I went to rank feature importance, I noticed that
the coef_ of my best_estimator_ has 24 entries while my training data has 22
features.

What am I missing? How could coef_ > n_features?

-aps

From pisymbol at gmail.com  Tue Jan 8 00:02:17 2019
From: pisymbol at gmail.com (pisymbol)
Date: Tue, 8 Jan 2019 00:02:17 -0500
Subject: [scikit-learn] LogisticRegression coef_ greater than n_features?

Just a follow-up: I am using a OneHotEncoder to encode two categoricals as
part of my pipeline (I am also using an imputer and a standard scaler, but I
don't see how those could add features).

Could my pipeline actually add two more features during fitting?

-aps

From mail at sebastianraschka.com  Mon Jan 7 23:54:50 2019
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Mon, 7 Jan 2019 22:54:50 -0600
Subject: [scikit-learn] LogisticRegression coef_ greater than n_features?
Message-ID: <2A93A0B0-359D-4C30-9ED7-2A166926E0F6@sebastianraschka.com>

Maybe check

a) if the actual labels of the training examples don't start at 0
b) if you have gaps, e.g., if your unique training labels are 0, 1, 4, ..., 23

Best,
Sebastian

> On Jan 7, 2019, at 10:50 PM, pisymbol wrote:
>
> What am I missing? How could coef_ > n_features?
From mail at sebastianraschka.com  Tue Jan 8 00:32:22 2019
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Mon, 7 Jan 2019 23:32:22 -0600
Subject: [scikit-learn] LogisticRegression coef_ greater than n_features?
Message-ID: <1061B4E0-615B-4658-B8F6-946D7CBAAD94@sebastianraschka.com>

E.g., if you have a feature with values 'a', 'b', 'c', then applying the
one-hot encoder will transform it into 3 features.

Best,
Sebastian

> On Jan 7, 2019, at 11:02 PM, pisymbol wrote:
>
> Just a follow-up: I am using a OneHotEncoder to encode two categoricals as
> part of my pipeline.
>
> Could my pipeline actually add two more features during fitting?
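A quick way to see where the two extra entries could come from (illustrative
data; the exact counts depend on your categories): if each of the two
categoricals has exactly two levels, one-hot encoding turns those 2 columns
into 4, so 20 numeric + 4 one-hot = 24 coefficients.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# two categorical columns, each with two distinct values
X_cat = np.array([['a', 'x'],
                  ['b', 'y'],
                  ['a', 'y']])
ohe = OneHotEncoder(sparse=False)
print(ohe.fit_transform(X_cat).shape)  # (3, 4): each 2-level column becomes 2 columns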
From qinhanmin2005 at sina.com  Tue Jan 8 08:13:39 2019
From: qinhanmin2005 at sina.com (Hanmin Qin)
Date: Tue, 08 Jan 2019 21:13:39 +0800
Subject: [scikit-learn] Next Sprint
Message-ID: <20190108131339.3B9965D0009B@webmail.sinamail.sina.com.cn>

Apologies, I won't be available because of school work. Thanks to the whole
community for your great help. I'll continue to contribute and stay online
during the sprint.

Hanmin Qin

----- Original Message -----
From: Gael Varoquaux
Subject: Re: [scikit-learn] Next Sprint
Date: 2019-01-08 05:40

> So, it seems that the week of Feb 25th is a good week. I'll assume that
> it's good for most and start planning from there (if it's not the case,
> let me know).

From astha31agarwal at gmail.com  Tue Jan 8 09:26:25 2019
From: astha31agarwal at gmail.com (Astha Agarwal)
Date: Tue, 8 Jan 2019 09:26:25 -0500
Subject: [scikit-learn] Using sklearn-crfsuite on Production Systems

Hi,

I'm wondering if anyone is using sklearn-crfsuite on production systems. Is
this library suitable for use on production systems in industry (rather than
academia), for non-big-data problems?

Thanks,
Astha
From pisymbol at gmail.com  Tue Jan 8 09:51:20 2019
From: pisymbol at gmail.com (pisymbol)
Date: Tue, 8 Jan 2019 09:51:20 -0500
Subject: [scikit-learn] LogisticRegression coef_ greater than n_features?

If that is the case, what order are the coefficients in, then?

-aps

> On Tue, Jan 8, 2019 at 12:48 AM Sebastian Raschka wrote:
>
> E.g., if you have a feature with values 'a', 'b', 'c', then applying the
> one-hot encoder will transform it into 3 features.

From pisymbol at gmail.com  Tue Jan 8 10:33:04 2019
From: pisymbol at gmail.com (pisymbol)
Date: Tue, 8 Jan 2019 10:33:04 -0500
Subject: [scikit-learn] LogisticRegression coef_ greater than n_features?

Also, Sebastian, I have binary classes, but they are strings:

clf.classes_:
array(['American', 'Southwest'], dtype=object)

> On Tue, Jan 8, 2019 at 9:51 AM pisymbol wrote:
>
> If that is the case, what order are the coefficients in, then?
From mail at sebastianraschka.com  Tue Jan 8 20:07:03 2019
From: mail at sebastianraschka.com (Sebastian Raschka)
Date: Tue, 8 Jan 2019 19:07:03 -0600
Subject: [scikit-learn] LogisticRegression coef_ greater than n_features?
Message-ID: <4A92B9C5-9E20-48CE-A42A-261ABE720505@sebastianraschka.com>

It seems to be determined by the sorted order of the unique values in the
training set (note that in the first example below, 'a' gets the first
column even though 'b' occurs first). E.g.,

from sklearn.preprocessing import OneHotEncoder
import numpy as np

x = np.array([['b'], ['a'], ['b']])
ohe = OneHotEncoder()
xt = ohe.fit_transform(x)
xt.todense()

matrix([[0., 1.],
        [1., 0.],
        [0., 1.]])

and

x = np.array([['a'], ['b'], ['a']])
ohe = OneHotEncoder()
xt = ohe.fit_transform(x)
xt.todense()

matrix([[1., 0.],
        [0., 1.],
        [1., 0.]])

Not sure how you used the OHE, but you also want to make sure that you only
use it on those columns that are indeed categorical; e.g., note the
following behavior, where the float column is treated as categorical too
and yields 5 columns in total:

x = np.array([['a', 1.1],
              ['b', 1.2],
              ['a', 1.3]])
ohe = OneHotEncoder()
xt = ohe.fit_transform(x)
xt.todense()

matrix([[1., 0., 1., 0., 0.],
        [0., 1., 0., 1., 0.],
        [1., 0., 0., 0., 1.]])

Best,
Sebastian

> On Jan 8, 2019, at 9:33 AM, pisymbol wrote:
>
> Also, Sebastian, I have binary classes, but they are strings:
>
> clf.classes_:
> array(['American', 'Southwest'], dtype=object)
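Following up on the ordering question: the fitted encoder exposes the
learned category order and, in 0.20+, can generate column names. A sketch,
where 'carrier' is just a made-up input-feature name:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

X_cat = np.array([['American'], ['Southwest'], ['American']])
ohe = OneHotEncoder().fit(X_cat)
print(ohe.categories_)                     # sorted categories = column order of the dummies
print(ohe.get_feature_names(['carrier']))  # e.g. ['carrier_American' 'carrier_Southwest']

Since a FeatureUnion concatenates its transformers' outputs in the order
they are listed, stacking these names in front of the numeric column names
(if your categorical block comes first) should label the entries of coef_.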
From pahome.chen at mirlab.org  Tue Jan 8 20:23:32 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Wed, 9 Jan 2019 09:23:32 +0800
Subject: [scikit-learn] Does sklearn contain xgboost?

As the title says: does sklearn contain xgboost to use?

thanks

From niourf at gmail.com  Tue Jan 8 21:03:01 2019
From: niourf at gmail.com (Nicolas Hug)
Date: Tue, 8 Jan 2019 21:03:01 -0500
Subject: [scikit-learn] Does sklearn contain xgboost?
Message-ID: <1f0c4259-6e73-61ab-f6d1-a16ef7b5811f@gmail.com>

XGBoost is a specific implementation of gradient boosting trees, so
strictly speaking scikit-learn cannot "contain" XGBoost. That being said:

- XGBoost has a scikit-learn compatible API:
https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn.
So does LightGBM, another fast implementation of gradient boosting trees.

- scikit-learn implements "vanilla" gradient boosting:
https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting

- There's an open PR in scikit-learn (still very WIP) that implements the
same kind of optimization that XGBoost and LightGBM use, which will make
GBDT faster: https://github.com/scikit-learn/scikit-learn/pull/12807

Nicolas
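For instance, with the xgboost package installed separately (pip install
xgboost), its wrapper drops into the usual scikit-learn workflow. A minimal
sketch (illustrative dataset and parameters):

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier  # not part of scikit-learn itself

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = XGBClassifier(n_estimators=100, max_depth=3)  # scikit-learn style estimator
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))

Because the API is compatible, the estimator can also be dropped into
GridSearchCV or a Pipeline.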
From t3kcit at gmail.com  Wed Jan 9 14:09:58 2019
From: t3kcit at gmail.com (Andreas Mueller)
Date: Wed, 9 Jan 2019 14:09:58 -0500
Subject: [scikit-learn] Next Sprint
Message-ID: <6b1a85d7-029e-6024-d29c-75dbb0828735@gmail.com>

Great, thanks for finalizing!

It would be good to get some vague estimate of funding. I can probably
provide some, though I'm in the process of hiring Thomas Fan, which might
tie up some of my funds.

Gaël, does the foundation have funds and do you want to use them?
And/or do you/Inria have funds you want to use?

On 1/7/19 4:38 PM, Gael Varoquaux wrote:
> So, it seems that the week of Feb 25th is a good week. I'll assume that
> it's good for most and start planning from there.
>
> It would be great if people could add their name, and whether they need
> funding. We need to evaluate if we need to find funding.
From pahome.chen at mirlab.org  Thu Jan 10 03:47:14 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Thu, 10 Jan 2019 16:47:14 +0800
Subject: [scikit-learn] Any clustering algo to cluster by the ratio of series data?

Clustering algorithms group samples by calculating e.g. the Euclidean
distance. I wonder if any clustering algo can cluster series data by its
shape rather than its magnitude.

Ex: every item has its sold numbers for every day:

Item,Day1,Day2,Day3,Day4,Day5
A,1,5,1,5,1
B,10,50,10,50,10
C,4,70,30,10,50

The day-to-day change ratios of A and B are identical (500%, 20%, 500%,
20%), so I want A & B in the same cluster and C in another one.

If I don't want to compute the change ratio of each sample myself, is there
any way to cluster by the change ratio of the samples?

thx
From gael.varoquaux at normalesup.org  Thu Jan 10 10:34:08 2019
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Thu, 10 Jan 2019 16:34:08 +0100
Subject: [scikit-learn] Next Sprint
Message-ID: <20190110153408.hrxjcuy2zbj3t22o@phare.normalesup.org>

On Wed, Jan 09, 2019 at 02:09:58PM -0500, Andreas Mueller wrote:
> Gaël, does the foundation have funds and do you want to use them?
> And/or do you/Inria have funds you want to use?

Neither myself nor Inria has funds to use outside the foundation. The
foundation can commit money if needed. We tend to prefer spending it on
paying senior people to work on the project, as that is the bottleneck (we
are still recruiting, by the way), but such a sprint is important.

We will also apply for sprint-specific funding sources. If we can lighten
up your budget, so that you can pay awesome people to work on the project,
it is a good thing.

Gaël
From t3kcit at gmail.com  Thu Jan 10 12:32:17 2019
From: t3kcit at gmail.com (Andreas Mueller)
Date: Thu, 10 Jan 2019 12:32:17 -0500
Subject: [scikit-learn] Next Sprint
Message-ID: <4e421d64-2ede-80ba-932a-e366b515133d@gmail.com>

Ok, good to know. And I totally agree about using foundation money to pay
senior people. Though discussion time between senior people is also a
serious bottleneck imho ;)

Any sprint-specific funding you're thinking of? Google gave in the past,
right? I could cold-email some people (Two Sigma, Bloomberg?) but I'm not
sure that's very promising.

On 1/10/19 10:34 AM, Gael Varoquaux wrote:
> Neither myself nor Inria has funds to use outside the foundation. The
> foundation can commit money if needed. We tend to prefer spending it on
> paying senior people to work on the project.

From gael.varoquaux at normalesup.org  Thu Jan 10 12:36:22 2019
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Thu, 10 Jan 2019 18:36:22 +0100
Subject: [scikit-learn] Next Sprint
Message-ID: <20190110173622.f54rtctpvftlh2lx@phare.normalesup.org>

On Thu, Jan 10, 2019 at 12:32:17PM -0500, Andreas Mueller wrote:
> Any sprint-specific funding you're thinking of? Google gave in the past,
> right?

I was thinking of the PSF.
Gaël

From t3kcit at gmail.com  Thu Jan 10 12:54:09 2019
From: t3kcit at gmail.com (Andreas Mueller)
Date: Thu, 10 Jan 2019 12:54:09 -0500
Subject: [scikit-learn] Next Sprint
Message-ID: <0eb889b1-10f2-4649-f035-602c589a8c6c@gmail.com>

Do you or anyone on your team have cycles to do that?
I certainly don't, but I could try to delegate (to the single person I
delegate everything to ;)

On 1/10/19 12:36 PM, Gael Varoquaux wrote:
> I was thinking of the PSF.

From gael.varoquaux at normalesup.org  Thu Jan 10 13:19:05 2019
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Thu, 10 Jan 2019 19:19:05 +0100
Subject: [scikit-learn] Next Sprint
Message-ID: <20190110181905.6pyuuaj4vl4vdznz@phare.normalesup.org>

On Thu, Jan 10, 2019 at 12:54:09PM -0500, Andreas Mueller wrote:
> Do you or anyone on your team have cycles to do that?

I asked Guillaume Lemaître to do it. He has started.

Gaël

From rohanlekhwani at gmail.com  Fri Jan 11 05:32:41 2019
From: rohanlekhwani at gmail.com (Rohan Lekhwani)
Date: Fri, 11 Jan 2019 16:02:41 +0530
Subject: [scikit-learn] GSoC 2019

Hello,

I'm an undergraduate interested in participating in GSoC 2019. I wanted to
inquire whether scikit-learn will be participating under the umbrella of
the Python Software Foundation as a sub-org this year. Thanks.

Rohan
From gael.varoquaux at normalesup.org  Wed Jan 16 05:49:48 2019
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Wed, 16 Jan 2019 11:49:48 +0100
Subject: [scikit-learn] Non-core developers at the sprint
Message-ID: <20190116104948.d3hytjd3zvvcpuxl@phare.normalesup.org>

Dear users and developers,

We have a sprint coming up in Paris Feb 25th to March 1st:
https://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events

Looking at the list of people who are coming, I am noticing that we have
mostly core developers. While the priority of the sprint is to work on the
big picture rather than onboarding, I am worried that there might be some
self-selection happening. I am sure that some excellent people who are
contributors, yet not core contributors, could come.

I would like to invite people who have already contributed and want to get
more involved in the project to contact us about joining the sprint.
Specifically, we are willing to fund accommodation and travel for one or
two participants. Please send a short message to Guillaume Lemaître and
myself presenting what you have contributed and what you would like to
contribute, as well as your funding needs. We will curate this list and
core contributors will settle on whom we can accommodate.

Cheers,

Gaël

From pahome.chen at mirlab.org  Wed Jan 16 23:29:04 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Thu, 17 Jan 2019 12:29:04 +0800
Subject: [scikit-learn] Any clustering algo to cluster multiple timing series data?

Clustering algorithms group samples by calculating the Euclidean distance.
I wonder if any clustering algo can cluster timing series data by its shape
rather than its magnitude.

Ex: every item has its sold numbers for every day:

Item,Day1,Day2,Day3,Day4,Day5
A,1,5,1,5,1
B,10,50,10,50,10
C,4,70,30,10,50

The day-to-day change ratios of A and B are identical (500%, 20%, 500%,
20%), so I want A & B in the same cluster and C in another one.

If I don't want to compute the change ratio of each sample myself, is there
any way to cluster by the change ratio of the samples?

thx

From mbrynildsen at grundfos.com  Thu Jan 17 02:05:25 2019
From: mbrynildsen at grundfos.com (Mikkel Haggren Brynildsen)
Date: Thu, 17 Jan 2019 07:05:25 +0000
Subject: [scikit-learn] Any clustering algo to cluster multiple timing series data?

What about dynamic time warping?

Sent from my iPhone

> On Jan 17, 2019, at 05:31, lampahome wrote:
>
> I wonder if any clustering algo can cluster timing series data by its
> shape rather than its magnitude. The day-to-day change ratios of A and B
> are identical, so I want A & B in the same cluster and C in another one.
From pahome.chen at mirlab.org  Thu Jan 17 02:45:11 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Thu, 17 Jan 2019 15:45:11 +0800
Subject: [scikit-learn] Any clustering algo to cluster multiple timing series data?

Mikkel Haggren Brynildsen wrote on Thu, Jan 17, 2019 at 3:07 PM:
> What about dynamic time warping?

I thought DTW is used for two series of different lengths, but all my
series have the same length. Maybe it doesn't apply?

From mbrynildsen at grundfos.com  Thu Jan 17 02:58:39 2019
From: mbrynildsen at grundfos.com (Mikkel Haggren Brynildsen)
Date: Thu, 17 Jan 2019 07:58:39 +0000
Subject: [scikit-learn] Any clustering algo to cluster multiple timing series data?

You can use it to get a single similarity / closeness number between two
time series and then feed that into a clustering algorithm.

For instance, look at
https://github.com/markdregan/K-Nearest-Neighbors-with-Dynamic-Time-Warping
as a first idea: if you expand the distance function

d = lambda x, y: abs(x - y)

to a multivariate local distance

d2 = lambda a, b: np.sqrt(float((a[0] - b[0])**2 + (a[1] - b[1])**2))

(or any other n-dimensional metric), then you have an algorithm that can
cluster the time series.

It also works when the time series are of equal length.

Best
Mikkel Brynildsen

From: lampahome
Sent: 17 January 2019 08:45

> I thought DTW is used for two series of different lengths, but all my
> series have the same length. Maybe it doesn't apply?

From alexandre.gramfort at inria.fr  Thu Jan 17 03:53:35 2019
From: alexandre.gramfort at inria.fr (Alexandre Gramfort)
Date: Thu, 17 Jan 2019 09:53:35 +0100
Subject: [scikit-learn] Any clustering algo to cluster multiple timing series data?

you can have a look at: https://tslearn.readthedocs.io/en/latest/

Alex

> On Thu, Jan 17, 2019 at 9:01 AM Mikkel Haggren Brynildsen wrote:
>
> You can use it to get a single similarity / closeness number between two
> time series and then feed that into a clustering algorithm.
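To make the precomputed-distance route concrete, a rough sketch (a naive
DTW, quadratic per pair, fine for short series; tslearn ships optimized
versions). Note that DTW on the raw values will not put A and B together
here; for the ratio-based grouping you would rescale each series first, as
discussed further down the thread:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def dtw(a, b):
    # naive dynamic-time-warping distance between two 1-D series
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

series = np.array([[1, 5, 1, 5, 1],
                   [10, 50, 10, 50, 10],
                   [4, 70, 30, 10, 50]], dtype=float)
dist = np.array([[dtw(s, t) for t in series] for s in series])

# 'precomputed' lets any pairwise distance matrix drive the clustering
labels = AgglomerativeClustering(n_clusters=2, affinity='precomputed',
                                 linkage='average').fit_predict(dist)
print(labels)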
From t3kcit at gmail.com  Fri Jan 18 12:18:52 2019
From: t3kcit at gmail.com (Andreas Mueller)
Date: Fri, 18 Jan 2019 12:18:52 -0500
Subject: [scikit-learn] Scipy 2019 Tutorial

Hey Folks.

The scipy tutorial chairs just pinged me about submitting a tutorial.
I'm planning to, and wanted to ask if anyone is interested in co-teaching
with me. I might transition from the "scipy tutorial" materials (evolved
over maybe 5 years) to my own materials, but not sure yet.
Nicolas said he'd potentially be interested but I wanted to ask around who
else is coming and might be interested.

Cheers,
Andy

From stefanv at berkeley.edu  Fri Jan 18 12:56:09 2019
From: stefanv at berkeley.edu (Stefan van der Walt)
Date: Fri, 18 Jan 2019 09:56:09 -0800
Subject: [scikit-learn] ANN: scikit-image 0.14.2
Message-ID: <20190118175609.yiiis7w4v6gjpo3n@carbo>

Announcement: scikit-image 0.14.2
=================================

This release handles an incompatibility between scikit-image and NumPy
1.16.0, released on January 13th 2019.

It contains the following changes from 0.14.1:

API changes
-----------
- ``skimage.measure.regionprops`` no longer removes singleton dimensions
  from label images (#3284). To recover the old behavior, replace
  ``regionprops(label_image)`` calls with
  ``regionprops(np.squeeze(label_image))``

Bug fixes
---------
- Address deprecation of NumPy ``_validate_lengths`` (backport of #3556)
- Correctly handle the maximum number of lines in Hough transforms
  (backport of #3514)
- Correctly implement early stopping criterion for rank kernel noise
  filter (backport of #3503)
- Fix ``skimage.measure.regionprops`` for 1x1 inputs (backport of #3284)

Enhancements
------------
- Rewrite of ``local_maxima`` with flood-fill (backport of #3022, #3447)

Build Process & Testing
-----------------------
- Dedicate a ``--pre`` build in appveyor (backport of #3222)
- Avoid Travis-CI failure regarding ``skimage.lookfor`` (backport of #3477)
- Stop using the ``pytest.fixtures`` decorator (#3558)
- Filter out DeprecationPendingWarning for matrix subclass (#3637)
- Fix matplotlib test warnings and circular import (#3632)

Contributors & Reviewers
------------------------
- François Boulogne
- Emmanuelle Gouillart
- Lars Grüter
- Mark Harfouche
- Juan Nunez-Iglesias
- Egor Panfilov
- Stefan van der Walt
From hamidizade.s at gmail.com  Sun Jan 20 12:01:21 2019
From: hamidizade.s at gmail.com (S Hamidizade)
Date: Sun, 20 Jan 2019 20:31:21 +0330
Subject: [scikit-learn] Imblearn: SMOTENC

Dear scikit-learners,

Hi. I would greatly appreciate it if you could let me know how to use
SMOTENC. I wrote:

num_indices1 = list(X.iloc[:, np.r_[0:94, 95, 97, 100:123]].columns.values)
cat_indices1 = list(X.iloc[:, np.r_[94, 96, 98, 99, 123:160]].columns.values)
print(len(num_indices1))
print(len(cat_indices1))

pipeline = Pipeline(steps=[
    # categorical features
    ('feature_processing', FeatureUnion(transformer_list=[
        ('categorical', MultiColumn(cat_indices1)),
        # numeric
        ('numeric', Pipeline(steps=[
            ('select', MultiColumn(num_indices1)),
            ('scale', StandardScaler())
        ]))
    ])),
    ('clf', rg)
])

As indicated, I have 5 categorical features. Actually, indices 123 to 160
are related to one categorical feature with 37 possible values, which was
converted into 37 columns using get_dummies.

I think SMOTENC should be inserted before the classifier ('clf', rg), but I
don't know how to define "categorical_features" in SMOTENC. Besides, could
you please let me know where to use imblearn.pipeline?

Thanks in advance.
Best regards,

From g.lemaitre58 at gmail.com  Mon Jan 21 05:54:01 2019
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Mon, 21 Jan 2019 11:54:01 +0100
Subject: [scikit-learn] Imblearn: SMOTENC

SMOTENC will internally one-hot encode the categorical features, generate
new samples, and finally decode. So you need to do something like:

from imblearn.pipeline import make_pipeline, Pipeline

num_indices1 = list(X.iloc[:, np.r_[0:94, 95, 97, 100:123]].columns.values)
cat_indices1 = list(X.iloc[:, np.r_[94, 96, 98, 99, 123:160]].columns.values)

pipeline = Pipeline(steps=[...])  # the same pipeline as above

pipeline_with_resampling = make_pipeline(
    SMOTENC(categorical_features=cat_indices1), pipeline)

From pahome.chen at mirlab.org  Mon Jan 21 05:56:36 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Mon, 21 Jan 2019 18:56:36 +0800
Subject: [scikit-learn] Any clustering algo to cluster multiple timing series data?

How about scaling the data first with MinMaxScaler and then clustering?

What I thought is that scaling maps each series into the 0~1 range, so the
absolute quantity of each series is ignored. After scaling, what remains is
the increase/decrease pattern between the points. Then clustering by
Euclidean distance should work?
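On the toy data from this thread, that idea checks out; one caveat is that
MinMaxScaler scales per column (feature), so to rescale each series
independently you transform the transposed array. A sketch:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

series = np.array([[1, 5, 1, 5, 1],
                   [10, 50, 10, 50, 10],
                   [4, 70, 30, 10, 50]], dtype=float)

# MinMaxScaler works column-wise, so transpose to map each *series* into [0, 1]
scaled = MinMaxScaler().fit_transform(series.T).T
print(scaled[0], scaled[1])  # A and B become identical: [0. 1. 0. 1. 0.]

labels = KMeans(n_clusters=2, random_state=0).fit_predict(scaled)
print(labels)                # A and B share a label; C gets the other one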
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From lope at usal.es  Tue Jan 22 04:55:44 2019
From: lope at usal.es (Daniel López-Sánchez)
Date: Tue, 22 Jan 2019 10:55:44 +0100
Subject: [scikit-learn] PR #13003: [MRG] Add Tensor Sketch algorithm to
 Kernel Approximation module
Message-ID: 

Dear all,

I recently posted a PR which adds the Tensor Sketch algorithm [1] to the
Kernel Approximation module of Scikit-learn.

I believe this new feature makes the Kernel Approximation module more
complete by providing a data-independent method for polynomial kernel
approximation, as the currently included methods either require access to
training data (Nystroem) or do not work with polynomial kernels. The
implementation has been tested to provide the same results as the original
Matlab implementation provided by the author of [1].

I would appreciate any feedback you can provide,

Regards,

[1] Pham, N., & Pagh, R. (2013, August). Fast and scalable polynomial
kernels via explicit feature maps. In Proceedings of the 19th ACM SIGKDD
international conference on Knowledge discovery and data mining (pp.
239-247). ACM.

Daniel López Sánchez
lope at usal.es / (+34) 687174328

BISITE Research Group (http://bisite.usal.es)
Edificio I+D+i Universidad de Salamanca, C/ Espejo S/N, 37007
Salamanca, Spain
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From adrin.jalali at gmail.com  Tue Jan 22 05:02:06 2019
From: adrin.jalali at gmail.com (Adrin)
Date: Tue, 22 Jan 2019 11:02:06 +0100
Subject: [scikit-learn] PR #13003: [MRG] Add Tensor Sketch algorithm to
 Kernel Approximation module
In-Reply-To: 
References: 
Message-ID: 

Hi Daniel,

Thanks for the note, but sometimes there can be quite some delay in us
reviewing a PR, and discussion about a PR should best happen on the PR
itself.

Best,
Adrin.

On Tue, 22 Jan 2019 at 10:57 Daniel López-Sánchez <lope at usal.es> wrote:

> Dear all,
>
> I recently posted a PR which adds the Tensor Sketch algorithm [1] to the
> Kernel Approximation module of Scikit-learn.
>
> I believe this new feature makes the Kernel Approximation module more
> complete by providing a data-independent method for polynomial kernel
> approximation, as the currently included methods either require access to
> training data (Nystroem) or do not work with polynomial kernels. The
> implementation has been tested to provide the same results as the original
> Matlab implementation provided by the author of [1].
>
> I would appreciate any feedback you can provide,
>
> Regards,
>
> [1] Pham, N., & Pagh, R. (2013, August). Fast and scalable polynomial
> kernels via explicit feature maps. In Proceedings of the 19th ACM SIGKDD
> international conference on Knowledge discovery and data mining (pp.
> 239-247). ACM.
>
> Daniel López Sánchez
> lope at usal.es / (+34) 687174328
>
> BISITE Research Group (http://bisite.usal.es)
> Edificio I+D+i Universidad de Salamanca, C/ Espejo S/N, 37007
> Salamanca, Spain
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From pahome.chen at mirlab.org  Wed Jan 23 05:35:57 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Wed, 23 Jan 2019 18:35:57 +0800
Subject: [scikit-learn] Affinity Propagation is the best algo for without
 choosing the number of cluster?
Message-ID: 

I'm searching for a clustering algo which groups the data without having to
choose the number of groups. I found affinity propagation (AP), which
doesn't need the number of clusters.

In my experiments, AP clusters well without choosing any parameters. But
I'm not sure whether there are corner cases which would make the clustering
worse.

Has anyone tried AP and found some side effect, or a way to tune its
parameters?
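For reference, this is the kind of experiment I ran (toy data and made-up
values), where `preference` and `damping` seem to be the knobs that control
how many clusters come out:

import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# a lower (more negative) preference yields fewer exemplars/clusters
for pref in (None, -50, -500):
    ap = AffinityPropagation(preference=pref, damping=0.9).fit(X)
    print(pref, len(ap.cluster_centers_indices_))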
thx
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ndbecker2 at gmail.com  Wed Jan 23 13:26:44 2019
From: ndbecker2 at gmail.com (Neal Becker)
Date: Wed, 23 Jan 2019 13:26:44 -0500
Subject: [scikit-learn] affinity propagation not giving desired answer
Message-ID: 

I am not too familiar with affinity propagation, but am just trying it out.
The problem is to cluster using a distance metric that is Euclidean
distance, but with a limit. When the distance is greater than some
threshold, then the metric is -Inf. In other words, a point can be accepted
into a cluster only if the distance from the point to the cluster center is
less than some threshold.

It seems my test with affinity propagation will sometimes produce a correct
result, but other times the result seems to violate the condition. In the
example code, a couple of outlier points seem to be in clusters that are
not close at all. I've tried playing with parameters (such as preference)
without eliminating the problem. Any suggestions?

---------
import numpy as np
from sklearn.cluster import AffinityPropagation
# from randomgen import RandomGenerator, Xoroshiro128
# rs = RandomGenerator (Xoroshiro128 (0))
from numpy.random import RandomState
rs = RandomState(3)
pts = rs.uniform (-5, 5, (50,2))
import seaborn as sns
import matplotlib.pyplot as plt

def distance (ax, ay, bx, by):
    d = (ax - bx)**2 + (ay - by)**2
    if d > 1:
        return -1e6
    else:
        return -d

d = np.empty ((pts.shape[0], pts.shape[0]))
for i in range(pts.shape[0]):
    for j in range(pts.shape[0]):
        d[i,j] = distance(pts[i,0], pts[i,1], pts[j,0], pts[j,1])

preference = -20 #np.mean (d[d > -1e6])
print ('preference:', preference)

clustering = AffinityPropagation(affinity='precomputed', verbose=True,
                                 preference=preference)
res = clustering.fit(d)
c = clustering

colors = np.array(sns.color_palette("hls", np.max(c.labels_)+1))
print('n_clusters:', np.max(c.labels_)+1)
centers = pts[c.cluster_centers_indices_]
plt.scatter (pts[:,0], pts[:,1], c=colors[c.labels_])
plt.scatter (centers[:,0], centers[:,1], marker='X', s=100, c=colors)
plt.show()

From ndbecker2 at gmail.com  Wed Jan 23 15:01:13 2019
From: ndbecker2 at gmail.com (Neal Becker)
Date: Wed, 23 Jan 2019 15:01:13 -0500
Subject: [scikit-learn] cluster.affinity_propagation doesn't accept sparse?
Message-ID: 

It would appear that affinity_propagation accepts sparse similarity input:

X = check_array(X, accept_sparse='csr')

But if I try it, I get:

~/.local/lib/python3.7/site-packages/sklearn/cluster/affinity_propagation_.py in affinity_propagation(S, preference, convergence_iter, max_iter, damping, copy, verbose, return_n_iter)
    137
    138     # Place preference on the diagonal of S
--> 139     S.flat[::(n_samples + 1)] = preference
    140
    141     A = np.zeros((n_samples, n_samples))

~/.local/lib/python3.7/site-packages/scipy/sparse/base.py in __getattr__(self, attr)
    687             return self.getnnz()
    688         else:
--> 689             raise AttributeError(attr + " not found")
    690
    691     def transpose(self, axes=None, copy=False):

AttributeError: flat not found

From hamidizade.s at gmail.com  Thu Jan 24 01:09:55 2019
From: hamidizade.s at gmail.com (S Hamidizade)
Date: Thu, 24 Jan 2019 09:39:55 +0330
Subject: [scikit-learn] Imblearn: SMOTENC
In-Reply-To: 
References: 
Message-ID: 

Dear Mr. Lemaitre

Thanks a lot for sharing your time and knowledge. Unfortunately, it throws
the following error:

Traceback (most recent call last):
  File "D:/mifs-master_2/MU/learning-from-imbalanced-classes-master/learning-from-imbalanced-classes-master/continuous/Final Logit/SMOTENC/logit-final - Copy.py", line 419, in <module>
    pipeline_with_resampling = make_pipeline(SMOTENC(categorical_features=cat_indices1), pipeline)
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 594, in make_pipeline
    return Pipeline(_name_estimators(steps), memory=memory)
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 119, in __init__
    self._validate_steps()
  File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\sklearn\pipeline.py", line 167, in _validate_steps
    " '%s' (type %s) doesn't" % (t, type(t)))
TypeError: All intermediate steps should be transformers and implement fit
and transform. 'SMOTENC(categorical_features=['x95', 'x97', 'x99', 'x100',
'x121_1', 'x121_2', 'x121_3', 'x121_4', 'x121_5', 'x121_6', 'x121_7',
'x121_8', 'x121_9', 'x121_10', 'x121_11', 'x121_12', 'x121_13', 'x121_14',
'x121_15', 'x121_16', 'x121_17', 'x121_18', 'x121_19', 'x121_20',
'x121_21', 'x121_22', 'x121_23', 'x121_24', 'x121_25', 'x121_26',
'x121_27', 'x121_28', 'x121_29', 'x121_30', 'x121_31', 'x121_32',
'x121_33', 'x121_34', 'x121_35', 'x121_36', 'x121_37'], k_neighbors=5,
n_jobs=1, random_state=None, sampling_strategy='auto')' (type ) doesn't

Thanks in advance.
Best regards,

On Mon, Jan 21, 2019 at 2:26 PM Guillaume Lemaître <g.lemaitre58 at gmail.com> wrote:

> SMOTENC will internally one hot encode the features, generate new
> features, and finally decode. So you need to do something like:
>
> from imblearn.pipeline import make_pipeline, Pipeline
>
> num_indices1 = list(X.iloc[:,np.r_[0:94,95,97,100:123]].columns.values)
> cat_indices1 = list(X.iloc[:,np.r_[94,96,98,99,123:160]].columns.values)
> print(len(num_indices1))
> print(len(cat_indices1))
>
> pipeline=Pipeline(steps= [
>     # Categorical features
>     ('feature_processing', FeatureUnion(transformer_list = [
>             ('categorical', MultiColumn(cat_indices1)),
>
>             #numeric
>             ('numeric', Pipeline(steps = [
>                 ('select', MultiColumn(num_indices1)),
>                 ('scale', StandardScaler())
>             ]))
>         ])),
>     ('clf', rg)
>     ]
> )
>
> pipeline_with_resampling = make_pipeline(SMOTENC(categorical_features=cat_indices1), pipeline)
>
> On Sun, 20 Jan 2019 at 18:05, S Hamidizade <hamidizade.s at gmail.com> wrote:
>
>> Dear Scikit-learners
>> Hi.
>>
>> I would greatly appreciate it if you could let me know how to use
>> SMOTENC. I wrote:
>>
>> num_indices1 = list(X.iloc[:,np.r_[0:94,95,97,100:123]].columns.values)
>> cat_indices1 = list(X.iloc[:,np.r_[94,96,98,99,123:160]].columns.values)
>> print(len(num_indices1))
>> print(len(cat_indices1))
>>
>> pipeline=Pipeline(steps= [
>>     # Categorical features
>>     ('feature_processing', FeatureUnion(transformer_list = [
>>             ('categorical', MultiColumn(cat_indices1)),
>>
>>             #numeric
>>             ('numeric', Pipeline(steps = [
>>                 ('select', MultiColumn(num_indices1)),
>>                 ('scale', StandardScaler())
>>             ]))
>>         ])),
>>     ('clf', rg)
>>     ]
>> )
>>
>> Therefore, as indicated, I have 5 categorical features. Really, indices
>> 123 to 160 are related to one categorical feature with 37 possible values
>> which is converted into 37 columns using get_dummies.
>> Sorry, I think SMOTENC should be inserted before the classifier ('clf',
>> rg) but I don't know how to define "categorical_features" in SMOTENC.
>> Besides, could you please let me know where to use imblearn.pipeline?
>>
>> Thanks in advance.
>> Best regards,
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From g.lemaitre58 at gmail.com  Thu Jan 24 02:04:33 2019
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Thu, 24 Jan 2019 08:04:33 +0100
Subject: [scikit-learn] Imblearn: SMOTENC
In-Reply-To: 
Message-ID: <8lp16dn7dcdhmc9ec970igje.1548313473132@gmail.com>

As stated in the doc, categorical_features takes the indices of the
categorical columns, not the column names. This is similar to the one hot
encoder API.
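If it helps, the difference in a couple of lines (reusing the index layout
from the earlier posts -- illustrative only):

import numpy as np
from imblearn.over_sampling import SMOTENC

# positional column indices (ints), not column names (strings)
cat_indices1 = [int(i) for i in np.r_[94, 96, 98, 99, 123:160]]
smote = SMOTENC(categorical_features=cat_indices1)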
Sent from my phone - sorry to be brief and potentially misspell.

From pahome.chen at mirlab.org  Thu Jan 24 04:13:18 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Thu, 24 Jan 2019 17:13:18 +0800
Subject: [scikit-learn] How to determine suitable cluster algo
Message-ID: 

I want to build a customized clustering procedure for my datasets, because
I don't want to try every algo and its hyperparameters by hand. I thought I
would just define default ranges for the important hyperparameters, e.g.
the number of clusters in K-means.

I want to iterate over some possible clustering algos like K-means, DBSCAN,
AP... etc., and choose the most suitable algo for the clustering.

I'm not sure whether that is possible, but does GridSearchCV work for me?
Or is there any other way to determine that?

thx
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From matti.v.viljamaa at gmail.com  Thu Jan 24 04:42:12 2019
From: matti.v.viljamaa at gmail.com (Matti Viljamaa)
Date: Thu, 24 Jan 2019 11:42:12 +0200
Subject: [scikit-learn] How to determine suitable cluster algo
In-Reply-To: 
References: 
Message-ID: <5c498874.1c69fb81.a65c3.68df@mx.google.com>

GridSearchCV is meant for tuning the hyperparameters of a model over some
ranges of configurations and parameter values. Like the documentation
explains:

https://scikit-learn.org/stable/modules/grid_search.html

(and it also has some examples)
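For the "iterate over several algos" part: since clustering has no ground
truth to score against, a plain loop with an unsupervised metric such as
the silhouette coefficient may be simpler than GridSearchCV. A rough sketch
(toy data; the candidate list and parameter values are made up):

import numpy as np
from sklearn.cluster import KMeans, DBSCAN, AffinityPropagation
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

candidates = {
    'kmeans': KMeans(n_clusters=4, random_state=0),
    'dbscan': DBSCAN(eps=0.8),
    'ap': AffinityPropagation(),
}
for name, algo in candidates.items():
    labels = algo.fit_predict(X)
    if len(np.unique(labels)) > 1:  # silhouette needs >= 2 clusters
        print(name, silhouette_score(X, labels))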
The (e.g. 10-fold) cross-validation as a measure of accuracy (how
accurately the different folds attain the value of the statistic) and of
generalization (that the accuracy remains similar between folds) is at
least what I'm taught at uni.

A greater problem is how one can decide what parameters, or e.g. parameter
ranges, to look for. Some float-valued parameters might have ranges that
are "more often used", while others may not work most of the time.
Additionally, e.g. the kernels and such have some variants with more
general robustness, while others may become computationally very expensive
when combined with certain other parameters (such as in MLPClassifier,
where some activation functions and hidden_layer_sizes may correlate with
increased computation cost while not necessarily increasing accuracy).

The best I've figured out so far is to start with a few of the most often
used / major parameters and try to get them to produce results that are as
accurate as possible within still-affordable computation time. Only after
that, consider adding more params.

However, I've not found much info regarding how the parameters of different
methods are ordered in terms of "significance". One could assume that the
preceding ones are more major than the following ones. However, some of the
parameters also clearly "correlate" with each other, so they have
cross-effects on accuracy etc. Best is probably to just start trying, and
then perhaps write it down if you notice general patterns as to what works.

There's also:
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
for designing "pipelines", or a sort of "Design of Experiments" on sklearn
algos.

Also found this:
https://towardsdatascience.com/design-your-engineering-experiment-plan-with-a-simple-python-command-35a6ba52fa35
but I have not tried it, nor do I know whether it's necessary.

BR,
Matti

Sent from Mail for Windows 10

From: lampahome
Sent: Thursday, 24 January 2019 11.14
To: Scikit-learn mailing list
Subject: [scikit-learn] How to determine suitable cluster algo

I want to build a customized clustering procedure for my datasets, because
I don't want to try every algo and its hyperparameters by hand. I thought I
would just define default ranges for the important hyperparameters, e.g.
the number of clusters in K-means.

I want to iterate over some possible clustering algos like K-means, DBSCAN,
AP... etc., and choose the most suitable algo for the clustering.

I'm not sure whether that is possible, but does GridSearchCV work for me?
Or is there any other way to determine that?

thx
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From hamidizade.s at gmail.com  Thu Jan 24 10:17:46 2019
From: hamidizade.s at gmail.com (S Hamidizade)
Date: Thu, 24 Jan 2019 18:47:46 +0330
Subject: [scikit-learn] Imblearn: SMOTENC
In-Reply-To: 
References: 
Message-ID: 

Thanks. Unfortunately, now the error is:

ValueError: Some of the categorical indices are out of range. Indices
should be between 0 and 160.

Best regards,

On Sun, Jan 20, 2019 at 8:31 PM S Hamidizade <hamidizade.s at gmail.com> wrote:

> Dear Scikit-learners
> Hi.
>
> I would greatly appreciate it if you could let me know how to use SMOTENC.
> I wrote:
>
> num_indices1 = list(X.iloc[:,np.r_[0:94,95,97,100:123]].columns.values)
> cat_indices1 = list(X.iloc[:,np.r_[94,96,98,99,123:160]].columns.values)
> print(len(num_indices1))
> print(len(cat_indices1))
>
> pipeline=Pipeline(steps= [
>     # Categorical features
>     ('feature_processing', FeatureUnion(transformer_list = [
>             ('categorical', MultiColumn(cat_indices1)),
>
>             #numeric
>             ('numeric', Pipeline(steps = [
>                 ('select', MultiColumn(num_indices1)),
>                 ('scale', StandardScaler())
>             ]))
>         ])),
>     ('clf', rg)
>     ]
> )
>
> Therefore, as indicated, I have 5 categorical features. Really, indices
> 123 to 160 are related to one categorical feature with 37 possible values
> which is converted into 37 columns using get_dummies.
> Sorry, I think SMOTENC should be inserted before the classifier ('clf',
> rg) but I don't know how to define "categorical_features" in SMOTENC.
> Besides, could you please let me know where to use imblearn.pipeline?
>
> Thanks in advance.
> Best regards,
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From g.lemaitre58 at gmail.com  Thu Jan 24 10:43:04 2019
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Thu, 24 Jan 2019 16:43:04 +0100
Subject: [scikit-learn] Imblearn: SMOTENC
In-Reply-To: 
References: 
Message-ID: 

You should open a ticket on the imbalanced-learn GitHub issue tracker. That
makes it easier to post a reproducible example and for us to test it.
From the error message, I understand that you have 161 features and are
requiring a feature above the index 160.

On Thu, 24 Jan 2019 at 16:19, S Hamidizade <hamidizade.s at gmail.com> wrote:

> Thanks. Unfortunately, now the error is:
> ValueError: Some of the categorical indices are out of range. Indices
> should be between 0 and 160.
> Best regards,
>
> On Sun, Jan 20, 2019 at 8:31 PM S Hamidizade <hamidizade.s at gmail.com>
> wrote:
>
>> Dear Scikit-learners
>> Hi.
>>
>> I would greatly appreciate it if you could let me know how to use
>> SMOTENC. I wrote:
>>
>> num_indices1 = list(X.iloc[:,np.r_[0:94,95,97,100:123]].columns.values)
>> cat_indices1 = list(X.iloc[:,np.r_[94,96,98,99,123:160]].columns.values)
>> print(len(num_indices1))
>> print(len(cat_indices1))
>>
>> pipeline=Pipeline(steps= [
>>     # Categorical features
>>     ('feature_processing', FeatureUnion(transformer_list = [
>>             ('categorical', MultiColumn(cat_indices1)),
>>
>>             #numeric
>>             ('numeric', Pipeline(steps = [
>>                 ('select', MultiColumn(num_indices1)),
>>                 ('scale', StandardScaler())
>>             ]))
>>         ])),
>>     ('clf', rg)
>>     ]
>> )
>>
>> Therefore, as indicated, I have 5 categorical features. Really, indices
>> 123 to 160 are related to one categorical feature with 37 possible values
>> which is converted into 37 columns using get_dummies.
>> Sorry, I think SMOTENC should be inserted before the classifier ('clf',
>> rg) but I don't know how to define "categorical_features" in SMOTENC.
>> Besides, could you please let me know where to use imblearn.pipeline?
>>
>> Thanks in advance.
>> Best regards,
>>

-- 
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From pahome.chen at mirlab.org  Thu Jan 24 20:40:41 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Fri, 25 Jan 2019 09:40:41 +0800
Subject: [scikit-learn] How to determine suitable cluster algo
In-Reply-To: <5c498874.1c69fb81.a65c3.68df@mx.google.com>
References: <5c498874.1c69fb81.a65c3.68df@mx.google.com>
Message-ID: 

Maybe the suitable way is trial-and-error?

What concerns me is that my dataset is very large, and I can't try numbers
of clusters from 1 to N if I have N samples. That costs too much time.

Maybe I should choose the initial number of clusters based on execution
time, and then analyze whether the next step is to increase or decrease
the number of clusters?

thx
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From matti.v.viljamaa at gmail.com  Fri Jan 25 06:43:35 2019
From: matti.v.viljamaa at gmail.com (Matti Viljamaa)
Date: Fri, 25 Jan 2019 13:43:35 +0200
Subject: [scikit-learn] How to determine suitable cluster algo
In-Reply-To: 
References: <5c498874.1c69fb81.a65c3.68df@mx.google.com>
Message-ID: <5c4af668.1c69fb81.ee649.a884@mx.google.com>

For determining what one can afford computationally, see e.g.:

https://stackoverflow.com/questions/22443041/predicting-how-long-an-scikit-learn-classification-will-take-to-run
https://www.reddit.com/r/scikit_learn/comments/a746h0/is_there_any_way_to_estimate_how_long_a_given/

Sent from Mail for Windows 10

From: lampahome
Sent: Friday, 25 January 2019 3.42
To: Scikit-learn mailing list
Subject: Re: [scikit-learn] How to determine suitable cluster algo

Maybe the suitable way is trial-and-error?

What concerns me is that my dataset is very large, and I can't try numbers
of clusters from 1 to N if I have N samples. That costs too much time.

Maybe I should choose the initial number of clusters based on execution
time, and then analyze whether the next step is to increase or decrease
the number of clusters?

thx
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From liam at chatdesk.com  Fri Jan 25 12:26:37 2019
From: liam at chatdesk.com (Liam Geron)
Date: Fri, 25 Jan 2019 12:26:37 -0500
Subject: [scikit-learn] Google Cloud ML Error
Message-ID: 

Hi scikit learn contributors,

I am currently attempting to transfer our preexisting models into cloud ML
for scalability; however, I am encountering bugs while running through some
tutorial code found here
(https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/sklearn/notebooks/Online%20Prediction%20with%20scikit-learn.ipynb).

On both my local machine in a virtual environment and on the cloud shell,
I'm encountering errors when it comes to version creation and online
prediction. For version creation on my local machine and on the cloud shell
I'm encountering this error:

"ERROR: (gcloud.ml-engine.versions.create) Bad model detected with error:
"Failed to load model: Could not load the model:
/tmp/model/0001/model.joblib. 32. (Error code: 0)""

with Python 3.6.4 (local) and Python 3.5.6 (cloud shell) when running the
command:

gcloud ml-engine versions create $VERSION_NAME \
    --model $MODEL_NAME \
    --config config.yaml

Any help would be greatly appreciated.
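For reference, the export step that produced that file looks essentially
like this on our end (the toy model and bucket path here are placeholders):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
import joblib

X, y = load_iris(return_X_y=True)
model = LogisticRegression().fit(X, y)

# ML Engine expects the artifact to be named exactly 'model.joblib'
joblib.dump(model, 'model.joblib')
# afterwards: gsutil cp model.joblib gs://<your-bucket>/model/model.joblib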
Thank you,
Liam Geron
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ross at cgl.ucsf.edu  Fri Jan 25 13:24:03 2019
From: ross at cgl.ucsf.edu (Bill Ross)
Date: Fri, 25 Jan 2019 10:24:03 -0800
Subject: [scikit-learn] Google Cloud ML Error
In-Reply-To: 
References: 
Message-ID: <196fae0d-33dd-4f98-4355-7dfaae383971@cgl.ucsf.edu>

Dumb generic cross-check from supporting compchem code in the day: what do
these give? Might yield a clue, e.g. all model files seeing this got
corrupted somehow.

$ file /tmp/model/0001/model.joblib
$ ls -l /tmp/model/0001/model.joblib

On 1/25/19 9:26 AM, Liam Geron wrote:
> Hi scikit learn contributors,
>
> I am currently attempting to transfer our preexisting models into
> cloud ML for scalability; however, I am encountering bugs while running
> through some tutorial code found here
> (https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/sklearn/notebooks/Online%20Prediction%20with%20scikit-learn.ipynb).
>
> On both my local machine in a virtual environment and on the cloud
> shell, I'm encountering errors when it comes to version creation and
> online prediction. For version creation on my local machine and on the
> cloud shell I'm encountering this error:
>
> "ERROR: (gcloud.ml-engine.versions.create) Bad model detected with
> error: "Failed to load model: Could not load the model:
> /tmp/model/0001/model.joblib. 32. (Error code: 0)""
>
> with Python 3.6.4 (local) and Python 3.5.6 (cloud shell) when running
> the command:
>
> gcloud ml-engine versions create $VERSION_NAME \
>     --model $MODEL_NAME \
>     --config config.yaml
>
> Any help would be greatly appreciated.
>
> Thank you,
> Liam Geron
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From liam at chatdesk.com  Fri Jan 25 13:54:21 2019
From: liam at chatdesk.com (Liam Geron)
Date: Fri, 25 Jan 2019 13:54:21 -0500
Subject: [scikit-learn] Google Cloud ML Error
In-Reply-To: <196fae0d-33dd-4f98-4355-7dfaae383971@cgl.ucsf.edu>
References: <196fae0d-33dd-4f98-4355-7dfaae383971@cgl.ucsf.edu>
Message-ID: 

No such luck, the file doesn't seem to exist. Here's the output on my
local:

"ls: /tmp/model/0001/model.joblib: No such file or directory"

and

"/tmp/model/0001/model.joblib: cannot open `/tmp/model/0001/model.joblib'
(No such file or directory)"

and on the cloud shell:

"ls: cannot access '/tmp/model/0001/model.joblib': No such file or
directory"

and

"/bin/sh: 1: file: not found".

On Fri, Jan 25, 2019 at 1:29 PM Bill Ross <ross at cgl.ucsf.edu> wrote:

> Dumb generic cross-check from supporting compchem code in the day: what
> do these give? Might yield a clue, e.g. all model files seeing this got
> corrupted somehow.
>
> $ file /tmp/model/0001/model.joblib
> $ ls -l /tmp/model/0001/model.joblib
>
> On 1/25/19 9:26 AM, Liam Geron wrote:
>> Hi scikit learn contributors,
>>
>> I am currently attempting to transfer our preexisting models into cloud
>> ML for scalability; however, I am encountering bugs while running
>> through some tutorial code found here
>> (https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/sklearn/notebooks/Online%20Prediction%20with%20scikit-learn.ipynb).
>>
>> On both my local machine in a virtual environment and on the cloud
>> shell, I'm encountering errors when it comes to version creation and
>> online prediction. For version creation on my local machine and on the
>> cloud shell I'm encountering this error:
>>
>> "ERROR: (gcloud.ml-engine.versions.create) Bad model detected with
>> error: "Failed to load model: Could not load the model:
>> /tmp/model/0001/model.joblib. 32. (Error code: 0)""
>>
>> with Python 3.6.4 (local) and Python 3.5.6 (cloud shell) when running
>> the command:
>>
>> gcloud ml-engine versions create $VERSION_NAME \
>>     --model $MODEL_NAME \
>>     --config config.yaml
>>
>> Any help would be greatly appreciated.
>>
>> Thank you,
>> Liam Geron
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ross at cgl.ucsf.edu  Fri Jan 25 14:33:01 2019
From: ross at cgl.ucsf.edu (Bill Ross)
Date: Fri, 25 Jan 2019 11:33:01 -0800
Subject: [scikit-learn] Google Cloud ML Error
In-Reply-To: 
References: 
Message-ID: 

Have you updated the project since this:

Since joblib is involved here as well, I'd look at that checkin. Joblib
expects there to be a model, maybe it is just configured to look in the
wrong place.

On 1/25/19 10:54 AM, Liam Geron wrote:
> No such luck, the file doesn't seem to exist. Here's the output on my
> local:
>
> "ls: /tmp/model/0001/model.joblib: No such file or directory"
>
> and
>
> "/tmp/model/0001/model.joblib: cannot open `/tmp/model/0001/model.joblib'
> (No such file or directory)"
>
> and on the cloud shell:
>
> "ls: cannot access '/tmp/model/0001/model.joblib': No such file or
> directory"
>
> and
>
> "/bin/sh: 1: file: not found".
>
> On Fri, Jan 25, 2019 at 1:29 PM Bill Ross <ross at cgl.ucsf.edu> wrote:
>
>> Dumb generic cross-check from supporting compchem code in the day: what
>> do these give? Might yield a clue, e.g. all model files seeing this got
>> corrupted somehow.
>>
>> $ file /tmp/model/0001/model.joblib
>> $ ls -l /tmp/model/0001/model.joblib
>>
>> On 1/25/19 9:26 AM, Liam Geron wrote:
>>> Hi scikit learn contributors,
>>>
>>> I am currently attempting to transfer our preexisting models into
>>> cloud ML for scalability; however, I am encountering bugs while
>>> running through some tutorial code found here
>>> (https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/sklearn/notebooks/Online%20Prediction%20with%20scikit-learn.ipynb).
>>>
>>> On both my local machine in a virtual environment and on the cloud
>>> shell, I'm encountering errors when it comes to version creation and
>>> online prediction. For version creation on my local machine and on
>>> the cloud shell I'm encountering this error:
>>>
>>> "ERROR: (gcloud.ml-engine.versions.create) Bad model detected with
>>> error: "Failed to load model: Could not load the model:
>>> /tmp/model/0001/model.joblib. 32. (Error code: 0)""
>>>
>>> with Python 3.6.4 (local) and Python 3.5.6 (cloud shell) when running
>>> the command:
>>>
>>> gcloud ml-engine versions create $VERSION_NAME \
>>>     --model $MODEL_NAME \
>>>     --config config.yaml
>>>
>>> Any help would be greatly appreciated.
>>>
>>> Thank you,
>>> Liam Geron
>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bjpobekjinilbgej.png
Type: image/png
Size: 19872 bytes
Desc: not available
URL: 

From liam at chatdesk.com  Fri Jan 25 15:16:49 2019
From: liam at chatdesk.com (Liam Geron)
Date: Fri, 25 Jan 2019 15:16:49 -0500
Subject: [scikit-learn] Google Cloud ML Error
In-Reply-To: 
References: <196fae0d-33dd-4f98-4355-7dfaae383971@cgl.ucsf.edu>
Message-ID: 

As in updated the sklearn module or the joblib module? I'm currently
running sklearn on 0.19.1 and joblib on 0.13.1. Do I need to be running
them on a specific version?

On Fri, Jan 25, 2019 at 2:35 PM Bill Ross <ross at cgl.ucsf.edu> wrote:

> Have you updated the project since this:
>
> Since joblib is involved here as well, I'd look at that checkin. Joblib
> expects there to be a model, maybe it is just configured to look in the
> wrong place.
>
> On 1/25/19 10:54 AM, Liam Geron wrote:
>> No such luck, the file doesn't seem to exist. Here's the output on my
>> local:
>>
>> "ls: /tmp/model/0001/model.joblib: No such file or directory"
>>
>> and
>>
>> "/tmp/model/0001/model.joblib: cannot open
>> `/tmp/model/0001/model.joblib' (No such file or directory)"
>>
>> and on the cloud shell:
>>
>> "ls: cannot access '/tmp/model/0001/model.joblib': No such file or
>> directory"
>>
>> and
>>
>> "/bin/sh: 1: file: not found".
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bjpobekjinilbgej.png
Type: image/png
Size: 19872 bytes
Desc: not available
URL: 

From matti.v.viljamaa at gmail.com  Fri Jan 25 15:31:20 2019
From: matti.v.viljamaa at gmail.com (Matti Viljamaa)
Date: Fri, 25 Jan 2019 22:31:20 +0200
Subject: [scikit-learn] How to determine suitable cluster algo
In-Reply-To: <5c4af668.1c69fb81.ee649.a884@mx.google.com>
References: <5c498874.1c69fb81.a65c3.68df@mx.google.com>
 <5c4af668.1c69fb81.ee649.a884@mx.google.com>
Message-ID: <5c4b7219.1c69fb81.72c03.c685@mx.google.com>

Also, remember that some algos may exhibit "sweet spots" w.r.t. computation
time and gained accuracy.

So you might want to keep measuring "explained variance" while you add
complexity to your models, and then do plots of model complexity vs
explained variance.

E.g. in MLPClassifier you'd plot e.g. hidden layers against explained
variance to figure out where adding hidden layers starts to exhibit lesser
gain in explained variance.

Sent from Mail for Windows 10

From: Matti Viljamaa
Sent: Friday, 25 January 2019 13.43
To: Scikit-learn mailing list
Subject: Re: [scikit-learn] How to determine suitable cluster algo

For determining what one can afford computationally, see e.g.:

https://stackoverflow.com/questions/22443041/predicting-how-long-an-scikit-learn-classification-will-take-to-run
https://www.reddit.com/r/scikit_learn/comments/a746h0/is_there_any_way_to_estimate_how_long_a_given/

Sent from Mail for Windows 10

From: lampahome
Sent: Friday, 25 January 2019 3.42
To: Scikit-learn mailing list
Subject: Re: [scikit-learn] How to determine suitable cluster algo

Maybe the suitable way is trial-and-error?

What concerns me is that my dataset is very large, and I can't try numbers
of clusters from 1 to N if I have N samples. That costs too much time.

Maybe I should choose the initial number of clusters based on execution
time, and then analyze whether the next step is to increase or decrease
the number of clusters?

thx
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ross at cgl.ucsf.edu  Fri Jan 25 18:05:57 2019
From: ross at cgl.ucsf.edu (Bill Ross)
Date: Fri, 25 Jan 2019 15:05:57 -0800
Subject: [scikit-learn] Google Cloud ML Error
Message-ID: <8g2jw1kfh6uo8fntxyywcyn3.1548457557070@email.android.com>

I'm a kibitzer who never ran it myself, just a compulsive debugger looking
at a basic possibility.

Bill
-------- Original message --------
From: Liam Geron
Date: 01/25/2019 12:16 PM (GMT-08:00)
To: Scikit-learn mailing list
Subject: Re: [scikit-learn] Google Cloud ML Error

As in updated the sklearn module? I'm currently running sklearn on 0.19.1
and joblib on 0.13.1. Do I need to be running them on a specific version?

On Fri, Jan 25, 2019 at 2:35 PM Bill Ross <ross at cgl.ucsf.edu> wrote:

Have you updated the project since this:

Since joblib is involved here as well, I'd look at that checkin. Joblib
expects there to be a model, maybe it is just configured to look in the
wrong place.

On 1/25/19 10:54 AM, Liam Geron wrote:

No such luck, the file doesn't seem to exist. Here's the output on my
local:

"ls: /tmp/model/0001/model.joblib: No such file or directory"

and

"/tmp/model/0001/model.joblib: cannot open `/tmp/model/0001/model.joblib'
(No such file or directory)"

and on the cloud shell:

"ls: cannot access '/tmp/model/0001/model.joblib': No such file or
directory"

and

"/bin/sh: 1: file: not found".

On Fri, Jan 25, 2019 at 1:29 PM Bill Ross <ross at cgl.ucsf.edu> wrote:

Dumb generic cross-check from supporting compchem code in the day: what do
these give? Might yield a clue, e.g. all model files seeing this got
corrupted somehow.

$ file /tmp/model/0001/model.joblib
$ ls -l /tmp/model/0001/model.joblib

On 1/25/19 9:26 AM, Liam Geron wrote:

Hi scikit learn contributors,

I am currently attempting to transfer our preexisting models into cloud ML
for scalability; however, I am encountering bugs while running through some
tutorial code found here
(https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/sklearn/notebooks/Online%20Prediction%20with%20scikit-learn.ipynb).

On both my local machine in a virtual environment and on the cloud shell,
I'm encountering errors when it comes to version creation and online
prediction. For version creation on my local machine and on the cloud shell
I'm encountering this error:

"ERROR: (gcloud.ml-engine.versions.create) Bad model detected with error:
"Failed to load model: Could not load the model:
/tmp/model/0001/model.joblib. 32. (Error code: 0)""

with Python 3.6.4 (local) and Python 3.5.6 (cloud shell) when running the
command:

gcloud ml-engine versions create $VERSION_NAME \
    --model $MODEL_NAME \
    --config config.yaml

Any help would be greatly appreciated.

Thank you,
Liam Geron

_______________________________________________
scikit-learn mailing list
scikit-learn at python.org
https://mail.python.org/mailman/listinfo/scikit-learn
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bjpobekjinilbgej.png
Type: image/png
Size: 19872 bytes
Desc: not available
URL: 

From avigross at verizon.net  Fri Jan 25 21:34:09 2019
From: avigross at verizon.net (Avi Gross)
Date: Fri, 25 Jan 2019 21:34:09 -0500
Subject: [scikit-learn] How to determine suitable cluster algo
In-Reply-To: <5c4b7219.1c69fb81.72c03.c685@mx.google.com>
References: <5c498874.1c69fb81.a65c3.68df@mx.google.com>
 <5c4af668.1c69fb81.ee649.a884@mx.google.com>
 <5c4b7219.1c69fb81.72c03.c685@mx.google.com>
Message-ID: <005701d4b51f$9e270d00$da752700$@verizon.net>

My comments are at the end as some people do not like top posts.
From: scikit-learn On Behalf Of Matti Viljamaa
Sent: Friday, January 25, 2019 3:31 PM
To: Scikit-learn mailing list
Subject: Re: [scikit-learn] How to determine suitable cluster algo

Also, remember that some algos may exhibit "sweet spots" w.r.t. computation
time and gained accuracy. So you might want to keep measuring "explained
variance" while you add complexity to your models, and then do plots of
model complexity vs explained variance. E.g. in MLPClassifier you'd plot
e.g. hidden layers against explained variance to figure out where adding
hidden layers starts to exhibit lesser gain in explained variance.

Sent from Mail for Windows 10

From: Matti Viljamaa
Sent: Friday, 25 January 2019 13.43
To: Scikit-learn mailing list
Subject: Re: [scikit-learn] How to determine suitable cluster algo

For determining what one can afford computationally, see e.g.:

https://stackoverflow.com/questions/22443041/predicting-how-long-an-scikit-learn-classification-will-take-to-run
https://www.reddit.com/r/scikit_learn/comments/a746h0/is_there_any_way_to_estimate_how_long_a_given/

Sent from Mail for Windows 10

From: lampahome
Sent: Friday, 25 January 2019 3.42
To: Scikit-learn mailing list
Subject: Re: [scikit-learn] How to determine suitable cluster algo

Maybe the suitable way is trial-and-error?

What concerns me is that my dataset is very large, and I can't try numbers
of clusters from 1 to N if I have N samples. That costs too much time.

Maybe I should choose the initial number of clusters based on execution
time, and then analyze whether the next step is to increase or decrease
the number of clusters?

thx

__COMMENT__

This is a question, not a suggestion. The poster suggested they have such a
large amount of data that looking for larger numbers of clusters to find a
"sweet" spot may take too much time.

Is there any value in taking a much smaller random sample of the data, one
that remains big enough, and trying that on a reasonable range of cluster
counts? The results would not be definitive but might supply a clue as to
what range to try again with the full data.

As I see mentioned, the run time may not be going up if the data is
constant and only the number of clusters varies. I am not sure what
clustering algorithms you want to use, but for something like K-means with
reasonable data, the number of clusters that show meaningful results is
usually much smaller than the number of items in the data. The algorithms
often terminate when successive runs show little change; this is likely a
tunable parameter. So if you ask it to make N+1 clusters, it may even
terminate sooner than for N, if that number of clusters more closely
resembles the variation in the data.

And, again, if you are using a K-means variant, it may be better to use
some human intervention to see if a particular level of clustering fits
some model you can make that explains what each cluster has in common. If
you overfit, the number of clusters can effectively become the number of
unique items in your data and probably has no meaningful purpose.

Again, just a question. There are algorithms out there that deal better
with large data than others.
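To make the sampling question concrete, here is the kind of pilot run I
mean (all sizes and the k range are invented; silhouette is just one
possible yardstick):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=100000, centers=5, random_state=0)

# pilot on a small random sample to narrow the k range cheaply
rng = np.random.RandomState(0)
sample = X[rng.choice(len(X), size=2000, replace=False)]

for k in range(2, 9):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(sample)
    print(k, silhouette_score(sample, labels))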
Avi
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From hamidizade.s at gmail.com  Sat Jan 26 12:24:02 2019
From: hamidizade.s at gmail.com (S Hamidizade)
Date: Sat, 26 Jan 2019 20:54:02 +0330
Subject: [scikit-learn] Imblearn: SMOTENC
In-Reply-To: 
References: 
Message-ID: 

Thanks. The code is provided here:
https://github.com/scikit-learn-contrib/imbalanced-learn/issues/537

Best regards,

On Thu, Jan 24, 2019 at 7:15 PM Guillaume Lemaître <g.lemaitre58 at gmail.com> wrote:

> You should open a ticket on the imbalanced-learn GitHub issue tracker.
> That makes it easier to post a reproducible example and for us to test it.
> From the error message, I understand that you have 161 features and are
> requiring a feature above the index 160.
>
> On Thu, 24 Jan 2019 at 16:19, S Hamidizade <hamidizade.s at gmail.com> wrote:
>
>> Thanks. Unfortunately, now the error is:
>> ValueError: Some of the categorical indices are out of range. Indices
>> should be between 0 and 160.
>> Best regards,
>>
>> On Sun, Jan 20, 2019 at 8:31 PM S Hamidizade <hamidizade.s at gmail.com>
>> wrote:
>>
>>> Dear Scikit-learners
>>> Hi.
>>>
>>> I would greatly appreciate it if you could let me know how to use
>>> SMOTENC. I wrote:
>>>
>>> num_indices1 = list(X.iloc[:,np.r_[0:94,95,97,100:123]].columns.values)
>>> cat_indices1 = list(X.iloc[:,np.r_[94,96,98,99,123:160]].columns.values)
>>> print(len(num_indices1))
>>> print(len(cat_indices1))
>>>
>>> pipeline=Pipeline(steps= [
>>>     # Categorical features
>>>     ('feature_processing', FeatureUnion(transformer_list = [
>>>             ('categorical', MultiColumn(cat_indices1)),
>>>
>>>             #numeric
>>>             ('numeric', Pipeline(steps = [
>>>                 ('select', MultiColumn(num_indices1)),
>>>                 ('scale', StandardScaler())
>>>             ]))
>>>         ])),
>>>     ('clf', rg)
>>>     ]
>>> )
>>>
>>> Therefore, as indicated, I have 5 categorical features. Really, indices
>>> 123 to 160 are related to one categorical feature with 37 possible
>>> values which is converted into 37 columns using get_dummies.
>>> Sorry, I think SMOTENC should be inserted before the classifier ('clf',
>>> rg) but I don't know how to define "categorical_features" in SMOTENC.
>>> Besides, could you please let me know where to use imblearn.pipeline?
>>>
>>> Thanks in advance.
>>> Best regards,
>>>
> -- 
> Guillaume Lemaitre
> INRIA Saclay - Parietal team
> Center for Data Science Paris-Saclay
> https://glemaitre.github.io/
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From suryodaybasak at gmail.com  Sun Jan 27 01:25:18 2019
From: suryodaybasak at gmail.com (Suryoday Basak)
Date: Sun, 27 Jan 2019 00:25:18 -0600
Subject: [scikit-learn] Regarding GSOC and open source contributions
Message-ID: 

Dear Team,

Could you let me know if scikit-learn might be a GSOC organization this
year? I have a few proposal ideas in mind and have been working to
implement certain methods over the existing project, and was wondering if
I could talk to someone about how to go about things.

Thank you.

Regards,
Suryoday Basak
Graduate Student, Department of Computer Science and Engineering,
The University of Texas at Arlington
Website: suryodaybasak.info
Follow me on Medium: https://medium.com/@suryodaybasak
Astroinformatics Research Group: http://astrirg.org
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From liam at chatdesk.com  Mon Jan 28 10:28:40 2019
From: liam at chatdesk.com (Liam Geron)
Date: Mon, 28 Jan 2019 10:28:40 -0500
Subject: [scikit-learn] Google Cloud ML Engine Error with Sklearn
Message-ID: 

Hi scikit learn contributors,

I am currently attempting to transfer our preexisting models into cloud ML
for scalability; however, I am encountering bugs while running through some
tutorial code found here
(https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/sklearn/notebooks/Online%20Prediction%20with%20scikit-learn.ipynb).

On both my local machine in a virtual environment and on the cloud shell,
I'm encountering errors when it comes to version creation and online
prediction. For version creation on my local machine and on the cloud shell
I'm encountering this error:

"ERROR: (gcloud.ml-engine.versions.create) Bad model detected with error:
"Failed to load model: Could not load the model:
/tmp/model/0001/model.joblib. 32. (Error code: 0)""

with Python 3.6.4 (local) and Python 3.5.6 (cloud shell) when running the
command:

gcloud ml-engine versions create $VERSION_NAME \
    --model $MODEL_NAME \
    --config config.yaml

This is running with joblib version "0.13.1" and sklearn version "0.19.1".

Any help would be greatly appreciated.

Thank you,
Liam Geron
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From pahome.chen at mirlab.org  Tue Jan 29 05:35:50 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Tue, 29 Jan 2019 18:35:50 +0800
Subject: [scikit-learn] Is there rule to determine X and y when train
 regression?
Message-ID: 

I found many examples that predict stock prices, house prices, taxi
fares... etc. The y column is almost always like below:

y: the price of the day

And X may be the day, parameters which can affect the price... etc.

Now I want to predict the sales of multiple items in multiple stores. Is it
suitable to let the decrease/increase ratio of sales be y?

The reason I'm asking is that I don't know how to explain to other people
why price as y is the normal choice. So other people may ask: can we let y
be the increase/decrease ratio instead?

thx
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From mohit.srivastava at med.unideb.hu  Tue Jan 29 08:10:30 2019
From: mohit.srivastava at med.unideb.hu (Mohit Srivastava)
Date: Tue, 29 Jan 2019 14:10:30 +0100 (CET)
Subject: [scikit-learn] sklearn.cluster.OPTICS
Message-ID: <1288774727.8422712.1548767430428.JavaMail.zimbra@zimbra.unideb.hu>

Dear all,

I want to use your clustering algorithm "sklearn.cluster.OPTICS".
But it is not working, and I found (on the internet) that it's not
available at the moment.
Could you please help me with the issue?
When would it be possible to use it?
Please reply as soon as possible.
thanks
regards
Mohit Srivastava

From adrin.jalali at gmail.com  Tue Jan 29 08:39:52 2019
From: adrin.jalali at gmail.com (Adrin)
Date: Tue, 29 Jan 2019 14:39:52 +0100
Subject: [scikit-learn] sklearn.cluster.OPTICS
In-Reply-To: <1288774727.8422712.1548767430428.JavaMail.zimbra@zimbra.unideb.hu>
References: <1288774727.8422712.1548767430428.JavaMail.zimbra@zimbra.unideb.hu>
Message-ID: 

Hi,

OPTICS is still under development and there are quite a few open issues
and PRs regarding the method. It's available on master, but not in any of
the releases yet. We will hopefully have it out for the next release.
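If you need it right away, it can be used from a development install;
usage looks roughly like this (a sketch against master, so the API may
still change before the release):

# requires a dev install, e.g.
# pip install git+https://github.com/scikit-learn/scikit-learn
import numpy as np
from sklearn.cluster import OPTICS

X = np.random.RandomState(0).randn(100, 2)
clust = OPTICS(min_samples=10).fit(X)
print(clust.labels_[:10])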
Best,
Adrin.

On Tue, 29 Jan 2019 at 14:31 Mohit Srivastava <mohit.srivastava at med.unideb.hu> wrote:

> Dear all,
>
> I want to use your clustering algorithm "sklearn.cluster.OPTICS".
> But it is not working, and I found (on the internet) that it's not
> available at the moment.
> Could you please help me with the issue?
> When would it be possible to use it?
> Please reply as soon as possible.
> thanks
> regards
> Mohit Srivastava
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From pahome.chen at mirlab.org  Wed Jan 30 05:42:41 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Wed, 30 Jan 2019 18:42:41 +0800
Subject: [scikit-learn] Can y of datasets be increasing/decreasing ratio
 when train regression model?
Message-ID: 

I found many cases on kaggle that predict a quantity or a trend. They all
set the real quantity as y.

But my question is: does anyone set the changing ratio as y?

Like:

X       y
Day1    0.2
Day2    0.1
Day3    0.15
Day4   -0.1

where y is the changing ratio compared with the previous day.

Why does everybody set the real quantity (e.g. sales, car numbers... etc.)
as y rather than the changing ratio?

I want to know whether that is based on experience or on other reasons.

thx
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From charles.y.zheng at gmail.com  Wed Jan 30 12:09:45 2019
From: charles.y.zheng at gmail.com (Charles Zheng)
Date: Wed, 30 Jan 2019 12:09:45 -0500
Subject: [scikit-learn] Can y of datasets be increasing/decreasing ratio
 when train regression model?
In-Reply-To: 
References: 
Message-ID: 

Hi lampahome,

It is a common practice in financial modeling
(https://en.wikipedia.org/wiki/Capital_asset_pricing_model).

[image: formula defining the return R_t in terms of the prices P_t]

P_t is price at time t, R_t is "return", which is the variable they are
trying to predict.

Best,

Charles

On Wed, Jan 30, 2019 at 5:43 AM lampahome <pahome.chen at mirlab.org> wrote:

> I found many cases on kaggle that predict a quantity or a trend. They all
> set the real quantity as y.
>
> But my question is: does anyone set the changing ratio as y?
>
> y is the changing ratio compared with the previous day.
>
> Why does everybody set the real quantity (e.g. sales, car numbers... etc.)
> as y rather than the changing ratio?
>
> I want to know whether that is based on experience or on other reasons.
>
> thx
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image.png
Type: image/png
Size: 2928 bytes
Desc: not available
URL: 

From joel.nothman at gmail.com  Wed Jan 30 19:46:42 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 31 Jan 2019 11:46:42 +1100
Subject: [scikit-learn] Can y of datasets be increasing/decreasing ratio
 when train regression model?
In-Reply-To: 
References: 
Message-ID: 

Particular regressors may make assumptions about the distribution of y, or
its relationship with the features X. You should be aware of those
assumptions and reason about whether they hold well enough.

A TransformedTargetRegressor may be used to make your target better match
those assumptions, e.g. by trying to predict the logarithm or power
transform of the original targets, but again you need to look at the
distribution of y and the assumptions of the regressor.
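For concreteness, a minimal sketch of that idea (synthetic data; the
log/exp pair is chosen just for illustration):

import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.uniform(0, 5, size=(200, 1))
y = np.exp(1.5 * X.ravel() + rng.normal(scale=0.2, size=200))  # skewed y

# fit a linear model to log(y), predict back on the original scale
reg = TransformedTargetRegressor(regressor=LinearRegression(),
                                 func=np.log, inverse_func=np.exp)
reg.fit(X, y)
print(reg.predict(X[:3]), y[:3])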
On Wed, 30 Jan 2019 at 21:44, lampahome <pahome.chen at mirlab.org> wrote:

> I found many cases on kaggle that predict a quantity or a trend. They all
> set the real quantity as y.
>
> But my question is: does anyone set the changing ratio as y?
>
> Like:
>
> X       y
> Day1    0.2
> Day2    0.1
> Day3    0.15
> Day4   -0.1
>
> where y is the changing ratio compared with the previous day.
>
> Why does everybody set the real quantity (e.g. sales, car numbers... etc.)
> as y rather than the changing ratio?
>
> I want to know whether that is based on experience or on other reasons.
>
> thx
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From pahome.chen at mirlab.org  Wed Jan 30 20:45:52 2019
From: pahome.chen at mirlab.org (lampahome)
Date: Thu, 31 Jan 2019 09:45:52 +0800
Subject: [scikit-learn] Can y of datasets be increasing/decreasing ratio
 when train regression model?
In-Reply-To: 
References: 
Message-ID: 

> but again you need to look at the distribution of y and the assumptions
> of the regressor.
>
So first, should I plot a graph to check the distribution of y as X
changes? I'm just wondering how to determine its distribution.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jaapvankampen at gmail.com  Thu Jan 31 04:51:36 2019
From: jaapvankampen at gmail.com (Jaap van Kampen)
Date: Thu, 31 Jan 2019 10:51:36 +0100
Subject: [scikit-learn] Bounded logistical regression in Python
Message-ID: 

Hi there!

The standard logistical regression solver in scikit-learn assumes the
regression equation:

P(X) = 1 / (1 + exp(b0 + b1*X1 + ... + bn*Xn))

.. and solves for the b's using various solver routines.

For a specific project, I'd like to bound the regression equation between
0-a (instead of 0-1) and add a variable c to center an independent variable
Xk, e.g.

P(X) = a / (1 + exp(b0 + b1*X1 + .. + bn*Xn) * (Xk - c)))
... and solve for a, b's and c.

Any thoughts/ideas on how to modify logistic.py to achieve this? I thought
of modifying the expit function to reflect the changed equation. But how do
I let the solvers know to also include the new variables a and c? Any
scripts available that are able to handle my modified logistic regression
equation?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From joel.nothman at gmail.com  Thu Jan 31 05:48:43 2019
From: joel.nothman at gmail.com (Joel Nothman)
Date: Thu, 31 Jan 2019 21:48:43 +1100
Subject: [scikit-learn] Bounded logistical regression in Python
In-Reply-To: 
References: 
Message-ID: 

I don't quite get your terminology, to "add a variable c to center an
independent variable Xk", and you've got an extra ) in your equation, so
I'm not sure exactly where you want it...

If you mean

P(X) = a / (1 + exp(b0 + b1*X1 + .. + bn*Xn) * (Xk - c))

then, since exp(z) * (Xk - c) = exp(z + log(Xk - c)), that's the same as

P(X) = a / (1 + exp(b0 + b1*X1 + .. + bn*Xn + log(Xk - c)))

so for any given a and c you've got the same old logistic regression, just
with the extra feature log(Xk - c) in the exponent, haven't you?
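And if you really do want a and c as free parameters, one pragmatic route
that avoids touching logistic.py is to fit the curve directly, e.g. with
scipy.optimize.curve_fit. A rough sketch on synthetic data (this is
least-squares rather than a proper likelihood fit, and it assumes the
reading of the equation above):

import numpy as np
from scipy.optimize import curve_fit

def bounded_logistic(X, a, b0, b1, c):
    # P = a / (1 + exp(b0 + b1*x1) * (xk - c)); one plain feature x1 plus
    # the "centered" feature xk, for brevity
    x1, xk = X
    return a / (1 + np.exp(b0 + b1 * x1) * (xk - c))

rng = np.random.RandomState(0)
x1 = rng.uniform(-2, 2, 300)
xk = rng.uniform(2, 4, 300)  # kept > c so the denominator stays positive
p_true = bounded_logistic((x1, xk), 0.8, 0.5, -1.2, 1.0)
y = rng.binomial(1, np.clip(p_true, 0, 1))

params, _ = curve_fit(bounded_logistic, (x1, xk), y, p0=[0.5, 0.0, 0.0, 1.0])
print(params)  # least-squares estimates of a, b0, b1, c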
On Thu, 31 Jan 2019 at 20:53, Jaap van Kampen <jaapvankampen at gmail.com> wrote:

> Hi there!
>
> The standard logistical regression solver in scikit-learn assumes the
> regression equation:
>
> P(X) = 1 / (1 + exp(b0 + b1*X1 + ... + bn*Xn))
>
> .. and solves for the b's using various solver routines.
>
> For a specific project, I'd like to bound the regression equation between
> 0-a (instead of 0-1) and add a variable c to center an independent
> variable Xk, e.g.
>
> P(X) = a / (1 + exp(b0 + b1*X1 + .. + bn*Xn) * (Xk - c)))
>
> ... and solve for a, b's and c.
>
> Any thoughts/ideas on how to modify logistic.py to achieve this? I thought
> of modifying the expit function to reflect the changed equation. But how
> do I let the solvers know to also include the new variables a and c? Any
> scripts available that are able to handle my modified logistic regression
> equation?
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 