From jbbrown at kuhp.kyoto-u.ac.jp Thu Nov 1 03:22:59 2018 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) Date: Thu, 1 Nov 2018 16:22:59 +0900 Subject: [scikit-learn] Can I use Sklearn Porter to Generate C++ version of Random Forest Predict function In-Reply-To: References: Message-ID: I, too, would be curious to know if anyone has any experience in doing this. J.B. On Thu, Nov 1, 2018 at 2:07, Chidhambaranathan R wrote: > Hi, > > I'd like to know if I can use sklearn_porter to generate the C++ version > of the Random Forest Regression Predict function. If sklearn_porter doesn't > work, are there any possible alternatives to generate a C++ implementation of > the RF Regressor Predict function? > > Thanks. > > -- > Regards, > Chidhambaranathan R, > PhD Student, > Electrical and Computer Engineering, > Utah State University > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joshua_feldman at g.harvard.edu Fri Nov 2 00:19:07 2018 From: joshua_feldman at g.harvard.edu (Feldman, Joshua) Date: Fri, 2 Nov 2018 00:19:07 -0400 Subject: [scikit-learn] Fairness Metrics Message-ID: Thanks Andy, I'll look into starting a scikit-learn-contrib project! Best, Josh -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Mon Nov 5 20:54:54 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 5 Nov 2018 20:54:54 -0500 Subject: [scikit-learn] Elbow method function for K-means procedure In-Reply-To: References: Message-ID: <6b3de95c-8676-9272-651a-1b8853d40d14@gmail.com> Hey. This is a method that's relatively easy to implement if someone wants to use it (as witnessed by the shortness of your notebook). It's hard to make it work robustly, though, and we have several cluster evaluation measures in sklearn already.
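For instance, a minimal sketch of using one of those built-in measures (the silhouette score) to pick the number of clusters; the toy data here is made up for illustration and is not the notebook under discussion:

```python
# Sketch: choosing k with scikit-learn's built-in silhouette score
# instead of a hand-rolled elbow criterion. Toy data is made up.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # k with the highest silhouette
```

Silhouette is just one option; sklearn.metrics ships several other clustering evaluation measures with the same labels-in, score-out interface.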
I would be more interested in seeing some stability-based evaluation. You might also be interested in this: https://github.com/scikit-learn/scikit-learn/pull/6948 Cheers, Andy On 10/30/18 6:17 PM, Maiia Bakhova wrote: > Hello everybody! > I would like to offer a new feature for consideration. > Here is my presentation: > https://github.com/Mathemilda/ElbowMethodForK-means/blob/master/Elbow_Method_for_K-Means_Clustering.ipynb > Thanks for your time! If the feature is to be accepted, can you please > tell me what the conventions are, if any, for such a function and if there is > a template or other helpful material. > Best regards, > Maiia Bakhova > --------------------- > Mathematician in Data Science > https://www.linkedin.com/in/myabakhova > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From mail at jakob-zeitler.de Tue Nov 6 12:07:26 2018 From: mail at jakob-zeitler.de (Jakob Zeitler) Date: Tue, 6 Nov 2018 12:07:26 -0500 Subject: [scikit-learn] Outlier Detection: Contributing a new Estimator: Rank-Based Outlier Detection Message-ID: <87879542-2E72-44A0-9425-F5ABB1F46330@jakob-zeitler.de> Dear sklearners, I have been working on a rank-based outlier detection algorithm (RBDA) developed here at Syracuse, whose code I would like to contribute to sklearn as it gives a viable alternative to established algorithms such as LOF (https://www.tandfonline.com/doi/abs/10.1080/00949655.2011.621124) Should I be fine if I keep to the general contribution rules regarding estimators? (http://scikit-learn.org/stable/developers/contributing.html#rolling-your-own-estimator) Are they up to date? Because RBDA has <200 citations, I assume it will not pass the inclusion criteria (http://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms) and therefore I assume I am dealing with a case of "scikit-learn-contrib"
as discussed here (https://github.com/scikit-learn-contrib/scikit-learn-contrib/blob/master/workflow.md) If anyone can share common pitfalls of that process, that would be great! Thanks a lot, Jakob -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Tue Nov 6 12:38:24 2018 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Tue, 6 Nov 2018 18:38:24 +0100 Subject: [scikit-learn] Outlier Detection: Contributing a new Estimator: Rank-Based Outlier Detection In-Reply-To: <87879542-2E72-44A0-9425-F5ABB1F46330@jakob-zeitler.de> References: <87879542-2E72-44A0-9425-F5ABB1F46330@jakob-zeitler.de> Message-ID: If you are going to make a scikit-learn-contrib project, we recently updated and simplified the project template: On Tue, 6 Nov 2018 at 18:26, Jakob Zeitler wrote: > Dear sklearners, > > I have been working on a rank-based outlier detection algorithm (RBDA) > developed here at Syracuse, of which the code I would like to contribute to > sklearn as it gives a viable alternative to established algorithms such as > LOF (https://www.tandfonline.com/doi/abs/10.1080/00949655.2011.621124) > > Should I be fine if I keep to the general contribution rules regarding > estimators? ( > http://scikit-learn.org/stable/developers/contributing.html#rolling-your-own-estimator) > Are they up to date? > > Because RBDA is <200 citations, I assume it will not pass the inclusion > criteria ( > http://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms) > therefore I assume I am dealing with a case of "scikit-learn-contrib" as > discussed here ( > https://github.com/scikit-learn-contrib/scikit-learn-contrib/blob/master/workflow.md > ) > > If anyone can share common pitfalls of that process, that would be great!
> > Thanks a lot, > Jakob > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Guillaume Lemaitre INRIA Saclay - Parietal team Center for Data Science Paris-Saclay https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Tue Nov 6 12:39:44 2018 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Tue, 6 Nov 2018 18:39:44 +0100 Subject: [scikit-learn] Outlier Detection: Contributing a new Estimator: Rank-Based Outlier Detection In-Reply-To: References: <87879542-2E72-44A0-9425-F5ABB1F46330@jakob-zeitler.de> Message-ID: Oops, here is the rest of the message: https://github.com/scikit-learn-contrib/project-template You can refer to: https://sklearn-template.readthedocs.io/en/latest/ and the user guide, which is really similar to the documentation that you mentioned. On Tue, 6 Nov 2018 at 18:38, Guillaume Lemaître wrote: > If you are going to make a scikit-learn-contrib project, we recently > updated and simplified the project template: > > On Tue, 6 Nov 2018 at 18:26, Jakob Zeitler wrote: > >> Dear sklearners, >> >> I have been working on a rank-based outlier detection algorithm (RBDA) >> developed here at Syracuse, of which the code I would like to contribute to >> sklearn as it gives a viable alternative to established algorithms such as >> LOF (https://www.tandfonline.com/doi/abs/10.1080/00949655.2011.621124) >> >> Should I be fine if I keep to the general contribution rules regarding >> estimators? ( >> http://scikit-learn.org/stable/developers/contributing.html#rolling-your-own-estimator) >> Are they up to date?
>> >> Because RBDA is <200 citations, I assume it will not pass the inclusion >> criteria ( >> http://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms) >> therefore I assume I am dealing with a case of "scikit-learn-contrib" as >> discussed here ( >> https://github.com/scikit-learn-contrib/scikit-learn-contrib/blob/master/workflow.md >> ) >> >> If anyone can share common pitfalls of that process, that would be great! >> >> Thanks a lot, >> Jakob >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > > -- > Guillaume Lemaitre > INRIA Saclay - Parietal team > Center for Data Science Paris-Saclay > https://glemaitre.github.io/ > -- Guillaume Lemaitre INRIA Saclay - Parietal team Center for Data Science Paris-Saclay https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From cnaathan at gmail.com Tue Nov 6 13:05:58 2018 From: cnaathan at gmail.com (Chidhambaranathan R) Date: Tue, 6 Nov 2018 11:05:58 -0700 Subject: [scikit-learn] Random Forest Regressor -- Implementation in C++ Message-ID: Hi all, I'm using the sklearn Random Forest Regressor. I have trained the model in Python and dumped it using Pickle. The model is used inside a C++ application for prediction. Currently I have implemented the predict function using the "Embedding Python in C++" approach. However, it causes huge runtime overhead and I can't afford that much runtime. Is there a way I can extract the decision tree information and RF parameters from the trained RF model? I could then implement the decision trees with the RF parameters in C++. If you are aware of alternative directions, please let me know. I would appreciate it. Thanks. -- Regards, Chidham -------------- next part -------------- An HTML attachment was scrubbed...
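For context, the decision-tree information asked about here is exposed on each fitted tree via its `tree_` attribute, so the split arrays can be dumped and the traversal re-implemented in C++. A rough sketch on made-up toy data (this mirrors `rf.predict` for single-output regression, but it is an illustration, not production code):

```python
# Sketch: exporting the flat per-tree arrays of a fitted
# RandomForestRegressor so a C++ port can replay the traversal.
# Toy data is made up for illustration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X, y = rng.rand(100, 4), rng.rand(100)
rf = RandomForestRegressor(n_estimators=3, random_state=0).fit(X, y)

# For node i: children_left[i]/children_right[i] are child indices
# (-1 marks a leaf), feature[i]/threshold[i] define the split, and
# value[i] holds the node's prediction.
trees = []
for est in rf.estimators_:
    t = est.tree_
    trees.append({
        "children_left": t.children_left.tolist(),
        "children_right": t.children_right.tolist(),
        "feature": t.feature.tolist(),
        "threshold": t.threshold.tolist(),
        "value": t.value.reshape(-1).tolist(),
    })

def predict_one(x):
    # The same loop a C++ port would run; forest output is the
    # average of the per-tree leaf values.
    total = 0.0
    for t in trees:
        node = 0
        while t["children_left"][node] != -1:  # -1 == leaf
            if x[t["feature"][node]] <= t["threshold"][node]:
                node = t["children_left"][node]
            else:
                node = t["children_right"][node]
        total += t["value"][node]
    return total / len(trees)
```

The arrays could be serialized to any format the C++ side can parse (e.g. JSON or a flat binary blob); the traversal itself is a handful of lines of C++.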
URL: From t3kcit at gmail.com Tue Nov 6 13:12:33 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 6 Nov 2018 13:12:33 -0500 Subject: [scikit-learn] Outlier Detection: Contributing a new Estimator: Rank-Based Outlier Detection In-Reply-To: <87879542-2E72-44A0-9425-F5ABB1F46330@jakob-zeitler.de> References: <87879542-2E72-44A0-9425-F5ABB1F46330@jakob-zeitler.de> Message-ID: <6c66d771-915f-86c6-dff0-8a61db0574b6@gmail.com> Hi Jakob. Sounds like you read up on all the right things. Indeed sounds like a case for scikit-learn-contrib. I think the most common pitfall is that it might take some time for someone to review the project to get merged into scikit-learn-contrib. I'm not sure if there's a backlog right now. Though Alex Gramfort might be interested in this, which might speed up the process ;) Cheers, Andy On 11/6/18 9:07 AM, Jakob Zeitler wrote: > Dear sklearners, > > I have been working on a rank-based outlier detection algorithm (RBDA) > developed here at Syracuse, of which the code I would like to > contribute to sklearn as it gives a viable alternative to established > algorithms such as LOF > (https://www.tandfonline.com/doi/abs/10.1080/00949655.2011.621124) > > Should I be fine if I keep to the general contribution rules regarding > estimators? > (http://scikit-learn.org/stable/developers/contributing.html#rolling-your-own-estimator) > Are they up to date? > > Because RBDA is <200 citations, I assume it will not pass the > inclusion criteria > (http://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms) > therefore I assume I am dealing with a case of "scikit-learn-contrib" > as discussed here > (https://github.com/scikit-learn-contrib/scikit-learn-contrib/blob/master/workflow.md) > > If anyone can share common pitfalls of that process, that would be great!
> > Thanks a lot, > Jakob > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Tue Nov 6 13:20:09 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 6 Nov 2018 13:20:09 -0500 Subject: [scikit-learn] Pipegraph example: KMeans + LDA In-Reply-To: References: <681604c2-9f15-6692-682c-728f81e1d2ef@gmail.com> Message-ID: On 10/29/18 8:08 AM, Manuel Castejón Limas wrote: > The long story short: Thank you for your time & sorry for > inaccuracies; a few words selling a modular approach to your > developments; and a request on your opinion on parallelizing > Pipegraph using dask. I'm not very experienced with dask, so I'm probably not the right person to help you. And I totally get that pipegraph is more flexible than whatever hack I came up with :) In the meantime, Microsoft launched NimbusML: https://docs.microsoft.com/en-us/nimbusml/overview It actually implements something very similar to pipegraph on top of ML.NET, FYI. And I also gave the MS people a hard time when discussing their pipeline object ;) I'm still not entirely convinced this is necessary, but for NimbusML, the underlying library is built with the DAG in mind. So different algorithms have different output slots that you can tap into, while sklearn basically "only" has transform and predict (and predict proba).
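For context on the "only transform and predict" point: sklearn's own Pipeline composes steps strictly linearly, each step feeding its transform() output to the next, which is exactly the limitation a DAG tool like pipegraph generalizes. A minimal sketch on made-up toy data (the KMeans-into-classifier pairing is illustrative, not the thread's actual KMeans + LDA example):

```python
# Sketch: sklearn's built-in composition is a linear chain --
# each step's transform() output becomes the next step's input,
# and the final step supplies predict(). Toy data is made up.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = make_blobs(n_samples=200, centers=3, random_state=0)

# KMeans.transform() yields distances to the cluster centers,
# which the classifier consumes; a DAG library could instead
# route several different outputs to several consumers.
pipe = Pipeline([
    ("km", KMeans(n_clusters=3, random_state=0)),
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)
pred = pipe.predict(X)
```

Anything beyond this single chain (multiple inputs, multiple tapped outputs) is where pipegraph-style DAGs come in.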
From joel.nothman at gmail.com Tue Nov 6 15:10:56 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Wed, 7 Nov 2018 07:10:56 +1100 Subject: [scikit-learn] Random Forest Regressor -- Implementation in C++ In-Reply-To: References: Message-ID: See https://github.com/ajtulloch/sklearn-compiledtrees/ and https://github.com/nok/sklearn-porter -------------- next part -------------- An HTML attachment was scrubbed... URL: From fabian.sippl at gmx.net Tue Nov 6 17:37:07 2018 From: fabian.sippl at gmx.net (Fabian Sippl) Date: Tue, 6 Nov 2018 23:37:07 +0100 Subject: [scikit-learn] Mailing List Message-ID: An HTML attachment was scrubbed...
URL: From karen.cj at gmail.com Tue Nov 6 17:39:20 2018 From: karen.cj at gmail.com (Chen Jin) Date: Tue, 6 Nov 2018 17:39:20 -0500 Subject: [scikit-learn] Mailing List In-Reply-To: References: Message-ID: Could you please remove me from your mailing list? On Tue, Nov 6, 2018 at 5:38 PM Fabian Sippl wrote: > Hi Scikit-Team, > > Could you please remove me from the mailing list ? > > Thank you! > > Kind regards, > Fabian Sippl > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From g.lemaitre58 at gmail.com Tue Nov 6 17:47:47 2018 From: g.lemaitre58 at gmail.com (=?UTF-8?Q?Guillaume_Lema=C3=AEtre?=) Date: Tue, 6 Nov 2018 23:47:47 +0100 Subject: [scikit-learn] Mailing List In-Reply-To: References: Message-ID: Unsubscribe yourself at https://mail.python.org/mailman/listinfo/scikit-learn On Tue, 6 Nov 2018 at 23:42, Chen Jin wrote: > Could you please remove me from your mailing list? > > On Tue, Nov 6, 2018 at 5:38 PM Fabian Sippl wrote: > >> Hi Scikit-Team, >> >> Could you please remove me from the mailing list ? >> >> Thank you! >> >> Kind regards, >> Fabian Sippl >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -- Guillaume Lemaitre INRIA Saclay - Parietal team Center for Data Science Paris-Saclay https://glemaitre.github.io/ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From olivier.grisel at ensta.org Wed Nov 7 04:17:04 2018 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Wed, 7 Nov 2018 10:17:04 +0100 Subject: [scikit-learn] Random Forest Regressor -- Implementation in C++ In-Reply-To: References: Message-ID: You might also want to have a look at https://github.com/onnx/onnxmltools although I am not sure if there are RF-optimized ONNX runtimes at this point. -- Olivier -------------- next part -------------- An HTML attachment was scrubbed... URL: From emmanuelarias30 at gmail.com Wed Nov 7 07:01:20 2018 From: emmanuelarias30 at gmail.com (eamanu15) Date: Wed, 7 Nov 2018 09:01:20 -0300 Subject: [scikit-learn] scikit-learn Digest, Vol 32, Issue 5 In-Reply-To: References: Message-ID: Hello! I can help with the new estimator. Jakob, I will read your article and if you want we can start making a formal proposal to sklearn. Like Andy says, this sounds like a case for scikit-learn-contrib. Regards! Emmanuel -------------- next part -------------- An HTML attachment was scrubbed... URL: From immudzen at gmail.com Wed Nov 7 07:01:30 2018 From: immudzen at gmail.com (William Heymann) Date: Wed, 7 Nov 2018 13:01:30 +0100 Subject: [scikit-learn] KernelDensity bandwidth hyper parameter optimization Message-ID: Hello, I am trying to tune the bandwidth for my KernelDensity. I need to find out what optimization goal to use. I started with

from sklearn.grid_search import GridSearchCV
grid = GridSearchCV(KernelDensity(),
                    {'bandwidth': np.linspace(0.1, 1.0, 30)},
                    cv=20)  # 20-fold cross-validation
grid.fit(x[:, None])
print grid.best_params_

from https://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/#Bandwidth-Cross-Validation-in-Scikit-Learn I have also used RandomizedSearchCV to optimize the parameters. The problem I have is that neither refines the answer, so if I don't sample at high enough density I don't get a good answer. What I would like to do is use the same goal but put it into a different global optimizer.
I have looked through the code for GridSearchCV and RandomizedSearchCV and I have not been able to figure out yet what the actual optimization goal is. Originally I thought the system was using something like

kde_bw = KernelDensity(kernel='gaussian', bandwidth=bw)
score = max(cross_val_score(kde_bw, data, cv=3))

and then trying to minimize that score, but it does not seem likely given the results. If someone could help me with the goal to optimize I should be able to solve the rest of the problem on my own. Thanks Bill -------------- next part -------------- An HTML attachment was scrubbed... URL: From emmanuelarias30 at gmail.com Wed Nov 7 07:04:55 2018 From: emmanuelarias30 at gmail.com (eamanu15) Date: Wed, 7 Nov 2018 09:04:55 -0300 Subject: [scikit-learn] Outlier Detection: Contributing a new Estimator: Rank-Based Outlier Detection In-Reply-To: References: Message-ID: Oops, I forgot to edit the subject. This is my message: Hello! I can help with the new estimator. Jakob, I will read your article and if you want we can start making a formal proposal to sklearn. Like Andy says, this sounds like a case for scikit-learn-contrib. Regards! Emmanuel -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Wed Nov 7 18:05:59 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 7 Nov 2018 18:05:59 -0500 Subject: [scikit-learn] KernelDensity bandwidth hyper parameter optimization In-Reply-To: References: Message-ID: <8914b70d-8087-0d48-3228-263397bc635d@gmail.com> On 11/7/18 4:01 AM, William Heymann wrote: > Hello, > > I am trying to tune the bandwidth for my KernelDensity. I need to find > out what optimization goal to use.
> > I started with > > from sklearn.grid_search import GridSearchCV > grid = GridSearchCV(KernelDensity(), > {'bandwidth': np.linspace(0.1, 1.0, 30)}, > cv=20) # 20-fold cross-validation > grid.fit(x[:, None]) > print grid.best_params_ > > From > https://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/#Bandwidth-Cross-Validation-in-Scikit-Learn > > I have also used RandomizedSearchCV to optimize the parameters. > > The problem I have is that neither refines the answer so if I don't > sample at high enough density I don't get a good answer. What I would > like to do is use the same goal but put it into a different global > optimizer. > > I have looked through the code for GridSearchCV and RandomizedSearchCV > and I have not been able to figure out yet what is the actual > optimization goal. > > Originally I thought the system was using something like > > kde_bw = KernelDensity(kernel='gaussian', bandwidth=bw) > score = max(cross_val_score(kde_bw, data, cv=3)) > That's basically what it's doing. It's maximizing the "score" method of KernelDensity. You could look at scikit-optimize for a more elaborate optimizer (or try using any of the scipy ones). -------------- next part -------------- An HTML attachment was scrubbed... URL: From myabakhova at gmail.com Thu Nov 8 15:03:30 2018 From: myabakhova at gmail.com (Maiia Bakhova) Date: Thu, 8 Nov 2018 12:03:30 -0800 Subject: [scikit-learn] Elbow method function for K-means procedure (Andreas Mueller) In-Reply-To: References: Message-ID: Hello Andreas, thanks for your reply! Sorry, I did not know about these methods. I guess it is a recent development and I should have checked it out. I came up with my idea after advising students on a machine learning course to compute cosines, and judging by their reaction the idea is not so easy to code. I googled it and found nothing about such an implementation. The only mention of cosines for clustering is the use of cosine distance, the so-called cosine similarity.
(I suspect that remembering the cosine formula and how cosine values can be used to compare angle values is difficult for some.) And the length of a script is not necessarily an indication of its usefulness. Although if such an evaluation of the optimal cluster number is already covered by other methods, then my input might not be needed. I will look up other methods using your link and see if I can contribute here. Best, Mya -- Maiia Bakhova Mathematician in Data Science https://www.linkedin.com/in/myabakhova From t3kcit at gmail.com Wed Nov 14 15:59:52 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 14 Nov 2018 15:59:52 -0500 Subject: [scikit-learn] make all new parameters keyword-only? Message-ID: Hi all. Since we'll be dropping Python 2.7 soon, we can now use keyword-only arguments. I would argue that whenever we add any argument anywhere, we should make it keyword-only from now on, with the exception of X and y (probably). What do others think? Are there other features in Python 3 that we should consider adopting for 0.21? The reason for making arguments keyword-only is that a) users are forced to write more readable code b) deprecations and API changes have fewer side-effects Cheers, Andy From qinhanmin2005 at sina.com Wed Nov 14 20:36:28 2018 From: qinhanmin2005 at sina.com (Hanmin Qin) Date: Thu, 15 Nov 2018 09:36:28 +0800 Subject: [scikit-learn] make all new parameters keyword-only? Message-ID: <20181115013628.4C9B8464009D@webmail.sinamail.sina.com.cn> I agree that this feature is advantageous and I'm +1 to apply it to new classes/functions, but for existing classes/functions, does it seem strange that only certain arguments are keyword only (i.e., some arguments can be specified by position, while others can't)? Hanmin Qin ----- Original Message ----- From: Andreas Mueller To: scikit-learn at python.org Subject: [scikit-learn] make all new parameters keyword-only? Date: 2018-11-15 05:01 Hi all.
Since we'll be dropping Python 2.7 soon, we can now use keyword-only arguments. I would argue that whenever we add any argument anywhere, we should make it keyword-only from now on, with the exception of X and y (probably). What do others think? Are there other features in Python 3 that we should consider adopting for 0.21? The reason for making arguments keyword-only is that a) users are forced to write more readable code b) deprecations and API changes have fewer side-effects Cheers, Andy _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Wed Nov 14 21:08:33 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 14 Nov 2018 21:08:33 -0500 Subject: [scikit-learn] make all new parameters keyword-only? In-Reply-To: <20181115013628.4C9B8464009D@webmail.sinamail.sina.com.cn> References: <20181115013628.4C9B8464009D@webmail.sinamail.sina.com.cn> Message-ID: <58a12a42-2bde-0f12-fa33-f63876e14405@gmail.com> On 11/14/18 8:36 PM, Hanmin Qin wrote: > I agree that this feature is advantageous and I'm +1 to apply it to > new classes/functions, but for existing classes/functions, does it > seem strange that only certain arguments are keyword only (i.e., some > arguments can be specified by position, while others can't)? > Yes, but it would discourage users from specifying any by position, which I think they really shouldn't. No-one understands what RandomForestClassifier(100, gini, 5, 6, 7, .3, .4, 3, .1, .2, True) means. And if we deprecate a parameter it might still run but mean something else. From joel.nothman at gmail.com Thu Nov 15 02:12:35 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 15 Nov 2018 18:12:35 +1100 Subject: [scikit-learn] make all new parameters keyword-only?
In-Reply-To: <58a12a42-2bde-0f12-fa33-f63876e14405@gmail.com> References: <20181115013628.4C9B8464009D@webmail.sinamail.sina.com.cn> <58a12a42-2bde-0f12-fa33-f63876e14405@gmail.com> Message-ID: We could just announce that we will be making this a syntactic constraint from version X and make the change wholesale then. It would be less formal backwards compatibility than we usually hold by, but we already are loose with parameter ordering when adding new ones. It would be great if after this change we could then reorder parameters to make some sense! -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Thu Nov 15 04:01:55 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Thu, 15 Nov 2018 10:01:55 +0100 Subject: [scikit-learn] make all new parameters keyword-only? In-Reply-To: References: <20181115013628.4C9B8464009D@webmail.sinamail.sina.com.cn> <58a12a42-2bde-0f12-fa33-f63876e14405@gmail.com> Message-ID: <20181115090155.gjgmj2ybyqhrzpbr@phare.normalesup.org> I am really in favor of the general idea: it is much better to use named arguments for everybody (for readability, and to be less dependent on parameter ordering). However, I would maintain that we need to move slowly with backward compatibility: changing a library in a backward-incompatible way brings much more loss than benefit to our users. So +1 for enforcing the change on all new arguments, but -1 for changing orders in the existing arguments any time soon. I agree that it would be good to push this change in existing models. We should probably announce it strongly well in advance, make sure that all our examples are changed (people copy-paste), wait a lot, and find a moment to squeeze this in. Gaël On Thu, Nov 15, 2018 at 06:12:35PM +1100, Joel Nothman wrote: > We could just announce that we will be making this a syntactic constraint from > version X and make the change wholesale then.
It would be less formal backwards > compatibility than we usually hold by, but we already are loose with parameter > ordering when adding new ones. > It would be great if after this change we could then reorder parameters to make > some sense! > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Senior Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From joel.nothman at gmail.com Thu Nov 15 06:35:15 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Thu, 15 Nov 2018 22:35:15 +1100 Subject: [scikit-learn] make all new parameters keyword-only? In-Reply-To: <20181115090155.gjgmj2ybyqhrzpbr@phare.normalesup.org> References: <20181115013628.4C9B8464009D@webmail.sinamail.sina.com.cn> <58a12a42-2bde-0f12-fa33-f63876e14405@gmail.com> <20181115090155.gjgmj2ybyqhrzpbr@phare.normalesup.org> Message-ID: I think there are cases where the first few arguments would be better to maintain as positional, but users would very rarely use more than two, and we have long assumed keyword arguments in most cases, and never received complaints when we have inserted not at the end or deprecated in the middle. I would expect that forcing all params after the first two (or three) to be keyword arguments would formalise existing tacit assumptions, would benefit maintainability, and would break very little code. On Thu, 15 Nov 2018 at 20:34, Gael Varoquaux wrote: > I am really in favor of the general idea: it is much better to use named > arguments for everybody (for readability, and to be less depend on > parameter ordering). > > However, I would maintain that we need to move slowly with backward > compatibility: changing in a backward-incompatible way a library brings > much more loss than benefit to our users. 
> > So +1 for enforcing the change on all new arguments, but -1 for changing > orders in the existing arguments any time soon. > > I agree that it would be good to push this change in existing models. We > should probably announce it strongly well in advance, make sure that all > our examples are changed (people copy-paste), wait a lot, and find a > moment to squeeze this in. > > Ga?l > > On Thu, Nov 15, 2018 at 06:12:35PM +1100, Joel Nothman wrote: > > We could just announce that we will be making this a syntactic > constraint from > > version X and make the change wholesale then. It would be less formal > backwards > > compatibility than we usually hold by, but we already are loose with > parameter > > ordering when adding new ones. > > > It would be great if after this change we could then reorder parameters > to make > > some sense! > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > -- > Gael Varoquaux > Senior Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Thu Nov 15 08:59:08 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Thu, 15 Nov 2018 08:59:08 -0500 Subject: [scikit-learn] make all new parameters keyword-only? 
In-Reply-To: References: <20181115013628.4C9B8464009D@webmail.sinamail.sina.com.cn> <58a12a42-2bde-0f12-fa33-f63876e14405@gmail.com> <20181115090155.gjgmj2ybyqhrzpbr@phare.normalesup.org> Message-ID: <67bbbcc9-640e-0f25-62c6-65855c5cb2a2@gmail.com> On 11/15/18 6:35 AM, Joel Nothman wrote: > I think there are cases where the first few arguments would be better > to maintain as positional, but users would very rarely use more than > two, and we have long assumed keyword arguments in most cases, and > never received complaints when we have inserted not at the end or > deprecated in the middle. > > I would expect that forcing all params after the first two (or three) > to be keyword arguments would formalise existing tacit assumptions, > would benefit maintainability, and would break very little code. > I was about to say "you expect but we have no way to measure that". But then I realized we totally have a way to measure that (if using the open source code on bigquery counts). I could try to see if people use positional arguments and where. No promise on timeline though. I think there is little harm in doing it for new parameters while we figure this out, though? > On Thu, 15 Nov 2018 at 20:34, Gael Varoquaux > > > wrote: > > I am really in favor of the general idea: it is much better to use > named > arguments for everybody (for readability, and to be less depend on > parameter ordering). > > However, I would maintain that we need to move slowly with backward > compatibility: changing in a backward-incompatible way a library > brings > much more loss than benefit to our users. > > So +1 for enforcing the change on all new arguments, but -1 for > changing > orders in the existing arguments any time soon. > > I agree that it would be good to push this change in existing > models. We > should probably announce it strongly well in advance, make sure > that all > our examples are changed (people copy-paste), wait a lot, and find a > moment to squeeze this in. 
> > Gaël > > On Thu, Nov 15, 2018 at 06:12:35PM +1100, Joel Nothman wrote: > > We could just announce that we will be making this a syntactic > constraint from > > version X and make the change wholesale then. It would be less > formal backwards > > compatibility than we usually hold by, but we already are loose > with parameter > > ordering when adding new ones. > > > It would be great if after this change we could then reorder > parameters to make > > some sense! > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > -- > Gael Varoquaux > Senior Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Thu Nov 15 09:14:30 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Thu, 15 Nov 2018 09:14:30 -0500 Subject: [scikit-learn] Next Sprint Message-ID: Hey folks. Are there any plans for a next sprint (in Paris/Europe?)? OpenML would like to join us next time and I think that would be cool. But they (and I ;) need some advance planning. I also have some funding that I could use for this as well. Cheers, Andy From gael.varoquaux at normalesup.org Thu Nov 15 09:40:15 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Thu, 15 Nov 2018 15:40:15 +0100 Subject: [scikit-learn] make all new parameters keyword-only?
In-Reply-To: <67bbbcc9-640e-0f25-62c6-65855c5cb2a2@gmail.com> References: <20181115013628.4C9B8464009D@webmail.sinamail.sina.com.cn> <58a12a42-2bde-0f12-fa33-f63876e14405@gmail.com> <20181115090155.gjgmj2ybyqhrzpbr@phare.normalesup.org> <67bbbcc9-640e-0f25-62c6-65855c5cb2a2@gmail.com> Message-ID: <20181115144015.fcr6cs7pe52kvyam@phare.normalesup.org> On Thu, Nov 15, 2018 at 08:59:08AM -0500, Andreas Mueller wrote: > I could try to see if people use positional arguments and where. No promise on > timeline though. If someone, you or someone else, does that, it would be very useful. > I think there is little harm in doing it for new parameters while we figure > this out, though? Totally! Gaël > On Thu, 15 Nov 2018 at 20:34, Gael Varoquaux > wrote: > I am really in favor of the general idea: it is much better to use > named > arguments for everybody (for readability, and to be less depend on > parameter ordering). > However, I would maintain that we need to move slowly with backward > compatibility: changing in a backward-incompatible way a library brings > much more loss than benefit to our users. > So +1 for enforcing the change on all new arguments, but -1 for > changing > orders in the existing arguments any time soon. > I agree that it would be good to push this change in existing models. > We > should probably announce it strongly well in advance, make sure that > all > our examples are changed (people copy-paste), wait a lot, and find a > moment to squeeze this in. > Gaël > On Thu, Nov 15, 2018 at 06:12:35PM +1100, Joel Nothman wrote: > > We could just announce that we will be making this a syntactic > constraint from > > version X and make the change wholesale then. It would be less formal > backwards > > compatibility than we usually hold by, but we already are loose with > parameter > > ordering when adding new ones. > > It would be great if after this change we could then reorder > parameters to make > > some sense!
> > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Senior Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From gael.varoquaux at normalesup.org Thu Nov 15 09:41:20 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Thu, 15 Nov 2018 15:41:20 +0100 Subject: [scikit-learn] Next Sprint In-Reply-To: References: Message-ID: <20181115144120.nnpufumsosmpamov@phare.normalesup.org> On Thu, Nov 15, 2018 at 09:14:30AM -0500, Andreas Mueller wrote: > Are there any plans for a next sprint (in Paris/Europe?)? We're happy to host one in Paris, ideally in second half of February. Gaël > OpenML would like to join us next time and I think that would be cool. > But they (and I ;) need some advance planning. > I also have some funding that I could use for this as well. > Cheers, > Andy > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Senior Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From adrin.jalali at gmail.com Thu Nov 15 10:43:20 2018 From: adrin.jalali at gmail.com (Adrin) Date: Thu, 15 Nov 2018 16:43:20 +0100 Subject: [scikit-learn] Next Sprint In-Reply-To: <20181115144120.nnpufumsosmpamov@phare.normalesup.org> References: <20181115144120.nnpufumsosmpamov@phare.normalesup.org> Message-ID: Not sure if it counts and/or it'll be as fruitful as one of yours, but we're trying to gauge/raise the interest in a meetup this coming Tuesday in Berlin, and if there's interest, we'll try to organize one in Jan/Feb. Cheers, Adrin.
On Thu, 15 Nov 2018 at 15:44 Gael Varoquaux wrote: > On Thu, Nov 15, 2018 at 09:14:30AM -0500, Andreas Mueller wrote: > > Are there any plans for a next sprint (in Paris/Europe?)? > > We're happy to host one in Paris, ideally in second half of February. > > Gaël > > > OpenML would like to join us next time and I think that would be cool. > > But they (and I ;) need some advance planning. > > > I also have some funding that I could use for this as well. > > > Cheers, > > > Andy > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > -- > Gael Varoquaux > Senior Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 <+33%201%2069%2008%2079%2068> > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Thu Nov 15 11:59:10 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Thu, 15 Nov 2018 11:59:10 -0500 Subject: [scikit-learn] Next Sprint In-Reply-To: <20181115144120.nnpufumsosmpamov@phare.normalesup.org> References: <20181115144120.nnpufumsosmpamov@phare.normalesup.org> Message-ID: <10d2f8f2-efbf-c72d-2d96-b1b585003a47@gmail.com> On 11/15/18 9:41 AM, Gael Varoquaux wrote: > On Thu, Nov 15, 2018 at 09:14:30AM -0500, Andreas Mueller wrote: >> Are there any plans for a next sprint (in Paris/Europe?)? > We're happy to host one in Paris, ideally in second half of February. > I have to teach, so I'd prefer summer. I'm teaching Monday and Wednesday. I can try to take one of these days off to go to Paris, but it's not my ideal scenario. So fly out Monday night, arrive Tuesday after lunch, fly out Sunday morning.
Or fly out Wednesday night, come back Tuesday. Obviously there are more considerations than my schedule, but it'd be great if I could join ;) Joaquin, if you're reading this: how do you feel about second half of Feb? Munich can't do it, right? Joel: I assume you have other things to do? I could see if I'm allowed to buy you a business ticket ;) From joel.nothman at gmail.com Thu Nov 15 22:32:49 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Fri, 16 Nov 2018 14:32:49 +1100 Subject: [scikit-learn] Next Sprint In-Reply-To: <10d2f8f2-efbf-c72d-2d96-b1b585003a47@gmail.com> References: <20181115144120.nnpufumsosmpamov@phare.normalesup.org> <10d2f8f2-efbf-c72d-2d96-b1b585003a47@gmail.com> Message-ID: Ha! Well, it looks like I won't be teaching the NLP unit at my uni next year (would usually occupy me March-July), so there is no fundamental problem with disappearing in February, if I can get babysitters, and my boss, on board. (Although I am trying to plan another overseas trip for April, but that would be with kids...) -------------- next part -------------- An HTML attachment was scrubbed... URL: From jbbrown at kuhp.kyoto-u.ac.jp Thu Nov 15 23:18:46 2018 From: jbbrown at kuhp.kyoto-u.ac.jp (Brown J.B.) Date: Fri, 16 Nov 2018 13:18:46 +0900 Subject: [scikit-learn] make all new parameters keyword-only? In-Reply-To: <20181115090155.gjgmj2ybyqhrzpbr@phare.normalesup.org> References: <20181115013628.4C9B8464009D@webmail.sinamail.sina.com.cn> <58a12a42-2bde-0f12-fa33-f63876e14405@gmail.com> <20181115090155.gjgmj2ybyqhrzpbr@phare.normalesup.org> Message-ID: As an end-user, I would strongly support the idea of future enforcement of keyword arguments for new parameters. In my group, we hold a standard that we develop APIs where _all_ arguments must be given by keyword (slightly pedantic style, but has shown to have benefits). Initialization/call-time state checks are done by a class' internal methods.
As Andy said, one could consider leaving prototypical X,y as positional, but one benefit my group has seen with full keyword parameterization is the ability to write code for small investigations where we are more concerned with effects from parameters rather than the data (e.g., a fixed problem to model, and one wants to first see on the code line what the estimators and their parameterizations were). If one could shift the sklearn X,y to the back of a function call, it would enable all participants in a face-to-face code review session to quickly see the emphasis/context of the discussion and move to the conclusion faster. To satisfy keyword X,y as well, I would presume that the BaseEstimator would need to have a sanity check for error-raising default X,y values -- though does it not have many checks on X,y already? Not sure if everyone else agrees about keyword X and y, but just a thought for consideration. Kind regards, J.B. 2018/11/15 (Thu) 18:34 Gael Varoquaux : > I am really in favor of the general idea: it is much better to use named > arguments for everybody (for readability, and to be less depend on > parameter ordering). > > However, I would maintain that we need to move slowly with backward > compatibility: changing in a backward-incompatible way a library brings > much more loss than benefit to our users. > > So +1 for enforcing the change on all new arguments, but -1 for changing > orders in the existing arguments any time soon. > > I agree that it would be good to push this change in existing models. We > should probably announce it strongly well in advance, make sure that all > our examples are changed (people copy-paste), wait a lot, and find a > moment to squeeze this in. > > Gaël > > On Thu, Nov 15, 2018 at 06:12:35PM +1100, Joel Nothman wrote: > > We could just announce that we will be making this a syntactic > constraint from > > version X and make the change wholesale then.
It would be less formal > backwards > > compatibility than we usually hold by, but we already are loose with > parameter > > ordering when adding new ones. > > > It would be great if after this change we could then reorder parameters > to make > > some sense! > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > -- > Gael Varoquaux > Senior Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 > http://gael-varoquaux.info http://twitter.com/GaelVaroquaux > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mail at sebastianraschka.com Thu Nov 15 23:31:07 2018 From: mail at sebastianraschka.com (Sebastian Raschka) Date: Thu, 15 Nov 2018 22:31:07 -0600 Subject: [scikit-learn] make all new parameters keyword-only? In-Reply-To: References: <20181115013628.4C9B8464009D@webmail.sinamail.sina.com.cn> <58a12a42-2bde-0f12-fa33-f63876e14405@gmail.com> <20181115090155.gjgmj2ybyqhrzpbr@phare.normalesup.org> Message-ID: <0207E038-6790-48DB-BE99-21DCA4B2FC0C@sebastianraschka.com> Also want to say that I really welcome this decision/change. Personally, as far as I am aware, I've been trying to use keyword arguments consistently for years, except for cases where it is really obvious, like .fit(X_train, y_train), and I believe that it really helped me regarding writing less error-prone code/analyses. Thinking back to the times when I was using MATLAB, it was really clunky and error-prone to import functions and be careful about the argument order. Besides, keyword arguments definitely make code and documentation much more readable (within and esp.
across different package versions) despite (or maybe because) being more verbose. Best, Sebastian > On Nov 15, 2018, at 10:18 PM, Brown J.B. via scikit-learn wrote: > > As an end-user, I would strongly support the idea of future enforcement of keyword arguments for new parameters. > In my group, we hold a standard that we develop APIs where _all_ arguments must be given by keyword (slightly pedantic style, but has shown to have benefits). > Initialization/call-time state checks are done by a class' internal methods. > > As Andy said, one could consider leaving prototypical X,y as positional, but one benefit my group has seen with full keyword parameterization is the ability to write code for small investigations where we are more concerned with effects from parameters rather than the data (e.g., a fixed problem to model, and one wants to first see on the code line what the estimators and their parameterizations were). > If one could shift the sklearn X,y to the back of a function call, it would enable all participants in a face-to-face code review session to quickly see the emphasis/context of the discussion and move to the conclusion faster. > > To satisfy keyword X,y as well, I would presume that the BaseEstimator would need to have a sanity check for error-raising default X,y values -- though does it not have many checks on X,y already? > > Not sure if everyone else agrees about keyword X and y, but just a thought for consideration. > > Kind regards, > J.B. > > 2018/11/15 (Thu) 18:34 Gael Varoquaux : > I am really in favor of the general idea: it is much better to use named > arguments for everybody (for readability, and to be less depend on > parameter ordering). > > However, I would maintain that we need to move slowly with backward > compatibility: changing in a backward-incompatible way a library brings > much more loss than benefit to our users.
> > So +1 for enforcing the change on all new arguments, but -1 for changing > orders in the existing arguments any time soon. > > I agree that it would be good to push this change in existing models. We > should probably announce it strongly well in advance, make sure that all > our examples are changed (people copy-paste), wait a lot, and find a > moment to squeeze this in. > > Gaël > > On Thu, Nov 15, 2018 at 06:12:35PM +1100, Joel Nothman wrote: > > We could just announce that we will be making this a syntactic constraint from > > version X and make the change wholesale then. It would be less formal backwards > > compatibility than we usually hold by, but we already are loose with parameter > > ordering when adding new ones. > > > It would be great if after this change we could then reorder parameters to make > > some sense! > > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > > -- > Gael Varoquaux > Senior Researcher, INRIA Parietal > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > Phone: ++ 33-1-69-08-79-68 > http://gael-varoquaux.info > http://twitter.com/GaelVaroquaux > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From t3kcit at gmail.com Sat Nov 17 22:47:15 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Sat, 17 Nov 2018 22:47:15 -0500 Subject: [scikit-learn] Next Sprint In-Reply-To: References: <20181115144120.nnpufumsosmpamov@phare.normalesup.org> <10d2f8f2-efbf-c72d-2d96-b1b585003a47@gmail.com> Message-ID: <621b1350-2112-e8b0-c7f1-cbc739f0262e@gmail.com> I mean it would be amazing to have you there.
Should we start nailing things down then? It's not that long until February. Looks like Hanmin can't make it. The NIPS sprint anniversary made me think maybe we should think about who else to invite. I have some funds we could use for paying for travel or anything else that might be useful. On 11/15/18 10:32 PM, Joel Nothman wrote: > > Ha! Well, it looks like I won't be teaching the NLP unit at my uni > next year (would usually occupy me March-July), so there is no > fundamental problem with disappearing in February, if I can get > babysitters, and my boss, on board. (Although I am trying to plan > another overseas trip for April, but that would be with kids...) > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sun Nov 18 05:07:17 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Sun, 18 Nov 2018 21:07:17 +1100 Subject: [scikit-learn] Next Sprint In-Reply-To: <621b1350-2112-e8b0-c7f1-cbc739f0262e@gmail.com> References: <20181115144120.nnpufumsosmpamov@phare.normalesup.org> <10d2f8f2-efbf-c72d-2d96-b1b585003a47@gmail.com> <621b1350-2112-e8b0-c7f1-cbc739f0262e@gmail.com> Message-ID: When in Feb would we be talking? I'll start mooting it with stakeholders :) I'm hopeful, but not overly optimistic, that it could work. I should also note that weekdays work better than weekends for me, as I keep away from computers from Friday evening through to Saturday night. On Sun, 18 Nov 2018 at 14:48, Andreas Mueller wrote: > I mean it would be amazing to have you there. > > Should we start nailing things down then? It's not that long until > February. > > Looks like Hanmin can't make it. The NIPS sprint anniversary made me think > maybe we should think about who > else to invite. 
I have some funds we could use for paying for travel or > anything else that might be useful. > > > On 11/15/18 10:32 PM, Joel Nothman wrote: > > > Ha! Well, it looks like I won't be teaching the NLP unit at my uni next > year (would usually occupy me March-July), so there is no fundamental > problem with disappearing in February, if I can get babysitters, and my > boss, on board. (Although I am trying to plan another overseas trip for > April, but that would be with kids...) > > _______________________________________________ > scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sun Nov 18 05:14:22 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Sun, 18 Nov 2018 21:14:22 +1100 Subject: [scikit-learn] make all new parameters keyword-only? In-Reply-To: <0207E038-6790-48DB-BE99-21DCA4B2FC0C@sebastianraschka.com> References: <20181115013628.4C9B8464009D@webmail.sinamail.sina.com.cn> <58a12a42-2bde-0f12-fa33-f63876e14405@gmail.com> <20181115090155.gjgmj2ybyqhrzpbr@phare.normalesup.org> <0207E038-6790-48DB-BE99-21DCA4B2FC0C@sebastianraschka.com> Message-ID: I think we're all agreed that this change would be a good thing. What we're not agreed on is how much risk we take by breaking legacy code that relied on argument order. I'd argue that we've often already broken such code, and that at least now it will break with a TypeError rather than silent misbehaviour. And yet Sebastian's comment implies that there may be a whole raft of former MATLAB users writing code without kwargs. Is that a problem if now they get a TypeError? 
On Fri, 16 Nov 2018 at 16:23, Sebastian Raschka wrote: > Also want to say that I really welcome this decision/change. Personally, > as far as I am aware, I've trying been using keyword arguments consistently > for years, except for cases where it is really obvious, like .fit(X_train, > y_train), and I believe that it really helped me regarding writing less > error-prone code/analyses. > > Thinking back of the times where I was using MATLAB, it was really clunky > and error-prone to import functions and being careful about the argument > order. > > Besides, keynote arguments definitely make code and documentation much > more readable (within and esp. across different package versions) despite > (or maybe because) being more verbose. > > Best, > Sebastian > > > > > On Nov 15, 2018, at 10:18 PM, Brown J.B. via scikit-learn < > scikit-learn at python.org> wrote: > > > > As an end-user, I would strongly support the idea of future enforcement > of keyword arguments for new parameters. > > In my group, we hold a standard that we develop APIs where _all_ > arguments must be given by keyword (slightly pedantic style, but has shown > to have benefits). > > Initialization/call-time state checks are done by a class' internal > methods. > > > > As Andy said, one could consider leaving prototypical X,y as positional, > but one benefit my group has seen with full keyword parameterization is the > ability to write code for small investigations where we are more concerned > with effects from parameters rather than the data (e.g., a fixed problem to > model, and one wants to first see on the code line what the estimators and > their parameterizations were). > > If one could shift the sklearn X,y to the back of a function call, it > would enable all participants in a face-to-face code review session to > quickly see the emphasis/context of the discussion and move to the > conclusion faster. 
> > > > To satisfy keyword X,y as well, I would presume that the BaseEstimator > would need to have a sanity check for error-raising default X,y values -- > though does it not have many checks on X,y already? > > > > Not sure if everyone else agrees about keyword X and y, but just a > thought for consideration. > > > > Kind regards, > > J.B. > > > > 2018/11/15 (Thu) 18:34 Gael Varoquaux : > > I am really in favor of the general idea: it is much better to use named > > arguments for everybody (for readability, and to be less depend on > > parameter ordering). > > > > However, I would maintain that we need to move slowly with backward > > compatibility: changing in a backward-incompatible way a library brings > > much more loss than benefit to our users. > > > > So +1 for enforcing the change on all new arguments, but -1 for changing > > orders in the existing arguments any time soon. > > > > I agree that it would be good to push this change in existing models. We > > should probably announce it strongly well in advance, make sure that all > > our examples are changed (people copy-paste), wait a lot, and find a > > moment to squeeze this in. > > > > Gaël > > > > On Thu, Nov 15, 2018 at 06:12:35PM +1100, Joel Nothman wrote: > > > We could just announce that we will be making this a syntactic > constraint from > > > version X and make the change wholesale then. It would be less formal > backwards > > > compatibility than we usually hold by, but we already are loose with > parameter > > > ordering when adding new ones. > > > > > It would be great if after this change we could then reorder > parameters to make > > > some sense!
> > > > > _______________________________________________ > > > scikit-learn mailing list > > > scikit-learn at python.org > > > https://mail.python.org/mailman/listinfo/scikit-learn > > > > > > -- > > Gael Varoquaux > > Senior Researcher, INRIA Parietal > > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France > > Phone: ++ 33-1-69-08-79-68 > > http://gael-varoquaux.info > http://twitter.com/GaelVaroquaux > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joel.nothman at gmail.com Sun Nov 18 17:15:34 2018 From: joel.nothman at gmail.com (Joel Nothman) Date: Mon, 19 Nov 2018 09:15:34 +1100 Subject: [scikit-learn] Next Sprint In-Reply-To: References: <20181115144120.nnpufumsosmpamov@phare.normalesup.org> <10d2f8f2-efbf-c72d-2d96-b1b585003a47@gmail.com> <621b1350-2112-e8b0-c7f1-cbc739f0262e@gmail.com> Message-ID: and which would be better to aspire to? Paris in Feb, or Austin in July? On Sun, 18 Nov 2018 at 21:07, Joel Nothman wrote: > When in Feb would we be talking? I'll start mooting it with stakeholders > :) I'm hopeful, but not overly optimistic, that it could work. > > I should also note that weekdays work better than weekends for me, as I > keep away from computers from Friday evening through to Saturday night. > > On Sun, 18 Nov 2018 at 14:48, Andreas Mueller wrote: > >> I mean it would be amazing to have you there. >> >> Should we start nailing things down then? It's not that long until >> February. 
>> >> Looks like Hanmin can't make it. The NIPS sprint anniversary made me >> think maybe we should think about who >> else to invite. I have some funds we could use for paying for travel or >> anything else that might be useful. >> >> >> On 11/15/18 10:32 PM, Joel Nothman wrote: >> >> >> Ha! Well, it looks like I won't be teaching the NLP unit at my uni next >> year (would usually occupy me March-July), so there is no fundamental >> problem with disappearing in February, if I can get babysitters, and my >> boss, on board. (Although I am trying to plan another overseas trip for >> April, but that would be with kids...) >> >> _______________________________________________ >> scikit-learn mailing listscikit-learn at python.orghttps://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From oliverrausch99 at gmail.com Mon Nov 19 09:50:17 2018 From: oliverrausch99 at gmail.com (Oliver Rausch) Date: Mon, 19 Nov 2018 15:50:17 +0100 Subject: [scikit-learn] Difference between CircleCI python2 and python3 Message-ID: <20181119155017.5a79bdc1@lydia> Hi all, I'm working on a PR [1] and the Circle python2 build [2] doesn't complete, while the python3 build does. Locally, building using "make html" also works. What is the difference between these two builds, and how could I locally debug the issue? 
Thanks, Oliver --- [1] https://github.com/scikit-learn/scikit-learn/pull/11682 [2] https://circleci.com/gh/scikit-learn/scikit-learn/38492 From qinhanmin2005 at sina.com Mon Nov 19 10:23:20 2018 From: qinhanmin2005 at sina.com (Hanmin Qin) Date: Mon, 19 Nov 2018 23:23:20 +0800 Subject: [scikit-learn] Difference between CircleCI python2 and python3 Message-ID: <20181119152320.D77FE464009D@webmail.sinamail.sina.com.cn> Thanks for the great PR. We're using matplotlib 3.0.1 (latest version) in the python3 build and matplotlib 1.4.3 (minimal dependency) in the python2 build. I think color="tab:blue" is not supported by matplotlib 1.X, so maybe you can try to use some simple colors. Hanmin Qin ----- Original Message ----- From: Oliver Rausch To: scikit-learn at python.org Subject: [scikit-learn] Difference between CircleCI python2 and python3 Date: 2018-11-19 22:52 Hi all, I'm working on a PR [1] and the Circle python2 build [2] doesn't complete, while the python3 build does. Locally, building using "make html" also works. What is the difference between these two builds, and how could I locally debug the issue? Thanks, Oliver --- [1] https://github.com/scikit-learn/scikit-learn/pull/11682 [2] https://circleci.com/gh/scikit-learn/scikit-learn/38492 _______________________________________________ scikit-learn mailing list scikit-learn at python.org https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... URL: From oliverrausch99 at gmail.com Mon Nov 19 11:40:58 2018 From: oliverrausch99 at gmail.com (Oliver Rausch) Date: Mon, 19 Nov 2018 17:40:58 +0100 Subject: [scikit-learn] Difference between CircleCI python2 and python3 In-Reply-To: <20181119152320.D77FE464009D@webmail.sinamail.sina.com.cn> References: <20181119152320.D77FE464009D@webmail.sinamail.sina.com.cn> Message-ID: <20181119174058.108c9fc4@lydia> Yes, that was the issue, Thanks.
Oliver On Mon, 19 Nov 2018 23:23:20 +0800 "Hanmin Qin" wrote: > Thanks for the great PR.We're using matplotlib 3.0.1 (latest version) > in python3 build and matplotlib 1.4.3 (minimal dependency) in python2 > build.I think color="tab:blue" is not supported by matplotlib 1.X, so > maybe you can try to use some simple colors. Hanmin Qin ----- > Original Message ----- From: Oliver Rausch > To: scikit-learn at python.org Subject: [scikit-learn] Difference > between CircleCI python2 and python3 Date: 2018-11-19 22:52 > > > Hi all, > I'm working on a PR [1] and the Circle python2 build [2] doesn't > complete, while the python3 build does. Locally, building using "make > html" also works. What is the difference between these two builds, and > how could I locally debug the issue? > Thanks, > Oliver > --- > [1] https://github.com/scikit-learn/scikit-learn/pull/11682 > [2] https://circleci.com/gh/scikit-learn/scikit-learn/38492 > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From t3kcit at gmail.com Mon Nov 19 19:16:03 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 19 Nov 2018 19:16:03 -0500 Subject: [scikit-learn] Next Sprint In-Reply-To: References: <20181115144120.nnpufumsosmpamov@phare.normalesup.org> <10d2f8f2-efbf-c72d-2d96-b1b585003a47@gmail.com> <621b1350-2112-e8b0-c7f1-cbc739f0262e@gmail.com> Message-ID: <7f141338-1d9f-e516-3d7e-cb8232a0720f@gmail.com> Paris will have more scikit-learn people, Austin will have numpy, scipy, matplotlib and Jupyter people as well. On 11/18/18 5:15 PM, Joel Nothman wrote: > and which would be better to aspire to? Paris in Feb, or Austin in July? > > On Sun, 18 Nov 2018 at 21:07, Joel Nothman > wrote: > > When in Feb would we be talking? I'll start mooting it with > stakeholders :) I'm hopeful, but not overly optimistic, that it > could work. 
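[Editor's note: the failure above comes from "tab:" (Tableau palette) color names, which to my knowledge were introduced in matplotlib 2.0 and so are unavailable in the 1.4.3 minimal dependency. One hedged way to keep an example working on both builds is to fall back to a basic single-letter color on old versions; the helper below is a sketch, not code from the PR.]

```python
# Sketch: choose a color spec that works on both the latest matplotlib
# and the 1.4.3 minimal dependency. "tab:blue" is assumed to require
# matplotlib >= 2.0; "b" (basic blue) is supported everywhere.

def pick_blue(mpl_version):
    """Return "tab:blue" when the installed matplotlib supports it."""
    major = int(mpl_version.split(".")[0])
    return "tab:blue" if major >= 2 else "b"

print(pick_blue("3.0.1"))  # tab:blue
print(pick_blue("1.4.3"))  # b
```

In practice, Hanmin's simpler suggestion — using a basic color such as "b" unconditionally — avoids the version check entirely.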
> > I should also note that weekdays work better than weekends for me, > as I keep away from computers from Friday evening through to > Saturday night. > > On Sun, 18 Nov 2018 at 14:48, Andreas Mueller > wrote: > > I mean it would be amazing to have you there. > > Should we start nailing things down then? It's not that long > until February. > > Looks like Hanmin can't make it. The NIPS sprint anniversary > made me think maybe we should think about who > else to invite. I have some funds we could use for paying for > travel or anything else that might be useful. > > > On 11/15/18 10:32 PM, Joel Nothman wrote: >> >> Ha! Well, it looks like I won't be teaching the NLP unit at >> my uni next year (would usually occupy me March-July), so >> there is no fundamental problem with disappearing in >> February, if I can get babysitters, and my boss, on board. >> (Although I am trying to plan another overseas trip for >> April, but that would be with kids...) >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From olivier.grisel at ensta.org Tue Nov 20 14:15:07 2018 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Tue, 20 Nov 2018 20:15:07 +0100 Subject: [scikit-learn] Next Sprint In-Reply-To: <7f141338-1d9f-e516-3d7e-cb8232a0720f@gmail.com> References: <20181115144120.nnpufumsosmpamov@phare.normalesup.org> <10d2f8f2-efbf-c72d-2d96-b1b585003a47@gmail.com> <621b1350-2112-e8b0-c7f1-cbc739f0262e@gmail.com> <7f141338-1d9f-e516-3d7e-cb8232a0720f@gmail.com> Message-ID: We can also do Paris in April / May or June if that's ok with Joel and better for Andreas. I am teaching on Fridays from end of January to March. But I can miss half a day of sprint to teach my class. -- Olivier -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Tue Nov 20 14:25:19 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Tue, 20 Nov 2018 20:25:19 +0100 Subject: [scikit-learn] Next Sprint In-Reply-To: References: <20181115144120.nnpufumsosmpamov@phare.normalesup.org> <10d2f8f2-efbf-c72d-2d96-b1b585003a47@gmail.com> <621b1350-2112-e8b0-c7f1-cbc739f0262e@gmail.com> <7f141338-1d9f-e516-3d7e-cb8232a0720f@gmail.com> Message-ID: <20181120192519.gbagzrvzzqljglme@phare.normalesup.org> On Tue, Nov 20, 2018 at 08:15:07PM +0100, Olivier Grisel wrote: > We can also do Paris in April / May or June if that's ok with Joel and better > for Andreas. Absolutely. My thoughts here are that I want to minimize transportation, partly because flying has a large carbon footprint. Also, for personal reasons, I am not sure that I will be able to make it to Austin in July, but I realize that this is a pretty bad argument. We're happy to try to host in Paris whenever it's most convenient and to try to help with travel for those not in Paris. 
Ga?l From jorisvandenbossche at gmail.com Tue Nov 20 15:07:27 2018 From: jorisvandenbossche at gmail.com (Joris Van den Bossche) Date: Tue, 20 Nov 2018 21:07:27 +0100 Subject: [scikit-learn] make all new parameters keyword-only? In-Reply-To: References: <20181115013628.4C9B8464009D@webmail.sinamail.sina.com.cn> <58a12a42-2bde-0f12-fa33-f63876e14405@gmail.com> <20181115090155.gjgmj2ybyqhrzpbr@phare.normalesup.org> <0207E038-6790-48DB-BE99-21DCA4B2FC0C@sebastianraschka.com> Message-ID: Op zo 18 nov. 2018 om 11:14 schreef Joel Nothman : > I think we're all agreed that this change would be a good thing. > > What we're not agreed on is how much risk we take by breaking legacy code > that relied on argument order. > I think that, in principle, it could be possible to do this with a deprecation warning. If we would do a signature like the following: class Model(BaseEstimator): def __init__(self, *args, param1=1, param2=2): then we could in principle catch all positional args, raise a warning if there are any, and by inspecting the signature (as we now also do in _get_param_names), we could set the appropriate parameters on self. I think the main problem is that this would temporarily "allow" that people also pass a keyword argument that conflicts with a positional argument without that it raises an error (as Python normally would do for you), but you still get the warning. And of course, it would violate the clean __init__ functions in scikit-learn that do no validation. I personally don't know how big the impact would be of simply doing it as breaking change, but if we think it might be potentially quite big, the above might be worth considering (otherwise I wouldn't go through the hassle). Joris > > I'd argue that we've often already broken such code, and that at least now > it will break with a TypeError rather than silent misbehaviour. > > And yet Sebastian's comment implies that there may be a whole raft of > former MATLAB users writing code without kwargs. 
Is that a problem if now > they get a TypeError? > > On Fri, 16 Nov 2018 at 16:23, Sebastian Raschka > wrote: > >> Also want to say that I really welcome this decision/change. Personally, >> as far as I am aware, I've trying been using keyword arguments consistently >> for years, except for cases where it is really obvious, like .fit(X_train, >> y_train), and I believe that it really helped me regarding writing less >> error-prone code/analyses. >> >> Thinking back of the times where I was using MATLAB, it was really clunky >> and error-prone to import functions and being careful about the argument >> order. >> >> Besides, keynote arguments definitely make code and documentation much >> more readable (within and esp. across different package versions) despite >> (or maybe because) being more verbose. >> >> Best, >> Sebastian >> >> >> >> > On Nov 15, 2018, at 10:18 PM, Brown J.B. via scikit-learn < >> scikit-learn at python.org> wrote: >> > >> > As an end-user, I would strongly support the idea of future enforcement >> of keyword arguments for new parameters. >> > In my group, we hold a standard that we develop APIs where _all_ >> arguments must be given by keyword (slightly pedantic style, but has shown >> to have benefits). >> > Initialization/call-time state checks are done by a class' internal >> methods. >> > >> > As Andy said, one could consider leaving prototypical X,y as >> positional, but one benefit my group has seen with full keyword >> parameterization is the ability to write code for small investigations >> where we are more concerned with effects from parameters rather than the >> data (e.g., a fixed problem to model, and one wants to first see on the >> code line what the estimators and their parameterizations were). 
>> > If one could shift the sklearn X,y to the back of a function call, it >> would enable all participants in a face-to-face code review session to >> quickly see the emphasis/context of the discussion and move to the >> conclusion faster. >> > >> > To satisfy keyword X,y as well, I would presume that the BaseEstimator >> would need to have a sanity check for error-raising default X,y values -- >> though does it not have many checks on X,y already? >> > >> > Not sure if everyone else agrees about keyword X and y, but just a >> thought for consideration. >> > >> > Kind regards, >> > J.B. >> > >> > 2018?11?15?(?) 18:34 Gael Varoquaux : >> > I am really in favor of the general idea: it is much better to use named >> > arguments for everybody (for readability, and to be less depend on >> > parameter ordering). >> > >> > However, I would maintain that we need to move slowly with backward >> > compatibility: changing in a backward-incompatible way a library brings >> > much more loss than benefit to our users. >> > >> > So +1 for enforcing the change on all new arguments, but -1 for changing >> > orders in the existing arguments any time soon. >> > >> > I agree that it would be good to push this change in existing models. We >> > should probably announce it strongly well in advance, make sure that all >> > our examples are changed (people copy-paste), wait a lot, and find a >> > moment to squeeze this in. >> > >> > Ga?l >> > >> > On Thu, Nov 15, 2018 at 06:12:35PM +1100, Joel Nothman wrote: >> > > We could just announce that we will be making this a syntactic >> constraint from >> > > version X and make the change wholesale then. It would be less formal >> backwards >> > > compatibility than we usually hold by, but we already are loose with >> parameter >> > > ordering when adding new ones. >> > >> > > It would be great if after this change we could then reorder >> parameters to make >> > > some sense! 
>> > >> > > _______________________________________________ >> > > scikit-learn mailing list >> > > scikit-learn at python.org >> > > https://mail.python.org/mailman/listinfo/scikit-learn >> > >> > >> > -- >> > Gael Varoquaux >> > Senior Researcher, INRIA Parietal >> > NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France >> > Phone: ++ 33-1-69-08-79-68 >> > http://gael-varoquaux.info >> http://twitter.com/GaelVaroquaux >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ >> > scikit-learn mailing list >> > scikit-learn at python.org >> > https://mail.python.org/mailman/listinfo/scikit-learn >> >> _______________________________________________ >> scikit-learn mailing list >> scikit-learn at python.org >> https://mail.python.org/mailman/listinfo/scikit-learn >> > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gael.varoquaux at normalesup.org Tue Nov 20 15:58:18 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Tue, 20 Nov 2018 21:58:18 +0100 Subject: [scikit-learn] ANN Dirty_cat: learning on dirty categories Message-ID: <20181120205818.vgm5fses2nprgvnl@phare.normalesup.org> Hi scikit-learn friends, As you might have seen on twitter, my lab -with a few friends- has embarked on research to ease machine on "dirty data". We are experimenting on new encoding methods for non-curated string categories. For this, we are developing a small software project called "dirty_cat": https://dirty-cat.github.io/stable/ dirty_cat is a test bed for new ideas of "dirty categories". It is a research project, though we still try to do decent software engineering :). 
Rather than contributing to existing codebases (as the great categorical-encoding project in scikit-learn-contrib), we spanned it out in a separate software project to have the freedom to try out ideas that we might give up after gaining insight. We hope that it is a useful tool: if you have non-curated string categories, please give it a try. Understanding what works and what does not is important to know what to consolidate. Hopefully one day we can develop a tool that is of wide-enough interest that it can go in scikit-learn-contrib, or maybe even scikit-learn. Also, if you have suggestions of publicly available databases that we try it upon, we would love to hear from you. Cheers, Ga?l PS: if you want to work on dirty-data problems in Paris as a post-doc or an engineer, send me a line From t3kcit at gmail.com Tue Nov 20 16:06:30 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 20 Nov 2018 16:06:30 -0500 Subject: [scikit-learn] ANN Dirty_cat: learning on dirty categories In-Reply-To: <20181120205818.vgm5fses2nprgvnl@phare.normalesup.org> References: <20181120205818.vgm5fses2nprgvnl@phare.normalesup.org> Message-ID: I would love to see the TargetEncoder ported to scikit-learn. The CountFeaturizer is pretty stalled: https://github.com/scikit-learn/scikit-learn/pull/9614 :-/ Have you benchmarked the other encoders in the category_encoding lib? I would be really curious to know when/how they help. On 11/20/18 3:58 PM, Gael Varoquaux wrote: > Hi scikit-learn friends, > > As you might have seen on twitter, my lab -with a few friends- has > embarked on research to ease machine on "dirty data". We are > experimenting on new encoding methods for non-curated string categories. > For this, we are developing a small software project called "dirty_cat": > https://dirty-cat.github.io/stable/ > > dirty_cat is a test bed for new ideas of "dirty categories". It is a > research project, though we still try to do decent software engineering > :). 
Rather than contributing to existing codebases (as the great > categorical-encoding project in scikit-learn-contrib), we spanned it out > in a separate software project to have the freedom to try out ideas that > we might give up after gaining insight. > > We hope that it is a useful tool: if you have non-curated string > categories, please give it a try. Understanding what works and what does > not is important to know what to consolidate. Hopefully one day we can > develop a tool that is of wide-enough interest that it can go in > scikit-learn-contrib, or maybe even scikit-learn. > > Also, if you have suggestions of publicly available databases that we try > it upon, we would love to hear from you. > > Cheers, > > Ga?l > > PS: if you want to work on dirty-data problems in Paris as a post-doc or > an engineer, send me a line > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn From gael.varoquaux at normalesup.org Tue Nov 20 16:16:06 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Tue, 20 Nov 2018 22:16:06 +0100 Subject: [scikit-learn] ANN Dirty_cat: learning on dirty categories In-Reply-To: References: <20181120205818.vgm5fses2nprgvnl@phare.normalesup.org> Message-ID: <20181120211606.upltvviobudlurxe@phare.normalesup.org> On Tue, Nov 20, 2018 at 04:06:30PM -0500, Andreas Mueller wrote: > I would love to see the TargetEncoder ported to scikit-learn. > The CountFeaturizer is pretty stalled: > https://github.com/scikit-learn/scikit-learn/pull/9614 So would I. 
But there are several ways of doing it: - the naive way is not the right one: just computing the average of y for each category leads to overfitting quite fast - it can be done cross-validated, splitting the train data, in a "cross-fit" strategy (see https://github.com/dirty-cat/dirty_cat/issues/53) - it can be done using empirical-Bayes shrinkage, which is what we currently do in dirty_cat. We are planning to do heavy benchmarking of those strategies, to figure out tradeoff. But we won't get to it before February, I am afraid. > Have you benchmarked the other encoders in the category_encoding lib? > I would be really curious to know when/how they help. We did (part of the results are in the publication), and we didn't have great success. Ga?l > On 11/20/18 3:58 PM, Gael Varoquaux wrote: > > Hi scikit-learn friends, > > As you might have seen on twitter, my lab -with a few friends- has > > embarked on research to ease machine on "dirty data". We are > > experimenting on new encoding methods for non-curated string categories. > > For this, we are developing a small software project called "dirty_cat": > > https://dirty-cat.github.io/stable/ > > dirty_cat is a test bed for new ideas of "dirty categories". It is a > > research project, though we still try to do decent software engineering > > :). Rather than contributing to existing codebases (as the great > > categorical-encoding project in scikit-learn-contrib), we spanned it out > > in a separate software project to have the freedom to try out ideas that > > we might give up after gaining insight. > > We hope that it is a useful tool: if you have non-curated string > > categories, please give it a try. Understanding what works and what does > > not is important to know what to consolidate. Hopefully one day we can > > develop a tool that is of wide-enough interest that it can go in > > scikit-learn-contrib, or maybe even scikit-learn. 
> > Also, if you have suggestions of publicly available databases that we try > > it upon, we would love to hear from you. > > Cheers, > > Ga?l > > PS: if you want to work on dirty-data problems in Paris as a post-doc or > > an engineer, send me a line > > _______________________________________________ > > scikit-learn mailing list > > scikit-learn at python.org > > https://mail.python.org/mailman/listinfo/scikit-learn > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn -- Gael Varoquaux Senior Researcher, INRIA Parietal NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France Phone: ++ 33-1-69-08-79-68 http://gael-varoquaux.info http://twitter.com/GaelVaroquaux From t3kcit at gmail.com Tue Nov 20 16:35:43 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 20 Nov 2018 16:35:43 -0500 Subject: [scikit-learn] ANN Dirty_cat: learning on dirty categories In-Reply-To: <20181120211606.upltvviobudlurxe@phare.normalesup.org> References: <20181120205818.vgm5fses2nprgvnl@phare.normalesup.org> <20181120211606.upltvviobudlurxe@phare.normalesup.org> Message-ID: <652a4474-c10c-0df9-e314-e16a415b59b8@gmail.com> On 11/20/18 4:16 PM, Gael Varoquaux wrote: > - the naive way is not the right one: just computing the average of y > for each category leads to overfitting quite fast > > - it can be done cross-validated, splitting the train data, in a > "cross-fit" strategy (seehttps://github.com/dirty-cat/dirty_cat/issues/53) This is called leave-one-out in the category_encoding library, I think, and that's what my first implementation would be. > > - it can be done using empirical-Bayes shrinkage, which is what we > currently do in dirty_cat. Reference / explanation? > > We are planning to do heavy benchmarking of those strategies, to figure > out tradeoff. But we won't get to it before February, I am afraid. 
aww ;) From gael.varoquaux at normalesup.org Tue Nov 20 16:43:37 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Tue, 20 Nov 2018 22:43:37 +0100 Subject: [scikit-learn] ANN Dirty_cat: learning on dirty categories In-Reply-To: <652a4474-c10c-0df9-e314-e16a415b59b8@gmail.com> References: <20181120205818.vgm5fses2nprgvnl@phare.normalesup.org> <20181120211606.upltvviobudlurxe@phare.normalesup.org> <652a4474-c10c-0df9-e314-e16a415b59b8@gmail.com> Message-ID: <20181120214337.7unwskh7wtei4kj5@phare.normalesup.org> On Tue, Nov 20, 2018 at 04:35:43PM -0500, Andreas Mueller wrote: > > - it can be done cross-validated, splitting the train data, in a > > "cross-fit" strategy (seehttps://github.com/dirty-cat/dirty_cat/issues/53) > This is called leave-one-out in the category_encoding library, I think, > and that's what my first implementation would be. > > - it can be done using empirical-Bayes shrinkage, which is what we > > currently do in dirty_cat. > Reference / explanation? I think that a good reference is the prior art part of our paper: https://arxiv.org/abs/1806.00979 But we found the following reference helpful Micci-Barreca, D.: A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter 3(1), 27?32 (2001) > > We are planning to do heavy benchmarking of those strategies, to figure > > out tradeoff. But we won't get to it before February, I am afraid. > aww ;) Yeah. I do slow science. Slow everything, actually :(. Ga?l From olivier.grisel at ensta.org Tue Nov 20 16:46:55 2018 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Tue, 20 Nov 2018 22:46:55 +0100 Subject: [scikit-learn] make all new parameters keyword-only? 
In-Reply-To: References: <20181115013628.4C9B8464009D@webmail.sinamail.sina.com.cn> <58a12a42-2bde-0f12-fa33-f63876e14405@gmail.com> <20181115090155.gjgmj2ybyqhrzpbr@phare.normalesup.org> <0207E038-6790-48DB-BE99-21DCA4B2FC0C@sebastianraschka.com> Message-ID: +1 on the ideal in general (and to enforce this on new classes / params). +1 to be conservative and not break existing code. On Tue, 20 Nov 2018 at 21:09, Joris Van den Bossche < jorisvandenbossche at gmail.com> wrote: > On Sun, 18 Nov 2018 at 11:14, Joel Nothman wrote: > >> I think we're all agreed that this change would be a good thing. >> >> What we're not agreed on is how much risk we take by breaking legacy code >> that relied on argument order. >> > > I think that, in principle, it could be possible to do this with a > deprecation warning. If we would do a signature like the following: > > class Model(BaseEstimator): > def __init__(self, *args, param1=1, param2=2): > > then we could in principle catch all positional args, raise a warning if > there are any, and by inspecting the signature (as we now also do in > _get_param_names), we could set the appropriate parameters on self. > I think the main problem is that this would temporarily "allow" that > people also pass a keyword argument that conflicts with a positional > argument without that it raises an error (as Python normally would do for > you), but you still get the warning. > And of course, it would violate the clean __init__ functions in > scikit-learn that do no validation. > > I personally don't know how big the impact would be of simply doing it as > breaking change, but if we think it might be potentially quite big, the > above might be worth considering (otherwise I wouldn't go through the > hassle). > That would render sphinx API doc and IDE prototype tooltips confusing though. But maybe if we use an explicit name such as: *deprecated_positional_args that would be fine enough.
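The *args-catching transition Joris describes can be sketched roughly as below — a hypothetical illustration (class and parameter names are made up, not a scikit-learn API), which also shows where the keyword/positional conflict he mentions would slip through silently:

```python
import inspect
import warnings

class Model:
    def __init__(self, *args, param1=1, param2=2):
        self.param1 = param1
        self.param2 = param2
        if args:
            warnings.warn(
                "Passing arguments positionally is deprecated; "
                "use keyword arguments instead.",
                FutureWarning,
            )
            # Recover parameter names from the signature (in the spirit
            # of _get_param_names) and map positional args onto them in
            # declaration order.  Note: a conflicting keyword argument
            # is silently overwritten here instead of raising TypeError.
            params = inspect.signature(type(self).__init__).parameters
            names = [n for n, p in params.items()
                     if p.kind is inspect.Parameter.KEYWORD_ONLY]
            for name, value in zip(names, args):
                setattr(self, name, value)
```

With this sketch, `Model(10)` warns and sets `param1=10`, while `Model(param2=5)` stays silent.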
-- Olivier http://twitter.com/ogrisel - http://github.com/ogrisel -------------- next part -------------- An HTML attachment was scrubbed... URL: From t3kcit at gmail.com Tue Nov 20 21:58:49 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 20 Nov 2018 21:58:49 -0500 Subject: [scikit-learn] ANN Dirty_cat: learning on dirty categories In-Reply-To: <20181120214337.7unwskh7wtei4kj5@phare.normalesup.org> References: <20181120205818.vgm5fses2nprgvnl@phare.normalesup.org> <20181120211606.upltvviobudlurxe@phare.normalesup.org> <652a4474-c10c-0df9-e314-e16a415b59b8@gmail.com> <20181120214337.7unwskh7wtei4kj5@phare.normalesup.org> Message-ID: <4c5189a8-4beb-933f-1582-29c964c1cec4@gmail.com> On 11/20/18 4:43 PM, Gael Varoquaux wrote: > We are planning to do heavy benchmarking of those strategies, to figure > out tradeoff. But we won't get to it before February, I am afraid. Does that mean you'd be opposed to adding the leave-one-out TargetEncoder before you do this? I would really like to add it before February and it's pretty established. From gael.varoquaux at normalesup.org Wed Nov 21 00:38:18 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Wed, 21 Nov 2018 06:38:18 +0100 Subject: [scikit-learn] ANN Dirty_cat: learning on dirty categories In-Reply-To: <4c5189a8-4beb-933f-1582-29c964c1cec4@gmail.com> References: <20181120205818.vgm5fses2nprgvnl@phare.normalesup.org> <20181120211606.upltvviobudlurxe@phare.normalesup.org> <652a4474-c10c-0df9-e314-e16a415b59b8@gmail.com> <20181120214337.7unwskh7wtei4kj5@phare.normalesup.org> <4c5189a8-4beb-933f-1582-29c964c1cec4@gmail.com> Message-ID: <20181121053818.zwjmj6zgwharwpgp@phare.normalesup.org> On Tue, Nov 20, 2018 at 09:58:49PM -0500, Andreas Mueller wrote: > On 11/20/18 4:43 PM, Gael Varoquaux wrote: > > We are planning to do heavy benchmarking of those strategies, to figure > > out tradeoff. But we won't get to it before February, I am afraid. 
> Does that mean you'd be opposed to adding the leave-one-out TargetEncoder I'd rather not. Or rather, I'd rather have some benchmarks on it (it doesn't have to be us that does it). > I would really like to add it before February A few months to get it right is not that bad, is it? > and it's pretty established. Are there good references studying it? If there is a clear track of study, it falls in the usual rules, and should go in. Gaël From t3kcit at gmail.com Wed Nov 21 09:47:13 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 21 Nov 2018 09:47:13 -0500 Subject: [scikit-learn] ANN Dirty_cat: learning on dirty categories In-Reply-To: <20181121053818.zwjmj6zgwharwpgp@phare.normalesup.org> References: <20181120205818.vgm5fses2nprgvnl@phare.normalesup.org> <20181120211606.upltvviobudlurxe@phare.normalesup.org> <652a4474-c10c-0df9-e314-e16a415b59b8@gmail.com> <20181120214337.7unwskh7wtei4kj5@phare.normalesup.org> <4c5189a8-4beb-933f-1582-29c964c1cec4@gmail.com> Message-ID: On 11/21/18 12:38 AM, Gael Varoquaux wrote: > On Tue, Nov 20, 2018 at 09:58:49PM -0500, Andreas Mueller wrote: > >> On 11/20/18 4:43 PM, Gael Varoquaux wrote: >>> We are planning to do heavy benchmarking of those strategies, to figure >>> out tradeoff. But we won't get to it before February, I am afraid. >> Does that mean you'd be opposed to adding the leave-one-out TargetEncoder > I'd rather not. Or rather, I'd rather have some benchmarks on it (it > doesn't have to be us that does it). > >> I would really like to add it before February > A few months to get it right is not that bad, is it? The PR is over a year old already, and you hadn't voiced any opposition there.
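The "empirical-Bayes shrinkage" variant discussed in this thread, in the spirit of the Micci-Barreca (2001) reference cited earlier, can be sketched as follows. This is an illustrative function with a made-up name and a hand-picked smoothing weight m, not the dirty_cat or proposed scikit-learn implementation:

```python
from collections import defaultdict

def target_encode(categories, y, m=10.0):
    """Shrinkage-based target encoding (Micci-Barreca-style sketch).

    Each category's encoding blends its in-category target mean with the
    global mean, weighted by the category count n via lam = n / (n + m).
    Rare categories shrink toward the prior, which limits the overfitting
    of the naive per-category mean.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, y):
        sums[c] += t
        counts[c] += 1
    prior = sum(y) / len(y)  # global target mean
    encoding = {}
    for c, n in counts.items():
        lam = n / (n + m)
        encoding[c] = lam * (sums[c] / n) + (1 - lam) * prior
    return encoding
```

The naive per-category mean is recovered as m goes to 0; the leave-one-out and cross-fitting variants discussed above differ in how they keep a sample's own target out of its encoding.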
From gael.varoquaux at normalesup.org Wed Nov 21 10:34:24 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Wed, 21 Nov 2018 16:34:24 +0100 Subject: [scikit-learn] ANN Dirty_cat: learning on dirty categories In-Reply-To: References: <20181120205818.vgm5fses2nprgvnl@phare.normalesup.org> <20181120211606.upltvviobudlurxe@phare.normalesup.org> <652a4474-c10c-0df9-e314-e16a415b59b8@gmail.com> <20181120214337.7unwskh7wtei4kj5@phare.normalesup.org> <4c5189a8-4beb-933f-1582-29c964c1cec4@gmail.com> <20181121053818.zwjmj6zgwharwpgp@phare.normalesup.org> Message-ID: <20181121153424.i3b7orguqhm243el@phare.normalesup.org> On Wed, Nov 21, 2018 at 09:47:13AM -0500, Andreas Mueller wrote: > The PR is over a year old already, and you hadn't voiced any opposition > there. My bad, sorry. Given the name, I had not guessed the link between the PR and encoding of categorical features. I find myself very much in agreement with the original issue and its discussion: https://github.com/scikit-learn/scikit-learn/issues/5853 concerns about the name and importance of at least considering prior smoothing. I do not see these reflected in the PR. In general, the fact that there is not much literature on this implies that we should be benchmarking our choices. The more I understand kaggle, the less I think that we can fully use it as an inclusion argument: people do transforms that end up to be very specific to one challenge. On the specific problem of categorical encoding, we've tried to do systematic analysis of some of these, and were not very successful empirically (eg hashing encoding). This is not at all a vote against target encoding, which our benchmarks showed was very useful, but just a push for benchmarking PRs, in particular when they do not correspond to well cited work (which is our standard inclusion criterion). Joris has just accepted to help with benchmarking. We can have preliminary results earlier. 
The question really is: out of the different variants that exist, which one should we choose. I think that it is a legitimate question that arises on many of our PRs. But in general, I don't think that we should rush things because of deadlines. Consequences of a rush are that we need to change things after merge, which is more work. I know that it is slow, but we are quite a central package. Ga?l From t3kcit at gmail.com Wed Nov 21 11:35:11 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Wed, 21 Nov 2018 11:35:11 -0500 Subject: [scikit-learn] ANN Dirty_cat: learning on dirty categories In-Reply-To: <20181121153424.i3b7orguqhm243el@phare.normalesup.org> References: <20181120205818.vgm5fses2nprgvnl@phare.normalesup.org> <20181120211606.upltvviobudlurxe@phare.normalesup.org> <652a4474-c10c-0df9-e314-e16a415b59b8@gmail.com> <20181120214337.7unwskh7wtei4kj5@phare.normalesup.org> <4c5189a8-4beb-933f-1582-29c964c1cec4@gmail.com> <20181121053818.zwjmj6zgwharwpgp@phare.normalesup.org> <20181121153424.i3b7orguqhm243el@phare.normalesup.org> Message-ID: <52d96d5f-be24-20b0-707d-4e13b1494f38@gmail.com> On 11/21/18 10:34 AM, Gael Varoquaux wrote: > Joris has just accepted to help with benchmarking. We can have > preliminary results earlier. The question really is: out of the different > variants that exist, which one should we choose. I think that it is a > legitimate question that arises on many of our PRs. Thanks Joris! I could also ask Jan to help ;) The question for this particular issue for me is also "what are good benchmark datasets". It's a somewhat different task than what you're benchmarking with dirty cat, right? In dirty cat you used dirty categories, which is a subset of all high-cardinality categorical variables. 
Whether "clean" high cardinality variables like zip-codes or dirty ones are the better benchmark is a bit unclear to me, and I'm not aware of a wealth of datasets for either :-/ > > But in general, I don't think that we should rush things because of > deadlines. Consequences of a rush are that we need to change things after > merge, which is more work. I know that it is slow, but we are quite a > central package. I agree. From gael.varoquaux at normalesup.org Fri Nov 23 03:47:11 2018 From: gael.varoquaux at normalesup.org (Gael Varoquaux) Date: Fri, 23 Nov 2018 09:47:11 +0100 Subject: [scikit-learn] benchmarking TargetEncoder Was: ANN Dirty_cat: learning on dirty categories In-Reply-To: <52d96d5f-be24-20b0-707d-4e13b1494f38@gmail.com> References: <20181120205818.vgm5fses2nprgvnl@phare.normalesup.org> <20181120211606.upltvviobudlurxe@phare.normalesup.org> <652a4474-c10c-0df9-e314-e16a415b59b8@gmail.com> <20181120214337.7unwskh7wtei4kj5@phare.normalesup.org> <4c5189a8-4beb-933f-1582-29c964c1cec4@gmail.com> <20181121053818.zwjmj6zgwharwpgp@phare.normalesup.org> <20181121153424.i3b7orguqhm243el@phare.normalesup.org> <52d96d5f-be24-20b0-707d-4e13b1494f38@gmail.com> Message-ID: <20181123084711.l22vhrbwikr5hamh@phare.normalesup.org> On Wed, Nov 21, 2018 at 11:35:11AM -0500, Andreas Mueller wrote: > The question for this particular issue for me is also "what are good > benchmark datasets". > In dirty cat you used dirty categories, which is a subset of all > high-cardinality categorical > variables. > Whether "clean" high cardinality variables like zip-codes or dirty ones are > the better > benchmark is a bit unclear to me, and I'm not aware of a wealth of datasets > for either :-/ Fair point. We'll have a look to see what we can find. We're open to suggestions, from you or from anyone else. 
G From olivier.grisel at ensta.org Fri Nov 23 17:12:23 2018 From: olivier.grisel at ensta.org (Olivier Grisel) Date: Fri, 23 Nov 2018 23:12:23 +0100 Subject: [scikit-learn] benchmarking TargetEncoder Was: ANN Dirty_cat: learning on dirty categories In-Reply-To: <20181123084711.l22vhrbwikr5hamh@phare.normalesup.org> References: <20181120205818.vgm5fses2nprgvnl@phare.normalesup.org> <20181120211606.upltvviobudlurxe@phare.normalesup.org> <652a4474-c10c-0df9-e314-e16a415b59b8@gmail.com> <20181120214337.7unwskh7wtei4kj5@phare.normalesup.org> <4c5189a8-4beb-933f-1582-29c964c1cec4@gmail.com> <20181121053818.zwjmj6zgwharwpgp@phare.normalesup.org> <20181121153424.i3b7orguqhm243el@phare.normalesup.org> <52d96d5f-be24-20b0-707d-4e13b1494f38@gmail.com> <20181123084711.l22vhrbwikr5hamh@phare.normalesup.org> Message-ID: Maybe a subset of the criteo TB dataset? -------------- next part -------------- An HTML attachment was scrubbed... URL: From matthieu.brucher at gmail.com Sun Nov 25 04:59:04 2018 From: matthieu.brucher at gmail.com (Matthieu Brucher) Date: Sun, 25 Nov 2018 09:59:04 +0000 Subject: [scikit-learn] Recurrent questions about speed for TfidfVectorizer Message-ID: Hi all, I've noticed a few questions online (mainly SO) on TfidfVectorizer speed, and I was wondering about the global effort on speeding up sklearn. Is there something I can help on this topic (Cython?), as well as a discussion on this tough subject? Cheers, Matthieu -- Quantitative analyst, Ph.D. Blog: http://blog.audio-tk.com/ LinkedIn: http://www.linkedin.com/in/matthieubrucher -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rth.yurchak at pm.me Mon Nov 26 08:39:08 2018 From: rth.yurchak at pm.me (Roman Yurchak) Date: Mon, 26 Nov 2018 13:39:08 +0000 Subject: [scikit-learn] Recurrent questions about speed for TfidfVectorizer In-Reply-To: References: Message-ID: Hi Matthieu, if you are interested in general questions regarding improving scikit-learn performance, you might want to have a look at the draft roadmap https://github.com/scikit-learn/scikit-learn/wiki/Draft-Roadmap-2018 -- there are a lot of topics where suggestions / PRs on improving performance would be very welcome. For the particular case of TfidfVectorizer, it is a bit different from the rest of the scikit-learn code base in the sense that it's not limited by the performance of numerical calculation but rather that of string processing and counting. TfidfVectorizer is equivalent to CountVectorizer + TfidfTransformer, and the latter has only a marginal computational cost. As to CountVectorizer, last time I checked, its profiling was something along the lines of, - part regexp for tokenization (see token_pattern.findall) - part token counting (see CountVectorizer._count_vocab) - and a comparable part for all the rest Because of that, porting it to Cython is not that immediate, as one is still going to use CPython regexp and token counting in a dict. For instance, HashingVectorizer implements token counting in Cython -- it's faster but not that much faster. Using C++ maps or some less common structures has been discussed in https://github.com/scikit-learn/scikit-learn/issues/2639 Currently, I think, there are ~3 main ways performance could be improved, 1. Optimize the current implementation while remaining in Python. Possible, but IMO it would require some effort, because there is not much low-hanging fruit left there. Though a new look would definitely be good. 2. Parallelize computations.
There was some earlier discussion about this in scikit-learn issues, but at present, the better way would probably be to add it in dask-ml (see https://github.com/dask/dask-ml/issues/5). HashingVectorizer is already supported. Someone would need to implement CountVectorizer. 3. Rewrite part of the implementation in a lower-level language (e.g. Cython). The question is how maintainable that would be, and whether the performance gains would be worth it. Now that Python 2 will be dropped, at least not having to deal with Py2/3 compatibility for strings in Cython might make things a bit easier. Though, if the processing is in Cython it might also make using custom tokenizers/analyzers more difficult. On a related topic, I have been experimenting with implementing part of this processing in Rust lately: https://github.com/rth/text-vectorize. So far it looks promising. Though, of course, it will remain a separate project because of language constraints in scikit-learn. In general if you have thoughts on things that can be improved, don't hesitate to open issues, -- Roman On 25/11/2018 10:59, Matthieu Brucher wrote: > Hi all, > > I've noticed a few questions online (mainly SO) on TfidfVectorizer > speed, and I was wondering about the global effort on speeding up sklearn. > Is there something I can help on this topic (Cython?), as well as a > discussion on this tough subject? > > Cheers, > > Matthieu > -- > Quantitative analyst, Ph.D. > Blog: http://blog.audio-tk.com/ > LinkedIn: http://www.linkedin.com/in/matthieubrucher From t3kcit at gmail.com Mon Nov 26 10:28:13 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Mon, 26 Nov 2018 10:28:13 -0500 Subject: [scikit-learn] Recurrent questions about speed for TfidfVectorizer In-Reply-To: References: Message-ID: <46dd3561-a70c-ea18-282f-26d34b87cf06@gmail.com> I think tries might be an interesting datastructure, but it really depends on where the bottleneck is.
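To make the profiling breakdown Roman describes concrete, here is a toy, pure-Python sketch of the two hot spots he names -- regexp tokenization and dict-based token counting. It only mirrors the general shape of CountVectorizer._count_vocab (the regexp is CountVectorizer's documented default); it is not scikit-learn's actual code:

```python
# Toy sketch of the tokenize-and-count hot loop: regexp tokenization
# (token_pattern.findall) plus dict-based vocabulary building.
# Simplified illustration, not scikit-learn's _count_vocab.
import re
from collections import defaultdict

token_pattern = re.compile(r"(?u)\b\w\w+\b")  # CountVectorizer's default

def count_vocab(docs):
    vocabulary = {}  # token -> column index, grown on first sight
    counts = []      # one {column index: count} dict per document
    for doc in docs:
        doc_counts = defaultdict(int)
        for token in token_pattern.findall(doc.lower()):
            idx = vocabulary.setdefault(token, len(vocabulary))
            doc_counts[idx] += 1
        counts.append(dict(doc_counts))
    return vocabulary, counts

vocab, counts = count_vocab(["the cat sat", "the cat and the hat"])
print(sorted(vocab))  # -> ['and', 'cat', 'hat', 'sat', 'the']
```

Both hot spots stay in CPython (regexp engine plus dict operations), which is why a straight Cython port buys less than one might expect.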
I'm really surprised they are not used more, but maybe that's just because implementations are missing? On 11/26/18 8:39 AM, Roman Yurchak via scikit-learn wrote: > Hi Matthieu, > > if you are interested in general questions regarding improving > scikit-learn performance, you might be want to have a look at the draft > roadmap > https://github.com/scikit-learn/scikit-learn/wiki/Draft-Roadmap-2018 -- > there is a lot topics where suggestions / PRs on improving performance > would be very welcome. > > For the particular case of TfidfVectorizer, it is a bit different from > the rest of the scikit-learn code base in the sense that it's not > limited by the performance of numerical calculation but rather that of > string processing and counting. TfidfVectorizer is equivalent to > CountVectorizer + TfidfTransformer and the later has only a marginal > computational cost. As to CountVectorizer, last time I checked, its > profiling was something along the lines of, > - part regexp for tokenization (see token_pattern.findall) > - part token counting (see CountVectorizer._count_vocab) > - and a comparable part for all the rest > > Because of that, porting it to Cython is not that immediate, as one is > still going to use CPython regexp and token counting in a dict. For > instance, HashingVectorizer implements token counting in Cython -- it's > faster but not that much faster. Using C++ maps or some less common > structures have been discussed in > https://github.com/scikit-learn/scikit-learn/issues/2639 > > Currently, I think, there are ~3 main ways performance could be improved, > 1. Optimize the current implementation while remaining in Python. > Possible but IMO would require some effort, because there are not much > low hanging fruits left there. Though a new look would definitely be good. > > 2. Parallelize computations. 
There was some earlier discussion about > this in scikit-learn issues, but at present, the better way would > probably be to add it in dask-ml (see > https://github.com/dask/dask-ml/issues/5). HashingVectorizer is already > supported. Someone would need to implement CountVectorizer. > > 3. Rewrite part of the implementation in a lower level language (e.g. > Cython). The question is how maintainable that would be, and whether the > performance gains would be worth it. Now that Python 2 will be dropped, > at least not having to deal with Py2/3 compatibility for strings in > Cython might make things a bit easier. Though, if the processing is in > Cython it might also make using custom tokenizers/analyzers more difficult. > > On a related topic, I have been experimenting with implementing part > of this processing in Rust lately: > https://github.com/rth/text-vectorize. So far it looks promising. > Though, of course, it will remain a separate project because of language > constraints in scikit-learn. > > In general if you have thoughts on things that can be improved, don't > hesitate to open issues, From ea_azzoug at esi.dz Mon Nov 26 11:21:25 2018 From: ea_azzoug at esi.dz (AZZOUG Aghiles) Date: Mon, 26 Nov 2018 17:21:25 +0100 Subject: [scikit-learn] Contrib: Artificial Immune Recognition System Message-ID: Hello devs, I'm a final-year computer engineering student, currently doing my master's and engineering degree in recommender systems. Last summer, after an optimization course, I found a quite interesting recognition algorithm called Artificial Immune Recognition System (described in the paper below), and I was wondering if its implementation would be interesting for the scikit-learn library. I wrote a first version of it, which is available on my GitHub page ( https://github.com/AghilesAzzoug/Artificial-Immune-System ); the code only works on the Iris dataset (since it was only a test). I'm happy to get any suggestions or critiques from the community.
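For contributions like this, scikit-learn's "rolling your own estimator" guide expects a fit/predict interface with get_params/set_params. As a dependency-free illustration of those conventions only (a toy majority-class classifier, not AIRS; a real contribution would subclass sklearn.base.BaseEstimator and the appropriate mixin):

```python
# Toy estimator mirroring scikit-learn's interface conventions.
# Illustration only -- not AIRS, and not subclassing BaseEstimator.
from collections import Counter

class MajorityClassifier:
    def __init__(self, tie_break="first"):
        # __init__ only stores hyper-parameters; no validation, no state.
        self.tie_break = tie_break

    def get_params(self, deep=True):
        return {"tie_break": self.tie_break}

    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self

    def fit(self, X, y):
        # Attributes learned from data carry a trailing underscore.
        self.classes_ = [cls for cls, _ in Counter(y).most_common()]
        self.majority_ = self.classes_[0]
        return self  # fit must return self

    def predict(self, X):
        return [self.majority_ for _ in X]

clf = MajorityClassifier().fit([[0], [1], [2]], ["a", "b", "a"])
print(clf.predict([[5], [6]]))  # -> ['a', 'a']
```

Following these conventions (cheap __init__, underscore-suffixed fitted attributes, fit returning self) is what lets an estimator pass check_estimator and work inside Pipeline and GridSearchCV.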
Sincerely, Aghiles Paper ref.: Watkins, A., Timmis, J., & Boggess, L. (2004). Artificial immune recognition system (AIRS): An immune-inspired supervised learning algorithm. *Genetic Programming and Evolvable Machines*, *5*(3), 291-317. -------------- next part -------------- An HTML attachment was scrubbed... URL: From rth.yurchak at pm.me Mon Nov 26 15:07:24 2018 From: rth.yurchak at pm.me (Roman Yurchak) Date: Mon, 26 Nov 2018 20:07:24 +0000 Subject: [scikit-learn] Recurrent questions about speed for TfidfVectorizer In-Reply-To: <46dd3561-a70c-ea18-282f-26d34b87cf06@gmail.com> References: <46dd3561-a70c-ea18-282f-26d34b87cf06@gmail.com> Message-ID: Tries are interesting, but it appears that while they use less memory than dicts/maps, they are generally slower than dicts for a large number of elements. See e.g. https://github.com/pytries/marisa-trie/blob/master/docs/benchmarks.rst. This is also consistent with the results in the below-linked CountVectorizer PR that aimed to use tries, I think. Though maybe e.g. MARISA-Trie (and trie libraries available in Python generally) did improve significantly in the 5 years since https://github.com/scikit-learn/scikit-learn/issues/2639 was done. The thing is also that even HashingVectorizer, which doesn't need to handle the vocabulary, is only moderately faster, so using a better data structure for the vocabulary might give us its performance at best. -- Roman On 26/11/2018 16:28, Andreas Mueller wrote: > I think tries might be an interesting datastructure, but it really > depends on where the bottleneck is. > I'm really surprised they are not used more, but maybe that's just > because implementations are missing?
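As a toy illustration of the dict-vs-trie trade-off discussed here: a minimal pure-Python character trie for vocabulary indexing, next to the plain dict CountVectorizer uses. Real trie contenders such as marisa-trie are C extensions, so this sketch shows only the data-structure shape, not the speed difference:

```python
# Pure-Python toy: a character trie assigning vocabulary column indices,
# compared against the plain dict CountVectorizer uses today.
# Illustration of structure only, not a performance benchmark.
class TrieVocabulary:
    def __init__(self):
        self.root = {}
        self.size = 0

    def index(self, token):
        # Walk/extend one node per character; "#idx" stores the column
        # index at the terminal node (it cannot clash with 1-char keys).
        node = self.root
        for ch in token:
            node = node.setdefault(ch, {})
        if "#idx" not in node:
            node["#idx"] = self.size
            self.size += 1
        return node["#idx"]

dict_vocab = {}
trie = TrieVocabulary()
for token in ["cat", "car", "dog", "cat"]:
    d = dict_vocab.setdefault(token, len(dict_vocab))
    assert d == trie.index(token)  # both assign the same stable indices
print(dict_vocab)  # -> {'cat': 0, 'car': 1, 'dog': 2}
```

The trie shares prefixes ("cat"/"car" share two nodes), which is where the memory savings come from; the per-character pointer chasing is also where the slowdown relative to a single hashed dict lookup comes from.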
> > On 11/26/18 8:39 AM, Roman Yurchak via scikit-learn wrote: >> Hi Matthieu, >> >> if you are interested in general questions regarding improving >> scikit-learn performance, you might be want to have a look at the draft >> roadmap >> https://github.com/scikit-learn/scikit-learn/wiki/Draft-Roadmap-2018 -- >> there is a lot topics where suggestions / PRs on improving performance >> would be very welcome. >> >> For the particular case of TfidfVectorizer, it is a bit different from >> the rest of the scikit-learn code base in the sense that it's not >> limited by the performance of numerical calculation but rather that of >> string processing and counting. TfidfVectorizer is equivalent to >> CountVectorizer + TfidfTransformer and the later has only a marginal >> computational cost. As to CountVectorizer, last time I checked, its >> profiling was something along the lines of, >> - part regexp for tokenization (see token_pattern.findall) >> - part token counting (see CountVectorizer._count_vocab) >> - and a comparable part for all the rest >> >> Because of that, porting it to Cython is not that immediate, as one is >> still going to use CPython regexp and token counting in a dict. For >> instance, HashingVectorizer implements token counting in Cython -- it's >> faster but not that much faster. Using C++ maps or some less common >> structures have been discussed in >> https://github.com/scikit-learn/scikit-learn/issues/2639 >> >> Currently, I think, there are ~3 main ways performance could be improved, >> 1. Optimize the current implementation while remaining in Python. >> Possible but IMO would require some effort, because there are not much >> low hanging fruits left there. Though a new look would definitely be good. >> >> 2. Parallelize computations. There was some earlier discussion about >> this in scikit-learn issues, but at present, the better way would >> probably be to add it in dask-ml (see >> https://github.com/dask/dask-ml/issues/5). 
HashingVectorizer is already >> supported. Someone would need to implement CountVectorizer. >> >> 3. Rewrite part of the implementation in a lower level language (e.g. >> Cython). The question is how maintainable that would be, and whether the >> performance gains would be worth it. Now that Python 2 will be dropped, >> at least not having to deal with Py2/3 compatibility for strings in >> Cython might make things a bit easier. Though, if the processing is in >> Cython it might also make using custom tokenizers/analyzers more difficult. >> >> On a related topic, I have been experimenting with implementing part >> of this processing in Rust lately: >> https://github.com/rth/text-vectorize. So far it looks promising. >> Though, of course, it will remain a separate project because of language >> constraints in scikit-learn. >> >> In general if you have thoughts on things that can be improved, don't >> hesitate to open issues, > > _______________________________________________ > scikit-learn mailing list > scikit-learn at python.org > https://mail.python.org/mailman/listinfo/scikit-learn > From t3kcit at gmail.com Tue Nov 27 11:40:10 2018 From: t3kcit at gmail.com (Andreas Mueller) Date: Tue, 27 Nov 2018 11:40:10 -0500 Subject: [scikit-learn] [ANN] Scikit-learn 0.20.1 released Message-ID: <8b363527-ac56-4a3b-c1f4-e0b8093f7cd1@gmail.com> Hey Everybody. I'm happy to announce that we released scikit-learn 0.20.1. This is a minor release containing mostly bugfixes and small improvements, though it's probably one of the bigger minor releases we've done. In particular there've been several enhancements to the ColumnTransformer, fetch_openml and parallelization with joblib. Another big change is that starting with this version we suggest users don't use the version of joblib that we include in scikit-learn and rather use joblib directly. 
You can find the full release notes here: https://scikit-learn.org/stable/whats_new.html#version-0-20-1 A big thank you to everybody who contributed! Best, Andy
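Following the advice above to depend on joblib directly rather than on the copy vendored under sklearn.externals, a minimal sketch of the standalone-joblib pattern (assumes joblib is installed as its own package, e.g. via `pip install joblib`):

```python
# Sketch of using joblib directly, as the 0.20.1 announcement suggests
# (instead of `from sklearn.externals import joblib`).
import joblib

def square(x):
    return x * x

# delayed() wraps the call lazily; Parallel executes the whole batch
# (n_jobs=1 here just keeps the toy example deterministic and cheap).
results = joblib.Parallel(n_jobs=1)(
    joblib.delayed(square)(i) for i in range(4)
)
print(results)  # -> [0, 1, 4, 9]
```

The same `joblib.dump`/`joblib.load` calls used for model persistence move over unchanged; only the import path differs.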