From jk2k.net at gmail.com Sat Oct 2 11:25:25 2021
From: jk2k.net at gmail.com (J K)
Date: Sat, 2 Oct 2021 11:25:25 -0400
Subject: [scikit-learn] ROC convex hulls design question
Message-ID: <3CC1A986-E296-430E-86B9-EF9B36E0A3DC@gmail.com>

Dear sklearn mailing list,

I love all the wonderful ways scikit-learn has made good practices in ML more accessible to so many! Thanks for all of that!

I'm wondering if there is a design reason the default behavior for ROC generation (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html) doesn't return the convex hull of the ROC?

In the default ROC computation, the resulting ROCs aren't on their convex hulls (https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.ConvexHull.html) even though points on the convex hulls are achievable performance. So the default ROCs returned are suboptimal. That's a point made in Tom Fawcett's ROC 101 paper (https://www.math.ucdavis.edu/~saito/data/roc/fawcett-roc.pdf) that was cited in the sklearn docs. He writes:

"More generally, a classifier is potentially optimal if and only if it lies on the convex hull of the set of points in ROC space. The convex hull of the set of points in ROC space is called the ROC convex hull (ROCCH) of the corresponding set of classifiers."

Apologies if this is already answered somewhere else... I searched and could only find this apparently abandoned repo: https://github.com/tfawcett/pycost

I've implemented an ROC convex hull myself and have found significant performance estimate improvements just from using the convex hull and am wondering if there was some reason this wasn't implemented as the default.

Thanks,
-johnk-
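For readers unfamiliar with the ROC convex hull discussed in the message above, here is a minimal sketch of one way to compute it. It is not the poster's implementation (which is not shown) and not part of scikit-learn: the helper names `roc_convex_hull` and `auc_trapezoid` are hypothetical, the input is assumed to be `fpr`/`tpr` sequences such as those returned by `sklearn.metrics.roc_curve`, and the hull is built with a standard monotone-chain scan rather than `scipy.spatial.ConvexHull`. The sample points are made up for illustration.

```python
def _cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def roc_convex_hull(fpr, tpr):
    """Upper convex hull of ROC points, returned as (fpr, tpr) pairs.

    Hypothetical helper: the trivial classifiers (0, 0) and (1, 1)
    are always added before the hull is computed.
    """
    points = sorted(set(zip(fpr, tpr)) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in points:
        # Drop the last hull point while it lies on or below the
        # segment joining its predecessor to the new point.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as sorted (x, y) pairs."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up ROC points, in the order roc_curve would return them.
fpr = [0.0, 0.0, 0.25, 0.5, 0.5, 1.0]
tpr = [0.0, 0.5, 0.5, 0.75, 1.0, 1.0]

curve = list(zip(fpr, tpr))
hull = roc_convex_hull(fpr, tpr)
auc_curve = auc_trapezoid(curve)
auc_hull = auc_trapezoid(hull)
```

As Fawcett's paper explains, a point on a hull segment between two operating points is reached by randomly choosing between the two corresponding thresholds, so the hull describes a randomized classifier rather than any single threshold. A hull fitted and evaluated on the same data may also give an optimistic estimate.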
From gael.varoquaux at normalesup.org Tue Oct 5 10:09:48 2021
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Tue, 5 Oct 2021 16:09:48 +0200
Subject: [scikit-learn] [TC Vote] Technical Committee vote: line length
In-Reply-To: <482f3b2c-fcff-719b-aa44-6f3c2d4afc0b@gmail.com>
References: <20210726212619.54iy56wbl4sdbe3z@phare.normalesup.org> <482f3b2c-fcff-719b-aa44-6f3c2d4afc0b@gmail.com>
Message-ID: <20211005140948.kjyj35omefzeppcu@phare.normalesup.org>

Hi everyone,

I left on vacation and forgot about this (and did not express my vote). The TC has had plenty of time to vote; my own vote is in favor of the consensus among very active developers.

My count of the expressed votes is the following:

- Keep current 88 characters:
  Olivier Grisel
  Joel Nothman
  Gaël Varoquaux

- Revert to 79 characters:
  Alex Gramfort
  Adrin Jalali

- Answer with no preference expressed:
  Roman Yurchak

So the decision is to use 88 chars, which means no action is needed.

Thank you everyone!

Gaël

On Mon, Aug 02, 2021 at 11:15:48AM +0200, Roman Yurchak wrote:
> I also don't have a strong opinion on this, and generally I'm just happy
> that black migration happened.
> Still with a slight preference for 88 characters as the default.
> On 28/07/2021 18:34, Olivier Grisel wrote:
> > Many very active core devs not represented in the TC voted for 88 and
> > my previous vote for 79 was not that strong. So I feel that I should
> > now vote for 88:
> > Keep current 88 characters:
> > Olivier
> > Revert to 79 characters:
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn

--
Gael Varoquaux
Research Director, INRIA
http://gael-varoquaux.info http://twitter.com/GaelVaroquaux

From olivier.grisel at ensta.org Wed Oct 6 10:42:14 2021
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Wed, 6 Oct 2021 16:42:14 +0200
Subject: [scikit-learn] scikit-learn office hours on Friday Oct.
8 2021
Message-ID:

Hi all,

Some of us will be online on the scikit-learn discord this Friday at 15:00 UTC and 20:00 UTC.

First time and occasional contributors are welcome to join us on Discord using this invitation link:

https://discord.gg/YBdN45kD

The focus of these office hour sessions is to answer questions about contributing to scikit-learn. We can also split into break out audio/text channels and do pair programming or live reviewing of forgotten pull requests with screen sharing.

We can also try to help you craft minimal reproduction cases for bug reports to get a higher likelihood of resolution (e.g. https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports).

If this experiment is successful, we will probably hold this kind of office hours on a regular basis.

See you soon on discord!

--
Olivier

From g.lemaitre58 at gmail.com Fri Oct 8 10:21:55 2021
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Fri, 8 Oct 2021 16:21:55 +0200
Subject: [scikit-learn] scikit-learn office hours on Friday Oct. 8 2021
In-Reply-To:
References:
Message-ID:

I see that Olivier made a small mistake. I will be holding the office hours from 18:00 to 19:00 UTC. So there is no office hour from 19:00 to 20:00 UTC.

Cheers,
--
Guillaume Lemaitre
Scikit-learn @ Inria Foundation
https://glemaitre.github.io/

> On 6 Oct 2021, at 16:42, Olivier Grisel wrote:
>
> Hi all,
>
> Some of us will be online on the scikit-learn discord this Friday at
> 15:00 UTC and 20:00 UTC.
>
> First time and occasional contributors are welcome to join us to
> discord using this invitation link:
>
> https://discord.gg/YBdN45kD
>
> The focus of these office hour sessions is to answer questions about
> contributing to scikit-learn. We can also split into break out
> audio/text channels and do pair programming or live reviewing of
> forgotten pull requests with screen sharing.
>
> We can also try to assist you into crafting minimal reproduction cases
> for bug reports to get a higher likelihood of resolution (e.g.
> https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports).
>
> If this experiment is successful, we will probably hold this kind of
> office hours on a regular basis.
>
> See you soon on discord!
>
> --
> Olivier

From olivier.grisel at ensta.org Fri Oct 8 10:52:35 2021
From: olivier.grisel at ensta.org (Olivier Grisel)
Date: Fri, 8 Oct 2021 16:52:35 +0200
Subject: [scikit-learn] scikit-learn office hours on Friday Oct. 8 2021
In-Reply-To:
References:
Message-ID:

To summarize, the office hours for today are:

- 15:00-16:00 UTC / 17:00-18:00 CEST (this one starts in less than 10min)
- 18:00-19:00 UTC / 20:00-21:00 CEST (with Guillaume)

Sorry for the confusion and see you soon.

--
Olivier

From reshama.stat at gmail.com Mon Oct 11 08:00:00 2021
From: reshama.stat at gmail.com (Reshama Shaikh)
Date: Mon, 11 Oct 2021 08:00:00 -0400
Subject: [scikit-learn] [Data Umbrella] AFME (Africa & Middle East) scikit-learn open source sprint (scikit-learn)
In-Reply-To:
References:
Message-ID:

Hello,

At this time, we have a few spots open for the upcoming October 23 online scikit-learn sprint organized by Data Umbrella. If you reside outside of the Africa and Middle East region, you are now able to apply.
https://afme2021rc.dataumbrella.org/home

Note 1: we offer a stipend of $10 USD to cover the cost of internet access, and you can indicate such on your application.
Note 2: if you need a translator, please indicate so on your application.

Key Notes:
a) There is a pre-sprint event on Saturday October 16 from 5-6pm EAT.
This pre-sprint event is *optional* and an opportunity to answer any questions in general and aid in setting up your virtual environment.
b) Sprint is on *Saturday, October 23 at 5pm - 9pm EAT (East Africa Time)* on our Discord server.
c) There is a post-sprint event on Saturday November 23 from 5-6pm EAT. This post-sprint event is *optional* and an opportunity to ask the core devs questions on open pull requests.
d) There are 3-4 hours of pre-work for the sprint. Here is the checklist: https://afme2021rc.dataumbrella.org/about/prep-work

Please feel free to send any questions to me off the mailing list.

Best,
Reshama

Reshama Shaikh
she/her
Blog | Twitter | LinkedIn | GitHub

Data Umbrella
NYC PyLadies

On Sat, Sep 25, 2021 at 5:05 PM Reshama Shaikh wrote:

> Hello,
>
> Data Umbrella is organizing a scikit-learn sprint for this October 23,
> with a focus on **Africa and the Middle East**. This event is free.
>
> A sprint is a 4-hour hands-on hackathon where we work on beginner issues
> in the scikit-learn GitHub repository. Participants will be paired with
> another person. There will be core contributors available to answer any
> questions.
>
> Event website is: https://afme2021rc.dataumbrella.org
> We encourage folks to read the website and then complete the application.
>
> The event can be shared in these ways:
> - Retweet: https://twitter.com/DataUmbrella/status/1435972074842034184
> - Share post on LinkedIn:
> https://www.linkedin.com/feed/update/urn:li:activity:6841738994305294336/
>
> Please feel free to contact me if you have any questions.
>
> Cheers,
> Reshama Shaikh
> she/her
From reshama.stat at gmail.com Mon Oct 11 08:00:00 2021
From: reshama.stat at gmail.com (Reshama Shaikh)
Date: Mon, 11 Oct 2021 08:00:00 -0400
Subject: [scikit-learn] Open Source: sustainability and etiquette
In-Reply-To:
References:
Message-ID:

Hello,

Adding another resource to this page entitled "Open Source Sustainability". [a]

Keynote presentation [b] (video is 30 minutes) by a professor at Carnegie Mellon University.
Her talk title: *Laura Dabbish - Diversity and inclusion in open source digital infrastructure projects*

She discusses her research into open source and key takeaways.

[a] https://www.dataumbrella.org/open-source/open-source-sustainability
[b] https://youtu.be/h6OkCbEd1AE

---
Reshama Shaikh
she/her

On Thu, Aug 5, 2021 at 10:38 AM Reshama Shaikh wrote:

> Hello,
> I found the video, it's from 2017. It's by Heather Miller, a professor at
> CMU. The 40-minute talk is entitled: The Dramatic Consequences of the
> Open Source Revolution [a]
>
> Brigitta,
> Heather references Nadia Eghbal's book in her talk, which I also added to
> my list. [b]
>
> Adrin,
> I added CHAOSS to the list as well. They have a mailing list which I have
> subscribed to.
>
> [a] https://youtu.be/K4mVuxcimWk
> [b] https://www.dataumbrella.org/open-source/open-source-sustainability
>
> Reshama Shaikh
> she/her
> Blog | Twitter
> | LinkedIn | GitHub
>
> Data Umbrella
> NYC PyLadies
>
> On Mon, Apr 19, 2021 at 6:51 PM Brigitta Sipocz wrote:
>
>> Hi,
>>
>> I've also very much liked Nadia Eghbal's book: Working in public; The
>> making and maintenance of open source software. I haven't yet attended a
>> conference where she was a speaker, but I'm certain there are some relevant
>> recordings on youtube.
>>
>> Cheers,
>> Brigitta
>>
>> On Mon, 19 Apr 2021 at 06:27, Adrin wrote:
>>
>>> This is a really good initiative Reshama, thanks for sharing.
>>>
>>> Have you seen CHAOSScon talks and activities?
>>> They're really good, and
>>> touch on a lot of really good stuff when it comes to open source
>>> communities and sustainability.
>>> Eg.: https://chaoss.community/chaosscon-2020-eu/
>>>
>>> Cheers,
>>> Adrin
>>>
>>> On Fri, Apr 16, 2021 at 4:26 PM Reshama Shaikh
>>> wrote:
>>>
>>>> Hello,
>>>> I've seen some excellent resources that have explained open source, its
>>>> sustainability, challenges and *indirectly, the etiquette*.
>>>>
>>>> I am starting to compile the list here [a].
>>>>
>>>> This keynote by Stuart Geiger is a must-watch: The Invisible Work of
>>>> Maintaining & Sustaining Open Source Software [b]
>>>>
>>>> There is one more video by Emily someone who was at Microsoft, but is
>>>> now a professor somewhere, and I am trying to track that video down. I
>>>> think it's from 2017. I'll add it to the list once I find it. If anyone
>>>> knows the full name of the speaker, please share.
>>>>
>>>> [a]
>>>> https://www.dataumbrella.org/open-source/open-source-sustainability
>>>>
>>>> [b]
>>>> https://www.youtube.com/watch?v=PM3iltcaIL8
>>>>
>>>> Best,
>>>> Reshama
>>>> ---
>>>> Reshama Shaikh
>>>> she/her
>>>> Blog | Twitter
>>>> | LinkedIn
>>>> | GitHub
>>>>
>>>> Data Umbrella
>>>> NYC PyLadies
From gael.varoquaux at normalesup.org Wed Oct 13 10:40:50 2021
From: gael.varoquaux at normalesup.org (Gael Varoquaux)
Date: Wed, 13 Oct 2021 16:40:50 +0200
Subject: [scikit-learn] DirtyData and the SuperVectorizer, for non-normalized dataframes
Message-ID: <20211013144050.ssikmlt6gz6u4ijy@phare.normalesup.org>

Dear scikit-learn community,

I would like to announce a new release of dirty-cat, which strives to facilitate machine learning on non-curated categories: it is robust to morphological variants, such as typos.

The new big feature, which I think is of interest to many, is the "SuperVectorizer", that strives to readily vectorize a pandas dataframe:
https://dirty-cat.github.io/stable/auto_examples/01_dirty_categories.html#example-super-vectorizer

Of course, such an object is full of heuristics. We have tuned them empirically, but we expect more progress in the long term, as we build bigger databases of dataframes that are difficult to vectorize.

We'd love people to join the adventure, it's been fun so far.

Cheers,

Gaël

--
Gael Varoquaux
Research Director, INRIA
http://gael-varoquaux.info http://twitter.com/GaelVaroquaux

From acojugo at gmail.com Thu Oct 14 08:29:23 2021
From: acojugo at gmail.com (Aco Jugo)
Date: Thu, 14 Oct 2021 14:29:23 +0200
Subject: [scikit-learn] (no subject)
Message-ID:

-------------- next part --------------
An HTML attachment was scrubbed...

From thomasjpfan at gmail.com Mon Oct 18 12:09:09 2021
From: thomasjpfan at gmail.com (Thomas J. Fan)
Date: Mon, 18 Oct 2021 12:09:09 -0400
Subject: [scikit-learn] scikit-learn monthly developer meeting: Monday October 25th 2021
Message-ID:

Dear all,

The scikit-learn developer monthly meeting will take place on Monday October 25th at 1PM UTC.
- Video call link: https://meet.google.com/ews-uszu-djs
- Meeting notes / agenda: https://hackmd.io/0yokz72CTZSny8y3Re648Q
- Local times: https://www.timeanddate.com/worldclock/meetingdetails.html?year=2021&month=10&day=25&hour=13&min=0&sec=0&p1=1440&p2=240&p3=248&p4=195&p5=179&p6=224

The goal of this meeting is to discuss ongoing development topics for the project. Everybody is welcome.

As usual, please follow the code of conduct of the project: https://github.com/scikit-learn/scikit-learn/blob/main/CODE_OF_CONDUCT.md

Regards,
Thomas

From g.lemaitre58 at gmail.com Mon Oct 25 07:25:33 2021
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Mon, 25 Oct 2021 13:25:33 +0200
Subject: [scikit-learn] [ANN] scikit-learn 1.0.1 is online!
Message-ID:

scikit-learn 1.0.1 is out on pypi.org and conda-forge!

This is a small maintenance release that fixes a couple of regressions:
https://scikit-learn.org/dev/whats_new/v1.0.html#version-1-0-1

You can upgrade with pip as usual:

    pip install -U scikit-learn

The conda-forge builds will be available shortly, which you can then install using:

    conda install -c conda-forge scikit-learn

Thanks again to all the contributors.

On behalf of the scikit-learn maintainer team.
--
Guillaume Lemaitre
Scikit-learn @ Inria Foundation
https://glemaitre.github.io/

From thomasjpfan at gmail.com Thu Oct 28 18:06:10 2021
From: thomasjpfan at gmail.com (Thomas J. Fan)
Date: Thu, 28 Oct 2021 18:06:10 -0400
Subject: [scikit-learn] scikit-learn office hours on Friday Oct. 29 2021
Message-ID:

Hi all,

Some of us will be online on the scikit-learn discord this Friday at 11am ET / 15:00 UTC / 17:00 CEST.
First time and occasional contributors are welcome to join us on Discord using this invitation link:

https://discord.gg/YBdN45kD

The focus of these office hour sessions is to answer questions about contributing to scikit-learn. We can also split into break out audio/text channels and do pair programming or live reviewing of forgotten pull requests with screen sharing.

We can also try to help you craft minimal reproduction cases for bug reports to get a higher likelihood of resolution (e.g. https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports).

Please note, our Code of Conduct applies: https://github.com/scikit-learn/scikit-learn/blob/main/CODE_OF_CONDUCT.md

If this experiment is successful, we will probably hold this kind of office hours on a regular basis.

See you soon on discord!

--
Thomas

From g.lemaitre58 at gmail.com Sat Oct 30 05:18:41 2021
From: g.lemaitre58 at gmail.com (Guillaume Lemaître)
Date: Sat, 30 Oct 2021 11:18:41 +0200
Subject: [scikit-learn] New core dev: Julien Jerphanion
Message-ID:

The scikit-learn core development team has welcomed a new member, Julien Jerphanion, who has contributed code, reviews, and documentation since this March (aside from occasional contributions in the past).

Congratulations and welcome, Julien!

On behalf of the scikit-learn team
--
Guillaume Lemaitre
Scikit-learn @ Inria Foundation
https://glemaitre.github.io/

From rth.yurchak at gmail.com Sat Oct 30 05:56:02 2021
From: rth.yurchak at gmail.com (Roman Yurchak)
Date: Sat, 30 Oct 2021 11:56:02 +0200
Subject: [scikit-learn] New core dev: Julien Jerphanion
In-Reply-To:
References:
Message-ID:

Congratulations, Julien, and thanks for all your work!
Roman

On 30/10/2021 11:18, Guillaume Lemaître wrote:
> The scikit-learn core development team has welcomed a new member, Julien
> Jerphanion, who has contributed code, reviews, and documentation since
> this March (aside from occasional contributions in the past).
>
> Congratulation and welcome Julien!
>
> On the behalf of the scikit-learn team
> --
> Guillaume Lemaitre
> Scikit-learn @ Inria Foundation
> https://glemaitre.github.io/

From matematica.a3k at gmail.com Sun Oct 31 13:07:57 2021
From: matematica.a3k at gmail.com (Matemática A3K)
Date: Sun, 31 Oct 2021 14:07:57 -0300
Subject: [scikit-learn] Help interpreting decision function plot
Message-ID:

Hi!

I have been building a tool that integrates statistical engines, especially scikit-learn, with django, called django-ai. With that tool, I have built another, covid-ht, which should showcase the power of those together. That tool is meant to help health professionals with classification tasks based on measurements.

The tool is heading to its first release as a technology preview, and in this process I have faced a release-blocker issue for which I would like to ask for your help: I can't find a consistent interpretation of the graphs.

The graphs are called "conditional decision functions", where each one is the contour of the decision function of a classifier for an observation in 2 variables while leaving the others fixed.

The graphs show classification regions as expected, but my initial interpretation seems wrong (commented out). If that explanation was good, I would expect that perturbing one variable in a direction where the graph shows another class should switch the classification, as the remaining variables are fixed and that should be the value that the classifier uses to decide, which is plotted in that plane.
That is not happening, as you may check here (the classifier being used is a Histogram-based Gradient Boosting Classification Tree).

Any insight about the situation would be highly appreciated. Thanks in advance.
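The message above does not show how its plots are computed, so the following is only a sketch of the construction it describes: a decision function evaluated on a 2-D grid over two variables while the others stay fixed at the observation's values. Everything here is an assumption made for illustration: `conditional_slice` is a hypothetical helper, and `toy_decision` is a made-up linear function standing in for a fitted classifier's decision function, not the Histogram-based Gradient Boosting model from the message.

```python
def conditional_slice(decision_function, observation, i, j, grid_i, grid_j):
    """Evaluate decision_function on a grid over features i and j.

    All other features keep the values they have in `observation`.
    Returns a list of rows: values[r][c] is the decision value at
    feature j = grid_j[r] and feature i = grid_i[c].
    """
    values = []
    for vj in grid_j:
        row = []
        for vi in grid_i:
            point = list(observation)
            point[i], point[j] = vi, vj
            row.append(decision_function(point))
        values.append(row)
    return values


def toy_decision(x):
    """Hypothetical stand-in for a classifier: positive means class 1."""
    return x[0] + 2.0 * x[1] - x[2] - 1.0


observation = [0.25, 0.0, 0.5]
grid = [k / 4 for k in range(-4, 9)]  # -1.0, -0.75, ..., 2.0

slice_values = conditional_slice(toy_decision, observation, 0, 1, grid, grid)

# Row of the slice that passes through the observation (feature 1 fixed).
row = grid.index(observation[1])

# Perturb only feature 0 to a grid value where the slice shows the other class.
col_far = grid.index(2.0)
perturbed = list(observation)
perturbed[0] = grid[col_far]

same_plane_value = slice_values[row][col_far]
direct_value = toy_decision(perturbed)
```

In this toy setting the value read off the slice and the value obtained by perturbing the observation directly are the same number, which is the behaviour the message expects. With a real pipeline, a mismatch between the two might indicate that the grid is evaluated with different fixed values, or in a different feature space (for example after preprocessing), than the prediction; that is a hypothesis to test, not a diagnosis.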