[scikit-learn] GSoC 2017 : "Parallel Decision Tree Building"

Jacob Schreiber jmschreiber91 at gmail.com
Wed Mar 22 17:08:45 EDT 2017


Hi Aman,

Likely the easiest way to parallelize decision tree building is to
parallelize the search for the best split at each node, since that search
checks every non-constant feature and each feature can be scanned
independently. Several other approaches focus on parallelizing tree
building in the streaming or distributed cases, which we are not
interested in at the moment (though partially fitting decision trees is a
good separate project).
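
As a rough illustration of per-feature parallelism, here is a minimal
sketch in plain Python. It is not scikit-learn's actual splitter, and the
names `gini`, `best_split_for_feature`, `best_split` and `n_jobs` are made
up for this example. It assumes a classification task scored by weighted
Gini impurity, with candidate thresholds at midpoints between sorted values.

```python
from concurrent.futures import ThreadPoolExecutor


def gini(counts, total):
    """Gini impurity of a node, given a dict of class counts."""
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def best_split_for_feature(X, y, feature):
    """Scan one feature and return (impurity, feature, threshold).

    Returns None when the feature is constant, so it offers no split.
    """
    n = len(y)
    order = sorted(range(n), key=lambda i: X[i][feature])
    left, right = {}, {}
    for label in y:
        right[label] = right.get(label, 0) + 1
    best = None
    for pos in range(1, n):
        i = order[pos - 1]
        # Move one sample from the right child to the left child.
        left[y[i]] = left.get(y[i], 0) + 1
        right[y[i]] -= 1
        lo, hi = X[i][feature], X[order[pos]][feature]
        if lo == hi:
            continue  # cannot split between two equal values
        score = (pos * gini(left, pos) + (n - pos) * gini(right, n - pos)) / n
        if best is None or score < best[0]:
            best = (score, feature, (lo + hi) / 2.0)
    return best


def best_split(X, y, n_jobs=2):
    """Evaluate every feature concurrently and keep the lowest impurity."""
    n_features = len(X[0])
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        results = list(
            pool.map(lambda f: best_split_for_feature(X, y, f), range(n_features))
        )
    candidates = [r for r in results if r is not None]
    return min(candidates) if candidates else None
```

For example, with X = [[0.0, 5.0], [1.0, 5.0], [2.0, 5.0], [3.0, 5.0]] and
y = [0, 0, 1, 1], best_split(X, y) returns (0.0, 0, 1.5): the second
feature is constant and is skipped. In pure Python the GIL prevents these
threads from giving a real speedup, so the sketch only shows the structure
of the approach.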

As I mentioned in the GitHub issue, it is likely easier to focus on this
single issue for GSoC rather than treating it separately from multiclass
prediction, since it will provide similar speedups either way while being
more general.

It'd be great if you could add your experience directly to the gist, along
with links to prior work if you have any.

One major thing missing from this is a proposed timeline. Several projects
fail because they are overly ambitious or poorly managed time-wise.
Showing a timeline will help us manage the project later on, and ensure
that you're aware of what the steps of the project will be.

Thanks for the effort so far! Let me know when you've made updates.

Jacob

On Wed, Mar 22, 2017 at 12:55 AM, Aman Pratik <amanpratik10 at gmail.com>
wrote:

> Hello Developers,
>
> This is Aman Pratik. I am currently pursuing my B.Tech at the Indian
> Institute of Technology, Varanasi. After doing some research I have found
> some material on decision trees and parallelization. Hence, I propose my
> first draft for the project "Parallel Decision Tree Building" for GSoC 2017.
>
> Proposal : First Draft
> <https://github.com/amanp10/scikit-learn/wiki/GSoC-2017-:-Parallel-Decision-Tree-Building>
>
> Why me?
>
> I have been working in Python for the past 2 years and have a good
> understanding of Machine Learning algorithms. I am quite familiar with
> scikit-learn both as a user and a developer.
>
> These are the issues/PRs I have worked on, or am still working on, over
> the past few months.
>
> [MRG+1] Issue#5803 : Regression Test added #8112
> <https://github.com/scikit-learn/scikit-learn/pull/8112>
>
> [MRG] Issue#6673:Make a wrapper around functions that score an individual
> feature #8038 <https://github.com/scikit-learn/scikit-learn/pull/8038>
>
> [MRG] Issue #7987: Embarrassingly parallel "n_restarts_optimizer" in
> GaussianProcessRegressor #7997
> <https://github.com/scikit-learn/scikit-learn/pull/7997>
>
> My GitHub Profile: amanp10 <https://www.github.com/amanp10>
>
> I have worked with parallelization in one of my PRs, so I am not new to it.
> I have used Cython a couple of times, though as a beginner. I have not used
> decision trees much, but I am familiar with the theory and algorithm. Also,
> I am familiar with benchmark tests, unit tests and the other technical
> knowledge I would require for this project.
>
> Meanwhile, I have started studying the subject and am gaining experience
> with Cython. I am looking forward to guidance from the potential mentors or
> anyone willing to help.
>
> Thank You