[scikit-learn] GSoC 2017 : "Parallel Decision Tree Building"

Jacob Schreiber jmschreiber91 at gmail.com
Sun Mar 26 23:33:33 EDT 2017


Hi Aman

Thanks for the updates, it looks more complete now.

I don't see the benefit of considering three different parallelism
techniques. I'm not sure how you would do sample parallelism given that you
need to sort all of the samples. Maybe a merge sort? That doesn't seem like
the most efficient way to parallelize. I'd stick to parallelism across
features only, as you can get a great deal of efficiency out of that. It
also makes the problem more manageable.

I would also focus your application more on which parts of the code you
will need to change and less on the concepts. There is already a loop that
considers features sequentially and identifies the best one. The change is
basically to parallelize this loop in the best manner given the other code.
However, if the solution were as easy as changing the for loop to a
Parallel()(delayed(...)) pattern, we would have done it already. You should
specify what the challenges will be, and why it isn't just as simple as
that. Specifically, focus on what goes on in the criterion class that makes
it more difficult.
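
To make the shape of the problem concrete, here is a minimal sketch of a
feature-parallel split search. It is not the scikit-learn implementation:
the function names, the Gini scoring, and the list-based data layout are all
hypothetical, and it uses the standard library's ThreadPoolExecutor as a
stand-in for a Parallel()(delayed(...)) call. The assumption it illustrates
is that each feature's scan needs its own running left/right statistics; a
single criterion-like object updated in place by every worker would not be
safe to share.

```python
from concurrent.futures import ThreadPoolExecutor


def gini(counts):
    """Gini impurity of a node given per-class sample counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def best_split_for_feature(X, y, feature, n_classes):
    """Scan one feature; return (weighted impurity, feature, threshold).

    All mutable state (the running class counts) is local to this call,
    so concurrent calls for different features cannot interfere.
    """
    order = sorted(range(len(y)), key=lambda i: X[i][feature])
    left = [0] * n_classes
    right = [0] * n_classes
    for label in y:
        right[label] += 1

    best = (float("inf"), feature, None)
    n = len(y)
    for pos in range(n - 1):
        i, nxt = order[pos], order[pos + 1]
        # Move sample i from the right child to the left child.
        left[y[i]] += 1
        right[y[i]] -= 1
        lo, hi = X[i][feature], X[nxt][feature]
        if lo == hi:
            continue  # cannot split between equal values
        n_left = pos + 1
        score = (n_left * gini(left) + (n - n_left) * gini(right)) / n
        if score < best[0]:
            best = (score, feature, (lo + hi) / 2.0)
    return best


def best_split(X, y, n_classes, n_workers=2):
    """Evaluate every feature in a worker pool and keep the best result."""
    n_features = len(X[0])
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(
            lambda f: best_split_for_feature(X, y, f, n_classes),
            range(n_features),
        ))
    return min(results, key=lambda r: r[0])


# Toy data: feature 0 separates the two classes perfectly.
X = [[0.0, 5.0], [1.0, 1.0], [2.0, 4.0], [3.0, 2.0]]
y = [0, 0, 1, 1]
score, feature, threshold = best_split(X, y, n_classes=2)
```

Because this is pure Python, the threads here show only the structure and
will not run faster than the sequential loop on a standard CPython build.
The proposal would need to explain how the equivalent per-feature state is
handled in the Cython code.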

I also checked out your Gaussian process parallelization. It looked like it
wasn't speeding anything up because you were using the threading backend for
a Python function. The threading backend only helps when the work is done in
a Cython function that also releases the GIL; otherwise it won't help. Have you
tried using the multiprocessing backend? That would likely be easier.
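
A small way to see the threading problem for yourself, sketched with the
standard library rather than joblib (the function names and task sizes are
made up for illustration): a pure-Python loop holds the GIL, so running it
in a thread pool gives the same answers but, on a standard CPython build,
roughly the same wall time as running it serially.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def cpu_bound(n):
    """Pure-Python loop: holds the GIL for its whole duration."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_serial(tasks):
    """Run every task one after another in the calling thread."""
    return [cpu_bound(n) for n in tasks]


def run_in_pool(executor_cls, tasks, n_workers=2):
    """Run cpu_bound over tasks with the given executor class."""
    with executor_cls(max_workers=n_workers) as pool:
        return list(pool.map(cpu_bound, tasks))


def timed(fn, *args):
    """Return (result, elapsed seconds) for fn(*args)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


tasks = [200_000] * 4
serial_result, serial_seconds = timed(run_serial, tasks)
thread_result, thread_seconds = timed(run_in_pool, ThreadPoolExecutor, tasks)
```

Comparing serial_seconds with thread_seconds should show little or no gain
from the threads. Passing concurrent.futures.ProcessPoolExecutor as
executor_cls (inside an if __name__ == "__main__": guard on platforms that
spawn workers) runs the same tasks in separate processes, which is the
analogue of switching joblib from backend="threading" to the
multiprocessing backend.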

Jacob

On Sun, Mar 26, 2017 at 10:31 AM, Aman Pratik <amanpratik10 at gmail.com>
wrote:

> Hello Jacob,
> This is my second draft for the proposal,
>
> Proposal : Second Draft
> <https://github.com/amanp10/scikit-learn/wiki/GSoC-2017-:-Parallel-Decision-Tree-Building>
>
> It is incomplete in some places, mainly in the details. I will need a
> little more time for that. Meanwhile, I await your feedback and guidance.
>
> Thank You
>
>
>
> On 23 March 2017 at 02:38, Jacob Schreiber <jmschreiber91 at gmail.com>
> wrote:
>
>> Hi Aman
>>
>> Likely the easiest way to parallelize decision tree building is to
>> parallelize the finding of the best split at each node, as it checks every
>> non-constant feature for the best split. Several other approaches focus on
>> how to parallelize tree building in the streaming or distributed cases,
>> which we are not interested in at the moment (though partially fitting
>> decision trees is a good separate project).
>>
>> As I mentioned in the github issue, it is likely easier to focus on this
>> single issue for GSoC as opposed to making it distinct from the multiclass
>> prediction, as this will provide similar speedups either way but be more
>> general.
>>
>> It'd be great if you could add your experience directly to the gist and
>> perhaps links to prior work if you have any of those.
>>
>> Something major missing from this is a proposed timeline. Several
>> projects fail because they are overly ambitious or not well managed
>> time-wise. Showing a timeline will help us manage the project later on, and
>> ensure that you're aware of what the steps of the project will be.
>>
>> Thanks for the effort so far! Let me know when you've made updates.
>>
>> Jacob
>>
>> On Wed, Mar 22, 2017 at 12:55 AM, Aman Pratik <amanpratik10 at gmail.com>
>> wrote:
>>
>>> Hello Developers,
>>>
>>> This is Aman Pratik. I am currently pursuing my B.Tech from Indian
>>> Institute of Technology, Varanasi. After doing some research I have found
>>> some material on Decision Trees and Parallelization. Hence, I propose my
>>> first draft for the project "Parallel Decision Tree Building" for GSoC 2017.
>>>
>>> Proposal : First Draft
>>> <https://github.com/amanp10/scikit-learn/wiki/GSoC-2017-:-Parallel-Decision-Tree-Building>
>>>
>>> Why me?
>>>
>>> I have been working in Python for the past 2 years and have a good idea
>>> about Machine Learning algorithms. I am quite familiar with scikit-learn
>>> both as a user and a developer.
>>>
>>> These are the issues/PRs I have worked on, or am working on, over the
>>> past few months.
>>>
>>> [MRG+1] Issue#5803 : Regression Test added #8112
>>> <https://github.com/scikit-learn/scikit-learn/pull/8112>
>>>
>>> [MRG] Issue#6673:Make a wrapper around functions that score an
>>> individual feature #8038
>>> <https://github.com/scikit-learn/scikit-learn/pull/8038>
>>>
>>> [MRG] Issue #7987: Embarrassingly parallel "n_restarts_optimizer" in
>>> GaussianProcessRegressor #7997
>>> <https://github.com/scikit-learn/scikit-learn/pull/7997>
>>>
>>> My GitHub Profile: amanp10 <https://www.github.com/amanp10>
>>>
>>> I have worked with parallelization in one of my PRs, so I am not new to
>>> it. I have used Cython a couple of times, though as a beginner. I have not
>>> used decision trees much but I am familiar with the theory and algorithm.
>>> Also, I am familiar with Benchmark tests, Unit tests and other technical
>>> knowledge I would require for this project.
>>>
>>> Meanwhile, I have started studying the subject and am gaining
>>> experience with Cython. I am looking forward to guidance from the potential
>>> mentors or anyone willing to help.
>>>
>>> Thank You
>>>
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn at python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>>