[scikit-learn] GSoC 2017 : "Parallel Decision Tree Building"

Aman Pratik amanpratik10 at gmail.com
Mon Mar 27 01:44:42 EDT 2017


I will be occupied with my tests for a couple of days and will get back with
the changes as soon as possible.

In the Gaussian Process parallelization there was an error while using the
multiprocessing backend, which couldn't be solved by simple changes in the
code. Hence we had to drop the idea for the time being.




On 27 March 2017 at 09:03, Jacob Schreiber <jmschreiber91 at gmail.com> wrote:

> Hi Aman
>
> Thanks for the updates, it looks more complete now.
>
> I don't see what the benefit is of considering three different parallelism
> techniques. I'm not sure how you would do sample parallelism given that you
> need to sort all of the samples; maybe a merge sort? That doesn't seem like
> the most efficient manner of parallelization. I'd stick only to parallelism
> across features, as you can get a great deal of efficiency out of doing
> that. It also makes the problem more manageable.
>
> I would also focus your application more specifically on what parts of the
> code you will need to change, and make it less conceptual. There is already
> a loop to consider features sequentially and identify the best one. The
> change is basically to parallelize this in the best manner given the other
> code. However, if the solution were as easy as changing the for loop to a
> Parallel()(delayed) type scheme we would have done it already. You should
> specify what the challenges will be, and why it isn't just as simple as
> that. Specifically, focus on what goes on in the criterion class to make it
> more difficult.
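
A minimal sketch of the feature-parallel idea discussed above, under stated
assumptions: it is pure Python, it uses the standard library's
concurrent.futures thread pool as a stand-in for a Parallel()(delayed) style
call, and the names gini, best_split_for_feature and best_split are
hypothetical. It does not reflect scikit-learn's actual splitter or criterion
code. Each worker keeps its own left/right class counts, so no state is shared
between features; a single stateful criterion object shared by all features
would not allow that directly.

```python
from concurrent.futures import ThreadPoolExecutor


def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def best_split_for_feature(X, y, feature):
    """Scan one feature in sorted order.

    Returns (weighted_impurity, feature, threshold); threshold is None
    when the feature is constant and cannot be split.
    """
    n = len(y)
    order = sorted(range(n), key=lambda i: X[i][feature])
    left, right = {}, {}
    for label in y:
        right[label] = right.get(label, 0) + 1
    best = (float("inf"), feature, None)
    for pos in range(n - 1):
        i, j = order[pos], order[pos + 1]
        # Move sample i from the right child to the left child.
        left[y[i]] = left.get(y[i], 0) + 1
        right[y[i]] -= 1
        lo, hi = X[i][feature], X[j][feature]
        if lo == hi:
            continue  # no threshold fits between two equal values
        n_left = pos + 1
        weighted = (n_left * gini(left) + (n - n_left) * gini(right)) / n
        if weighted < best[0]:
            best = (weighted, feature, (lo + hi) / 2.0)
    return best


def best_split(X, y, n_workers=2):
    """Evaluate every feature independently, then keep the lowest impurity."""
    n_features = len(X[0])
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        candidates = list(
            pool.map(lambda f: best_split_for_feature(X, y, f), range(n_features))
        )
    return min(candidates, key=lambda c: c[0])
```

Because the per-feature function here is plain Python, the thread pool only
illustrates the structure (independent per-feature searches followed by a
reduction); it should not be expected to run faster than the sequential loop.
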
>
> I also checked out your Gaussian process parallelization. It looked like
> it wasn't speeding anything up because you were using a threading backend
> for a Python function. You can only use the threading backend with a Cython
> function where you also release the GIL; otherwise it won't help. Have you
> tried using the multiprocessing backend? That would likely be easier.
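
A small sketch of the point about backends, assuming CPython and using the
standard library's concurrent.futures executors as a stand-in for the
threading and multiprocessing backends mentioned above. The helper names
cpu_bound, run_serial, run_pooled and timed are hypothetical. In CPython only
one thread executes Python bytecode at a time, so a thread pool running a
pure-Python function returns the same results as the serial loop without the
work actually overlapping.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def cpu_bound(n):
    """Pure-Python loop; it never releases the GIL for long."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_serial(tasks):
    """Baseline: run every task one after another."""
    return [cpu_bound(n) for n in tasks]


def run_pooled(executor_cls, tasks, n_workers=2):
    """Run the tasks on the given executor class; results keep task order."""
    with executor_cls(max_workers=n_workers) as pool:
        return list(pool.map(cpu_bound, tasks))


def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

Comparing timed(run_serial, tasks) with timed(run_pooled, ThreadPoolExecutor,
tasks) on a few large values of n is one way to see whether threads help for a
given function. Passing ProcessPoolExecutor instead runs the tasks in separate
interpreter processes, at the cost of pickling the arguments and results; on
platforms that spawn new processes that call needs to sit under an
if __name__ == "__main__": guard.
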
>
> Jacob
>
> On Sun, Mar 26, 2017 at 10:31 AM, Aman Pratik <amanpratik10 at gmail.com>
> wrote:
>
>> Hello Jacob,
>> This is my second draft of the proposal:
>>
>> Proposal : Second Draft
>> <https://github.com/amanp10/scikit-learn/wiki/GSoC-2017-:-Parallel-Decision-Tree-Building>
>>
>> It is incomplete in some places, mainly in the details. I will need a
>> little more time for that. Meanwhile, I await your feedback and guidance.
>>
>> Thank You
>>
>>
>>
>> On 23 March 2017 at 02:38, Jacob Schreiber <jmschreiber91 at gmail.com>
>> wrote:
>>
>>> Hi Aman
>>>
>>> Likely the easiest way to parallelize decision tree building is to
>>> parallelize the finding of the best split at each node, as it checks every
>>> non-constant feature for the best split. Several other approaches focus on
>>> how to parallelize tree building in the streaming or distributed cases,
>>> which we are not interested in at the moment (though partially fitting
>>> decision trees is a good separate project).
>>>
>>> As I mentioned in the github issue, it is likely easier to focus on this
>>> single issue for GSoC as opposed to making it distinct from the multiclass
>>> prediction, as this will provide similar speedups either way but be more
>>> general.
>>>
>>> It'd be great if you could add your experience directly to the gist and
>>> perhaps links to prior work if you have any of those.
>>>
>>> Something major missing from this is a proposed timeline. Several
>>> projects fail because they are overly ambitious or not well managed
>>> time-wise. Showing a timeline will help us manage the project later on, and
>>> ensure that you're aware of what the steps of the project will be.
>>>
>>> Thanks for the effort so far! Let me know when you've made updates.
>>>
>>> Jacob
>>>
>>> On Wed, Mar 22, 2017 at 12:55 AM, Aman Pratik <amanpratik10 at gmail.com>
>>> wrote:
>>>
>>>> Hello Developers,
>>>>
>>>> This is Aman Pratik. I am currently pursuing my B.Tech from Indian
>>>> Institute of Technology, Varanasi. After doing some research I have found
>>>> some material on Decision Trees and Parallelization. Hence, I propose my
>>>> first draft for the project "Parallel Decision Tree Building" for GSoC 2017.
>>>>
>>>> Proposal : First Draft
>>>> <https://github.com/amanp10/scikit-learn/wiki/GSoC-2017-:-Parallel-Decision-Tree-Building>
>>>>
>>>> Why me?
>>>>
>>>> I have been working in Python for the past 2 years and have a good idea
>>>> about Machine Learning algorithms. I am quite familiar with scikit-learn
>>>> both as a user and a developer.
>>>>
>>>> These are the issues/PRs I have been working on over the past few
>>>> months.
>>>>
>>>> [MRG+1] Issue#5803 : Regression Test added #8112
>>>> <https://github.com/scikit-learn/scikit-learn/pull/8112>
>>>>
>>>> [MRG] Issue#6673:Make a wrapper around functions that score an
>>>> individual feature #8038
>>>> <https://github.com/scikit-learn/scikit-learn/pull/8038>
>>>>
>>>> [MRG] Issue #7987: Embarrassingly parallel "n_restarts_optimizer" in
>>>> GaussianProcessRegressor #7997
>>>> <https://github.com/scikit-learn/scikit-learn/pull/7997>
>>>>
>>>> My GitHub Profile: amanp10 <https://www.github.com/amanp10>
>>>>
>>>> I have worked with parallelization in one of my PRs, so I am not new to
>>>> it. I have used Cython a couple of times, though as a beginner. I have not
>>>> used Decision Trees much, but I am familiar with the theory and algorithm.
>>>> Also, I am familiar with benchmark tests, unit tests and other technical
>>>> knowledge I would require for this project.
>>>>
>>>> Meanwhile, I have started studying the subject and am gaining
>>>> experience with Cython. I am looking forward to guidance from the potential
>>>> mentors or anyone willing to help.
>>>>
>>>> Thank You
>>>>
>>>>
>>>> _______________________________________________
>>>> scikit-learn mailing list
>>>> scikit-learn at python.org
>>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>>
>>>>