<div dir="ltr">I will be occupied with my tests for a couple of days and will get back with the changes as soon as possible.<div><br></div><div>In the Gaussian Process parallelization, there was an error while using the multiprocessing backend that couldn't be solved by simple changes in the code. Hence we had to drop the idea for the time being.<br><br><br><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On 27 March 2017 at 09:03, Jacob Schreiber <span dir="ltr">&lt;<a href="mailto:jmschreiber91@gmail.com" target="_blank">jmschreiber91@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi Aman<div><br></div><div>Thanks for the updates; it looks more complete now.</div><div><br></div><div>I don't see what the benefit is of considering three different parallelism techniques. I'm not sure how you would do sample parallelism given that you need to sort all of the samples; maybe with a merge sort? That doesn't seem the most efficient manner of parallelization. I'd stick only to parallelism across features, as you can get a great deal of efficiency out of doing that. It also makes the problem more manageable.</div><div><br></div><div>I would also focus your application more specifically on what parts of the code you will need to change, and make it less conceptual. There is already a loop to consider features sequentially and identify the best one. The change is basically to parallelize this in the best manner given the other code. However, if the solution were as easy as changing the for loop to a Parallel()(delayed) type scheme, we would have done it already. You should specify what the challenges will be, and why it isn't just as simple as that. 
Specifically, focus on what goes on in the criterion class that makes it more difficult.</div><div><br></div><div>I also checked out your Gaussian process parallelization. It looked like it wasn't speeding anything up because you were using a threading backend for a Python function. The threading backend only helps with a Cython function in which you also release the GIL; otherwise it won't speed anything up. Have you tried using the multiprocessing backend? That would likely be easier.</div><span class="HOEnZb"><font color="#888888"><div><br></div><div>Jacob</div></font></span></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">On Sun, Mar 26, 2017 at 10:31 AM, Aman Pratik <span dir="ltr">&lt;<a href="mailto:amanpratik10@gmail.com" target="_blank">amanpratik10@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hello Jacob,<br>This is the second draft of my proposal:<div><br></div><div><a href="https://github.com/amanp10/scikit-learn/wiki/GSoC-2017-:-Parallel-Decision-Tree-Building" class="m_5021375416609195106m_7793438846393465852mt-detrack-inspected" target="_blank">Proposal : Second Draft</a></div><div><br></div><div>It is still incomplete in some places, mainly in the details. I will need a little more time for that. 
Meanwhile, I await your feedback and guidance.</div><div><br></div><div>Thank You<br><br><br></div><div><div class="m_5021375416609195106h5"><div class="gmail_extra"><br><div class="gmail_quote">On 23 March 2017 at 02:38, Jacob Schreiber <span dir="ltr">&lt;<a href="mailto:jmschreiber91@gmail.com" class="m_5021375416609195106m_7793438846393465852mt-detrack-inspected" target="_blank">jmschreiber91@gmail.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi Aman<div><br></div><div>Likely the easiest way to parallelize decision tree building is to parallelize the search for the best split at each node, since that search checks every non-constant feature. Several other approaches focus on how to parallelize tree building in the streaming or distributed cases, which we are not interested in at the moment (though partially fitting decision trees is a good separate project).</div><div><br></div><div>As I mentioned in the GitHub issue, it is likely easier to focus on this single issue for GSoC as opposed to making it distinct from the multiclass prediction, as this will provide similar speedups either way but be more general.</div><div><br></div><div>It'd be great if you could add your experience directly to the gist, and perhaps links to prior work if you have any.</div><div><br></div><div>Something major missing from this is a proposed timeline. Several projects fail because they are overly ambitious or not well managed time-wise. Showing a timeline will help us manage the project later on, and ensure that you're aware of what the steps of the project will be.</div><div><br></div><div>Thanks for the effort so far! 
Let me know when you've made updates.</div><div><br></div><div>Jacob</div></div><div class="gmail_extra"><br><div class="gmail_quote"><div><div class="m_5021375416609195106m_7793438846393465852h5">On Wed, Mar 22, 2017 at 12:55 AM, Aman Pratik <span dir="ltr">&lt;<a href="mailto:amanpratik10@gmail.com" class="m_5021375416609195106m_7793438846393465852mt-detrack-inspected" target="_blank">amanpratik10@gmail.com</a>&gt;</span> wrote:<br></div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="m_5021375416609195106m_7793438846393465852h5"><div dir="ltr"><div>Hello Developers,</div><div><br></div><div>This is Aman Pratik. I am currently pursuing my B.Tech at the Indian Institute of Technology, Varanasi. After doing some research, I have found some material on decision trees and parallelization. Hence, I propose my first draft for the project "Parallel Decision Tree Building" for GSoC 2017.</div><div><br></div><div><a href="https://github.com/amanp10/scikit-learn/wiki/GSoC-2017-:-Parallel-Decision-Tree-Building" class="m_5021375416609195106m_7793438846393465852mt-detrack-inspected" target="_blank">Proposal : First Draft</a><br></div><div><br></div><div><div>Why me?</div><div><br></div><div>I have been working in Python for the past 2 years and have a good understanding of machine learning algorithms. 
I am quite familiar with scikit-learn both as a user and a developer.</div><div><br></div><div>These are the issues and PRs I have worked on, or am working on, over the past few months.</div><div><br></div><div><a href="https://github.com/scikit-learn/scikit-learn/pull/8112" class="m_5021375416609195106m_7793438846393465852mt-detrack-inspected" target="_blank">[MRG+1] Issue#5803 : Regression Test added #8112</a></div><div><br></div><div><a href="https://github.com/scikit-learn/scikit-learn/pull/8038" class="m_5021375416609195106m_7793438846393465852mt-detrack-inspected" target="_blank">[MRG] Issue#6673:Make a wrapper around functions that score an individual feature #8038</a></div><div><br></div><div><a href="https://github.com/scikit-learn/scikit-learn/pull/7997" class="m_5021375416609195106m_7793438846393465852mt-detrack-inspected" target="_blank">[MRG] Issue #7987: Embarrassingly parallel "n_restarts_optimizer" in GaussianProcessRegressor #7997</a></div><div><br></div><div>My GitHub Profile: <a href="https://www.github.com/amanp10" class="m_5021375416609195106m_7793438846393465852mt-detrack-inspected" target="_blank">amanp10</a></div><div><br></div><div>I have worked with parallelization in one of my PRs, so I am not new to it. I have used Cython a couple of times, though only as a beginner. I have not used decision trees much, but I am familiar with the theory and the algorithm. I am also familiar with benchmark tests, unit tests and the other technical skills this project would require.</div><div><br></div><div>Meanwhile, I have started studying the subject and gaining experience with Cython. I am looking forward to guidance from the potential mentors or anyone willing to help.</div><div><br></div><div>Thank You</div></div><br></div>
<br></div></div>______________________________<wbr>_________________<br>
scikit-learn mailing list<br>
<a href="mailto:scikit-learn@python.org" class="m_5021375416609195106m_7793438846393465852mt-detrack-inspected" target="_blank">scikit-learn@python.org</a><br>
<a href="https://mail.python.org/mailman/listinfo/scikit-learn" rel="noreferrer" class="m_5021375416609195106m_7793438846393465852mt-detrack-inspected" target="_blank">https://mail.python.org/mailma<wbr>n/listinfo/scikit-learn</a><br>
<br></blockquote></div><br></div>
<br></blockquote></div><br></div></div></div></div>
<br></blockquote></div><br></div>
</div></div><br>
<br></blockquote></div><br></div>