<div dir="auto">Thank you very much...<div dir="auto">I will try alternatives<br><br><div data-smartmail="gmail_signature" dir="auto">Sasha Kacanski </div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Mar 16, 2017 8:28 AM, "Roman Yurchak" <<a href="mailto:rth.yurchak@gmail.com">rth.yurchak@gmail.com</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">If you run out of memory at the prediction step, splitting the test dataset in batches, then concatenating the results should work fine. Why would it "skew" the results?<br>
<br>
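To illustrate why batching should not change the output: a fitted model predicts each row independently, so predicting slice by slice and joining the pieces gives the same labels as one big call. Below is a minimal sketch of that idea; <code>predict_in_batches</code> and <code>ThresholdModel</code> are hypothetical names made up for this example, and the toy model only stands in for a fitted estimator with a <code>predict</code> method.<br>

```python
def predict_in_batches(model, X, batch_size=10000):
    """Call model.predict on consecutive row slices of X and join the results."""
    # Sparse matrices do not support len(), so prefer .shape when it exists.
    n_rows = X.shape[0] if hasattr(X, "shape") else len(X)
    predictions = []
    for start in range(0, n_rows, batch_size):
        batch = X[start:start + batch_size]
        predictions.extend(model.predict(batch))
    return predictions


class ThresholdModel:
    """Hypothetical stand-in for a fitted estimator; labels each row on its own."""

    def predict(self, rows):
        return [1 if sum(row) > 1.0 else 0 for row in rows]


model = ThresholdModel()
X_test = [[0.2, 0.1], [0.9, 0.8], [0.5, 0.6], [0.0, 0.3], [1.5, 0.0]]

full = model.predict(X_test)                               # one large call
batched = predict_in_batches(model, X_test, batch_size=2)  # three small calls
print(full == batched)
```

Only one batch of predictions is materialised at a time, so a smaller <code>batch_size</code> trades speed for lower peak memory.<br>
<br>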
70 GB of RAM seems huge. For comparison, here are some categorization benchmarks on a 700k text dataset that use on the order of 5-10 GB of RAM,<br>
<a href="https://github.com/FreeDiscovery/FreeDiscovery/issues/58" rel="noreferrer" target="_blank">https://github.com/FreeDiscove<wbr>ry/FreeDiscovery/issues/58</a><br>
though with fairly short documents, for other algorithms and with a smaller training set.<br>
<br>
You could also try reducing the size of your dictionary with hashing.<br>
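As a rough sketch of what hashing does here: each token is mapped to one of a fixed number of columns, so the feature count no longer grows with the vocabulary and no dictionary has to be kept in memory. The function <code>hashed_counts</code> below is a hypothetical illustration using only the standard library; it is not how scikit-learn implements it (its <code>HashingVectorizer</code> applies the same idea with a different hash function).<br>

```python
import zlib


def hashed_counts(text, n_features=16):
    """Map whitespace tokens to a fixed-length count vector via hashing."""
    vector = [0] * n_features
    for token in text.lower().split():
        # crc32 is stable across runs, unlike the built-in hash() for strings.
        index = zlib.crc32(token.encode("utf-8")) % n_features
        vector[index] += 1
    return vector


docs = [
    "the cat sat on the mat",
    "a completely different vocabulary appears here",
]
vectors = [hashed_counts(doc) for doc in docs]

# Every document gets the same width, whatever words it contains.
print([len(v) for v in vectors])
```

The price is that distinct tokens can collide in the same column; a larger <code>n_features</code> makes collisions rarer at the cost of wider vectors.<br>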
If you really want to use random forest and have memory constraints, you might want to use n_jobs=1 to avoid memory copies,<br>
<br>
<a href="https://www.quora.com/Why-is-scikit-learns-random-forest-using-so-much-memory" rel="noreferrer" target="_blank">https://www.quora.com/Why-is-s<wbr>cikit-learns-random-forest-usi<wbr>ng-so-much-memory</a><br>
<br>
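A hedged sketch of what a memory-conscious forest could look like, combining <code>n_jobs=1</code> with the size limits (depth, leaves) suggested in the quoted message below. The data is synthetic and the parameter values are arbitrary assumptions for illustration, not tuned recommendations.<br>

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.ensemble import RandomForestClassifier

# Synthetic sparse data standing in for a bag-of-words matrix (assumption).
rng = np.random.RandomState(0)
X = sparse_random(200, 50, density=0.1, format="csr", random_state=rng)
y = rng.randint(0, 2, size=200)

forest = RandomForestClassifier(
    n_estimators=10,
    max_depth=8,          # cap how deep each tree may grow
    min_samples_leaf=5,   # stop splitting very small groups
    max_leaf_nodes=64,    # cap the number of leaves per tree
    n_jobs=1,             # single process, no parallel workers
    random_state=0,
)
forest.fit(X, y)

# Check that the depth limit was respected by every tree in the forest.
deepest = max(tree.tree_.max_depth for tree in forest.estimators_)
print(deepest)
```

Unconstrained trees grow until their leaves are pure, which on wide sparse text features can mean very large trees; the limits above bound the size of each one.<br>
<br>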
But as Joel was saying, random forest might not be the best choice for huge sparse arrays; NaiveBayes, LogisticRegression or SVM could be better suited, or gradient boosting if you want to go that way...<br>
<br>
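For reference, a minimal sketch of one of the linear alternatives mentioned above: a hashed bag-of-words representation fed into <code>LogisticRegression</code>. The four example texts and their labels are invented purely for illustration, and <code>n_features=2**12</code> is an arbitrary assumption.<br>

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up dataset, only to show the shape of the pipeline.
texts = [
    "refund my order please",
    "package arrived broken, want refund",
    "great product, works well",
    "love it, excellent quality",
]
labels = [0, 0, 1, 1]

clf = make_pipeline(
    HashingVectorizer(n_features=2**12),  # fixed-width sparse features
    LogisticRegression(),                 # linear model, works on sparse input
)
clf.fit(texts, labels)

predicted = clf.predict(["excellent product", "broken, refund please"])
print(predicted.shape)
```

A linear model stores one weight per feature and class, so its memory footprint stays small and predictable compared with a forest of deep trees.<br>
<br>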
<br>
On 16/03/17 02:44, Joel Nothman wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Trees are not a traditional choice for bag of words models, but you<br>
should make sure you are at least using the parameters of the random<br>
forest to limit the size (depth, branching) of the trees.<br>
<br>
On 16 March 2017 at 12:20, Sasha Kacanski &lt;<a href="mailto:skacanski@gmail.com" target="_blank">skacanski@gmail.com</a>&gt; wrote:<br>
<br>
Hi,<br>
As soon as the number of trees and features goes higher, 70 GB of RAM is<br>
gone and I am getting out-of-memory errors.<br>
The file size is 700 MB. The dataframe quickly shrinks from 14 to 2 columns,<br>
but there is a ton of text ...<br>
With 10 estimators and 100 features per word I can't tackle ~900k<br>
records ...<br>
The training set, about 15% of the data, does perfectly fine, but when the test<br>
set comes, that is it.<br>
<br>
I can split the data and multiprocess it, but I believe that will simply<br>
skew the results...<br>
<br>
Any ideas?<br>
<br>
<br>
--<br>
Aleksandar Kacanski<br>
<br>
______________________________<wbr>_________________<br>
scikit-learn mailing list<br>
<a href="mailto:scikit-learn@python.org" target="_blank">scikit-learn@python.org</a> <mailto:<a href="mailto:scikit-learn@python.org" target="_blank">scikit-learn@python.or<wbr>g</a>><br>
<a href="https://mail.python.org/mailman/listinfo/scikit-learn" rel="noreferrer" target="_blank">https://mail.python.org/mailma<wbr>n/listinfo/scikit-learn</a><br>
<<a href="https://mail.python.org/mailman/listinfo/scikit-learn" rel="noreferrer" target="_blank">https://mail.python.org/mailm<wbr>an/listinfo/scikit-learn</a>><br>
<br>
<br>
<br>
<br>
</blockquote>
<br>
</blockquote></div></div>