<div dir="auto">Thank you very much...<div dir="auto">I will try alternatives<br><br><div data-smartmail="gmail_signature" dir="auto">Sasha Kacanski </div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Mar 16, 2017 8:28 AM, "Roman Yurchak" <<a href="mailto:rth.yurchak@gmail.com">rth.yurchak@gmail.com</a>> wrote:<br type="attribution"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">If you run out of memory at the prediction step, splitting the test dataset in batches, then concatenating the results should work fine. Why would it "skew" the results?<br>
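<br>
To make the batching idea concrete, here is a minimal sketch. It assumes an already fitted estimator whose predict() scores each row independently of the other rows, which is why the concatenated output should match a single big call. The helper name predict_in_batches and the default batch_size are made up for illustration.<br>

```python
import numpy as np


def predict_in_batches(model, X, batch_size=10000):
    """Call model.predict on consecutive row slices of X and join the outputs.

    `model` is any fitted estimator exposing predict(); `batch_size` is an
    illustrative value to be tuned to the available memory.
    """
    n_rows = X.shape[0]
    if n_rows == 0:
        return np.array([])
    parts = []
    for start in range(0, n_rows, batch_size):
        # Row slicing works for NumPy arrays and SciPy CSR matrices alike.
        batch = X[start:start + batch_size]
        parts.append(model.predict(batch))
    # Batches are processed in order, so the result keeps the row order of X.
    return np.concatenate(parts)
```

Only one batch of predictions is computed at a time, so peak memory at the prediction step depends on batch_size rather than on the size of the whole test set.<br>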

<br>

70 GB of RAM seems huge: for comparison, here are some categorization benchmarks on a 700k text dataset that use on the order of 5-10 GB of RAM,<br>

    <a href="https://github.com/FreeDiscovery/FreeDiscovery/issues/58" rel="noreferrer" target="_blank">https://github.com/FreeDiscove<wbr>ry/FreeDiscovery/issues/58</a><br>

though with fairly short documents, for other algorithms and with a smaller training set.<br>

<br>

You could also try reducing the size of your dictionary with hashing.<br>
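<br>
A minimal sketch of the hashing suggestion, assuming scikit-learn's HashingVectorizer is an acceptable replacement for whatever vectorizer is used now. The toy documents and the n_features value are illustrative only, not a recommendation for this dataset.<br>

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Toy documents, purely illustrative.
docs = [
    "out of memory at the prediction step",
    "random forest on sparse text features",
    "hashing keeps the feature space at a fixed size",
]

# n_features caps the number of columns regardless of how many distinct
# tokens appear in the corpus; 2**18 is an arbitrary illustrative value.
vectorizer = HashingVectorizer(n_features=2**18)

# The vectorizer is stateless: no vocabulary is stored, so transform()
# can be called without a prior fit().
X = vectorizer.transform(docs)
print(X.shape)
```

The trade-off is that distinct tokens can collide in the same column, and the mapping from column back to token is lost.<br>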

If you really want to use random forest and have memory constraints, you might want to use n_jobs=1 to avoid memory copies,<br>

<br>

<a href="https://www.quora.com/Why-is-scikit-learns-random-forest-using-so-much-memory" rel="noreferrer" target="_blank">https://www.quora.com/Why-is-s<wbr>cikit-learns-random-forest-usi<wbr>ng-so-much-memory</a><br>

<br>

But as Joel was saying, random forest might not be the best choice for huge sparse arrays; NaiveBayes, LogisticRegression or SVM could be better suited, or gradient boosting if you want to go that way...<br>
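<br>
As a rough sketch of the linear-model route, assuming a TF-IDF bag of words fed to LogisticRegression. The documents and labels below are invented for illustration, and the default hyperparameters are not tuned for any real dataset.<br>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy data, only to show the shape of the workflow.
train_docs = [
    "server ran out of memory during prediction",
    "memory usage grows with the number of trees",
    "invoice for the annual subscription",
    "payment received for last month's invoice",
]
train_labels = ["memory", "memory", "billing", "billing"]

# The vectorizer produces a sparse matrix, which the linear model
# consumes directly without densifying it.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_docs, train_labels)

pred = clf.predict(["how much memory do the trees need", "send me the invoice"])
```

A linear model stores one coefficient per feature and class, so its size grows with the vocabulary rather than with the number and depth of trees.<br>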

<br>

<br>

On 16/03/17 02:44, Joel Nothman wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Trees are not a traditional choice for bag of words models, but you<br>

should make sure you are at least using the parameters of the random<br>

forest to limit the size (depth, branching) of the trees.<br>
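<br>
A minimal sketch of limiting tree size, assuming scikit-learn's RandomForestClassifier. The random sparse data and every parameter value below are illustrative only and would need tuning on the real dataset.<br>

```python
import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier

# Random sparse data standing in for a bag-of-words matrix (illustrative).
rng = np.random.RandomState(0)
X = sparse.random(200, 50, density=0.1, format="csr", random_state=rng)
y = rng.randint(0, 2, size=200)

forest = RandomForestClassifier(
    n_estimators=10,
    max_depth=8,          # cap on the depth of each tree
    min_samples_leaf=5,   # do not create leaves smaller than this
    max_leaf_nodes=64,    # cap on the number of leaves per tree
    random_state=0,
)
forest.fit(X, y)

# Depth of the deepest fitted tree, to check that the cap was applied.
deepest = max(est.tree_.max_depth for est in forest.estimators_)
print(deepest)
```

Without such limits the trees are grown until the leaves are pure, which on a large training set can produce very large trees.<br>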

<br>

On 16 March 2017 at 12:20, Sasha Kacanski &lt;<a href="mailto:skacanski@gmail.com" target="_blank">skacanski@gmail.com</a>&gt; wrote:<br>

<br>

    Hi,<br>

    As soon as the number of trees and features goes up, 70 GB of RAM is<br>

    gone and I am getting out-of-memory errors.<br>

    The file size is 700 MB. The dataframe quickly shrinks from 14 to 2 columns<br>

    but there is a ton of text ...<br>

    With 10 estimators and 100 features per word I can't handle ~900k<br>

    records ...<br>

    The training set, about 15% of the data, works perfectly fine, but once the test<br>

    set comes in, that is it.<br>

<br>

    I can split the data and process it in parallel, but I believe that will simply<br>

    skew the results...<br>

<br>

    Any ideas?<br>

<br>

<br>

    --<br>

    Aleksandar Kacanski<br>

<br>

    ______________________________<wbr>_________________<br>

    scikit-learn mailing list<br>

    <a href="mailto:scikit-learn@python.org" target="_blank">scikit-learn@python.org</a><br>

    <a href="https://mail.python.org/mailman/listinfo/scikit-learn" rel="noreferrer" target="_blank">https://mail.python.org/mailma<wbr>n/listinfo/scikit-learn</a><br>

<br>

<br>

<br>

<br>


<br>

</blockquote>

<br>


</blockquote></div></div>