[scikit-learn] best way to scale on the random forest for text w bag of words ...
Sasha Kacanski
skacanski at gmail.com
Thu Mar 16 10:23:36 EDT 2017
Thank you very much...
I will try alternatives
Sasha Kacanski
On Mar 16, 2017 8:28 AM, "Roman Yurchak" <rth.yurchak at gmail.com> wrote:
> If you run out of memory at the prediction step, splitting the test
> dataset into batches and then concatenating the results should work
> fine. Why would it "skew" the results?
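>
> For example, something along these lines (a sketch only; it assumes a
> fitted classifier clf and a sparse test matrix X_test, which are just
> illustrative names):
>
> import numpy as np
> batch_size = 10000
> preds = []
> for start in range(0, X_test.shape[0], batch_size):
>     # predict one slice at a time to bound peak memory
>     preds.append(clf.predict(X_test[start:start + batch_size]))
> y_pred = np.concatenate(preds)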
>
> 70 GB of RAM seems huge: for comparison, here are some categorization
> benchmarks on a 700k-document text dataset that use on the order of
> 5-10 GB of RAM,
> https://github.com/FreeDiscovery/FreeDiscovery/issues/58
> though with fairly short documents, other algorithms, and a smaller
> training set.
>
> You could also try reducing the size of your dictionary with hashing.
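>
> For instance (a rough sketch; the n_features value is just an
> assumption to tune):
>
> from sklearn.feature_extraction.text import HashingVectorizer
> # hash tokens into a fixed number of columns instead of storing a
> # growing vocabulary in memory
> vec = HashingVectorizer(n_features=2 ** 18)
> X = vec.transform(documents)  # documents: your iterable of raw texts
>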
> If you really want to use random forest and have memory constraints, you
> might want to use n_jobs=1 to avoid memory copies,
>
> https://www.quora.com/Why-is-scikit-learns-random-forest-using-so-much-memory
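>
> E.g. (a sketch; the other arguments are elided here):
>
> from sklearn.ensemble import RandomForestClassifier
> # a single worker avoids duplicating the training data across
> # joblib processes
> clf = RandomForestClassifier(n_jobs=1)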
>
> But as Joel was saying, random forest might not be the best choice for
> huge sparse arrays; NaiveBayes, LogisticRegression or SVM could be
> better suited, or gradient boosting if you want to go that way...
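>
> For instance (a sketch; X_train / y_train stand in for your own sparse
> bag-of-words matrix and labels):
>
> from sklearn.linear_model import LogisticRegression
> # linear models keep memory modest on large sparse term matrices
> clf = LogisticRegression()  # or MultinomialNB(), LinearSVC(), ...
> clf.fit(X_train, y_train)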
>
>
> On 16/03/17 02:44, Joel Nothman wrote:
>
>> Trees are not a traditional choice for bag-of-words models, but you
>> should make sure you are at least using the parameters of the random
>> forest to limit the size (depth, branching) of the trees.
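>>
>> A minimal sketch (the particular values are only illustrative):
>>
>> from sklearn.ensemble import RandomForestClassifier
>> # cap depth and leaf count so each tree stays small in memory
>> clf = RandomForestClassifier(n_estimators=10, max_depth=20,
>>                              max_leaf_nodes=1000, min_samples_leaf=5)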
>>
>> On 16 March 2017 at 12:20, Sasha Kacanski <skacanski at gmail.com> wrote:
>>
>> Hi,
>> As soon as the number of trees and features goes higher, 70 GB of RAM
>> is gone and I am getting out-of-memory errors.
>> The file size is 700 MB. The dataframe quickly shrinks from 14 to 2
>> columns, but there is a ton of text ...
>> With 10 estimators and 100 features per word I can't tackle ~900k
>> records ...
>> The training set, about 15% of the data, does perfectly fine, but when
>> the test set comes, that is it.
>>
>> I can split the data and multiprocess it, but I believe that will
>> simply skew the results...
>>
>> Any ideas?
>>
>>
>> --
>> Aleksandar Kacanski
>>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>