[scikit-learn] best way to scale on the random forest for text w bag of words ...

Roman Yurchak rth.yurchak at gmail.com
Thu Mar 16 08:25:44 EDT 2017


If you run out of memory at the prediction step, splitting the test 
dataset into batches and then concatenating the results should work fine. 
Why would it "skew" the results?
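
As a minimal sketch of the batching idea (the helper name 
predict_in_batches, the batch size and the synthetic sparse data are all 
assumptions made up for illustration, not something from your setup): 
each row is predicted independently of the others, so the concatenated 
batch predictions should match a single predict call.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.ensemble import RandomForestClassifier


def predict_in_batches(model, X, batch_size=10_000):
    """Predict on consecutive row slices of X and concatenate the results."""
    n_rows = X.shape[0]
    parts = [
        model.predict(X[start:start + batch_size])
        for start in range(0, n_rows, batch_size)
    ]
    return np.concatenate(parts)


# Synthetic sparse data standing in for a bag-of-words matrix (assumption).
rng = np.random.RandomState(0)
X_train = sparse_random(200, 50, density=0.1, format="csr", random_state=rng)
y_train = rng.randint(0, 2, size=200)
X_test = sparse_random(1000, 50, density=0.1, format="csr", random_state=rng)

clf = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0)
clf.fit(X_train, y_train)

# Compare batched prediction against predicting on the whole matrix at once.
batched = predict_in_batches(clf, X_test, batch_size=128)
full = clf.predict(X_test)
print(np.array_equal(batched, full))
```

Only one batch of predictions needs to be computed at a time, which is 
the point when the full test set does not fit comfortably in memory.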

70GB RAM seems huge: for comparison, here are some categorization 
benchmarks on a 700k text dataset that use on the order of 5-10 GB of 
RAM,
     https://github.com/FreeDiscovery/FreeDiscovery/issues/58
though with fairly short documents, for other algorithms and with a 
smaller training set.

You could also try reducing the size of your dictionary with feature 
hashing.
If you really want to use random forest and have memory constraints, you 
might want to use n_jobs=1 to avoid memory copies,
     https://www.quora.com/Why-is-scikit-learns-random-forest-using-so-much-memory
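
A rough sketch of both suggestions, assuming HashingVectorizer is used 
for the hashing step. The toy documents, labels, n_features=2**16 and 
max_depth=20 are arbitrary values chosen for illustration and would need 
tuning on real data; max_depth follows the advice in the quoted message 
below about limiting tree size.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import HashingVectorizer

# Toy documents and labels, made up for illustration.
docs = [
    "random forest runs out of memory",
    "hashing keeps the feature space fixed",
    "linear models handle sparse text well",
    "memory usage grows with the vocabulary",
]
labels = [1, 0, 0, 1]

# No vocabulary is stored: each token is hashed into one of 2**16 columns,
# so the number of features is fixed regardless of the corpus size.
vectorizer = HashingVectorizer(n_features=2**16)
X = vectorizer.transform(docs)  # scipy sparse matrix, shape (4, 65536)

# n_jobs=1 keeps training and prediction in a single process;
# max_depth caps how large each tree can grow.
clf = RandomForestClassifier(
    n_estimators=10, max_depth=20, n_jobs=1, random_state=0
)
clf.fit(X, labels)
print(X.shape, clf.predict(X[:2]))
```

A smaller n_features lowers memory use further, at the cost of more hash 
collisions between distinct tokens.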

But as Joel was saying, random forest might not be the best choice for 
huge sparse arrays; NaiveBayes, LogisticRegression or SVM could be better 
suited, or gradient boosting if you want to go that way...
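
For illustration, a minimal sketch of two of these alternatives on sparse 
text features. The pipelines, toy documents and labels are assumptions 
for the example, with default hyperparameters rather than tuned ones.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy documents and labels, made up for illustration.
docs = [
    "cheap pills special offer",
    "meeting agenda for monday",
    "special offer on cheap watches",
    "minutes from the monday meeting",
]
labels = [1, 0, 1, 0]

# Both models work directly on the sparse TF-IDF matrix without densifying it.
models = {
    "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "logistic_regression": make_pipeline(TfidfVectorizer(), LogisticRegression()),
}

predictions = {}
for name, model in models.items():
    model.fit(docs, labels)
    predictions[name] = model.predict(["cheap offer this monday"])
    print(name, predictions[name])
```

These models store roughly one weight per feature and class, so their 
memory use grows with the vocabulary size rather than with the number and 
depth of trees.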


On 16/03/17 02:44, Joel Nothman wrote:
> Trees are not a traditional choice for bag of words models, but you
> should make sure you are at least using the parameters of the random
> forest to limit the size (depth, branching) of the trees.
>
> On 16 March 2017 at 12:20, Sasha Kacanski <skacanski at gmail.com> wrote:
>
>     Hi,
>     As soon as the number of trees and features goes higher, 70 GB of RAM
>     is gone and I am getting out-of-memory errors.
>     The file size is 700 MB. The dataframe quickly shrinks from 14 to 2
>     columns, but there is a ton of text ...
>     With 10 estimators and 100 features per word I can't tackle ~900k
>     records ...
>     The training set, about 15% of the data, does perfectly fine, but when
>     the test set comes, that is it.
>
>     I can split stuff and multiprocess it, but I believe that will simply
>     skew the results...
>
>     Any ideas?
>
>
>     --
>     Aleksandar Kacanski
>
>     _______________________________________________
>     scikit-learn mailing list
>     scikit-learn at python.org
>     https://mail.python.org/mailman/listinfo/scikit-learn
>


