[scikit-learn] best way to scale on the random forest for text w bag of words ...
Roman Yurchak
rth.yurchak at gmail.com
Thu Mar 16 08:25:44 EDT 2017
If you run out of memory at the prediction step, splitting the test
dataset into batches and then concatenating the results should work fine.
Each sample is predicted independently of the others, so why would it
"skew" the results?
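
To illustrate the batching idea, here is a minimal sketch. The helper name
predict_in_batches and the default batch_size are made up for this example;
it only assumes that the model exposes a predict method and that X supports
row slicing (NumPy arrays and SciPy CSR matrices both do).

```python
import numpy as np


def predict_in_batches(model, X, batch_size=10000):
    """Call model.predict on consecutive row slices of X and join the results.

    Hypothetical helper, not part of scikit-learn.
    """
    n_rows = X.shape[0]
    parts = []
    for start in range(0, n_rows, batch_size):
        # Only one slice of the test set goes through predict at a time.
        parts.append(model.predict(X[start:start + batch_size]))
    if not parts:
        # Empty input: defer to the model for the shape of an empty result.
        return model.predict(X)
    return np.concatenate(parts)
```

Because predictions for one row do not depend on the other rows in the
batch, the concatenated output should match a single predict call on the
full matrix, as long as the model itself is deterministic at prediction time.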
70 GB of RAM seems huge. For comparison, here are some categorization
benchmarks on a 700k-document text dataset that use on the order of
5-10 GB of RAM,
https://github.com/FreeDiscovery/FreeDiscovery/issues/58
though with fairly short documents, with other algorithms and with a
smaller training set.
You could also try reducing the size of your dictionary with hashing.
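
As a rough sketch of what hashing could look like, assuming scikit-learn's
HashingVectorizer is what you would use for it: the toy documents and the
choice of n_features=2**18 below are illustrative only and would need tuning
on the real data.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Toy documents standing in for the real text column (assumption).
docs = [
    "the quick brown fox",
    "jumps over the lazy dog",
]

# The number of columns is fixed up front, so no vocabulary is kept in memory.
# alternate_sign=False keeps all feature values non-negative.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)

# HashingVectorizer is stateless, so transform can be called without fit.
X = vectorizer.transform(docs)
```

The result is a SciPy sparse matrix with exactly n_features columns,
regardless of how many distinct words appear in the corpus. The trade-off is
that different words can collide in the same column and that the mapping
from column back to word is lost.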
If you really want to use random forest and have memory constraints, you
might want to use n_jobs=1 to avoid memory copies,
https://www.quora.com/Why-is-scikit-learns-random-forest-using-so-much-memory
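
A minimal configuration sketch along those lines, combining n_jobs=1 with
the tree-size limits Joel mentions in the quoted message. The specific
values of max_depth and min_samples_leaf are placeholders, not
recommendations.

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=10,      # same number of trees as in the original question
    max_depth=20,         # placeholder: caps how deep each tree can grow
    min_samples_leaf=5,   # placeholder: stops splitting on very small nodes
    n_jobs=1,             # single process, as suggested above
    random_state=0,
)
```

Smaller trees mean a smaller fitted model, at some possible cost in accuracy,
so these limits are worth checking against a validation set.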
But as Joel was saying, random forest might not be the best choice for huge
sparse arrays; NaiveBayes, LogisticRegression or SVM could be better
suited, or gradient boosting if you want to go that way...
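
For instance, a linear model on sparse bag-of-words features could be
sketched as follows. The texts and labels are toy data invented for the
example, and the pipeline is one possible arrangement rather than a tuned
setup.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data (assumption): replace with the real documents and targets.
texts = [
    "cheap pills buy now",
    "limited offer buy cheap",
    "meeting agenda for monday",
    "project status and agenda",
]
labels = ["spam", "spam", "ham", "ham"]

# The features stay sparse end to end; LogisticRegression accepts
# SciPy sparse input directly.
clf = make_pipeline(
    HashingVectorizer(n_features=2**18),
    LogisticRegression(),
)
clf.fit(texts, labels)
predictions = clf.predict(texts)
```

The fitted linear model stores one coefficient per feature and class, so
its memory footprint is known in advance, unlike a forest whose trees grow
with the data.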
On 16/03/17 02:44, Joel Nothman wrote:
> Trees are not a traditional choice for bag of words models, but you
> should make sure you are at least using the parameters of the random
> forest to limit the size (depth, branching) of the trees.
>
> On 16 March 2017 at 12:20, Sasha Kacanski <skacanski at gmail.com> wrote:
>
> Hi,
> As soon as number of trees and features goes higher, 70Gb of ram is
> gone and i am getting out of memory errors.
> file size is 700Mb. Dataframe quickly shrinks from 14 to 2 columns
> but there is ton of text ...
> with 10 estimators and 100 features per word I can't tackle ~900 k
> of records ...
> Training set, about 15% of data does perfectly fine but when test
> come that is it.
>
> i can split stuff and multiprocess it but I believe that will simply
> skew results...
>
> Any ideas?
>
>
> --
> Aleksandar Kacanski
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org <mailto:scikit-learn at python.org>
> https://mail.python.org/mailman/listinfo/scikit-learn
> <https://mail.python.org/mailman/listinfo/scikit-learn>
>