[scikit-learn] best way to scale on the random forest for text w bag of words ...
Sasha Kacanski
skacanski at gmail.com
Thu Mar 16 10:23:36 EDT 2017
Thank you very much...
I will try alternatives
Sasha Kacanski
On Mar 16, 2017 8:28 AM, "Roman Yurchak" <rth.yurchak at gmail.com> wrote:
> If you run out of memory at the prediction step, splitting the test
> dataset into batches and then concatenating the results should work
> fine. Why would it "skew" the results?
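>
> For example, something along these lines (a sketch only; it assumes a
> fitted classifier clf and a sparse test matrix X_test, which are just
> illustrative names):
>
> import numpy as np
> batch_size = 10000
> preds = []
> for start in range(0, X_test.shape[0], batch_size):
>     # predict one slice at a time to bound peak memory
>     preds.append(clf.predict(X_test[start:start + batch_size]))
> y_pred = np.concatenate(preds)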
>
> 70 GB of RAM seems huge: for comparison, here are some categorization
> benchmarks on a 700k-document text dataset that use on the order of
> 5-10 GB of RAM,
> https://github.com/FreeDiscovery/FreeDiscovery/issues/58
> though with fairly short documents, other algorithms, and a smaller
> training set.
>
> You could also try reducing the size of your dictionary with hashing.
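>
> For instance (a rough sketch; the n_features value is just an
> assumption to tune):
>
> from sklearn.feature_extraction.text import HashingVectorizer
> # hash tokens into a fixed number of columns instead of storing a
> # growing vocabulary in memory
> vec = HashingVectorizer(n_features=2 ** 18)
> X = vec.transform(documents)  # documents: your iterable of raw texts
>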
> If you really want to use random forest and have memory constraints, you
> might want to use n_jobs=1 to avoid memory copies,
>
> https://www.quora.com/Why-is-scikit-learns-random-forest-using-so-much-memory
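>
> E.g. (a sketch; the other arguments are elided here):
>
> from sklearn.ensemble import RandomForestClassifier
> # a single worker avoids duplicating the training data across
> # joblib processes
> clf = RandomForestClassifier(n_jobs=1)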
>
> But as Joel was saying, random forest might not be the best choice for
> huge sparse arrays; NaiveBayes, LogisticRegression or SVM could be
> better suited, or gradient boosting if you want to go that way...
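>
> For instance (a sketch; X_train / y_train stand in for your own sparse
> bag-of-words matrix and labels):
>
> from sklearn.linear_model import LogisticRegression
> # linear models keep memory modest on large sparse term matrices
> clf = LogisticRegression()  # or MultinomialNB(), LinearSVC(), ...
> clf.fit(X_train, y_train)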
>
>
> On 16/03/17 02:44, Joel Nothman wrote:
>
>> Trees are not a traditional choice for bag-of-words models, but you
>> should make sure you are at least using the parameters of the random
>> forest to limit the size (depth, branching) of the trees.
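>>
>> A minimal sketch (the particular values are only illustrative):
>>
>> from sklearn.ensemble import RandomForestClassifier
>> # cap depth and leaf count so each tree stays small in memory
>> clf = RandomForestClassifier(n_estimators=10, max_depth=20,
>>                              max_leaf_nodes=1000, min_samples_leaf=5)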
>>
>> On 16 March 2017 at 12:20, Sasha Kacanski <skacanski at gmail.com> wrote:
>>
>> Hi,
>> As soon as the number of trees and features goes higher, 70 GB of RAM
>> is gone and I am getting out-of-memory errors.
>> The file size is 700 MB. The dataframe quickly shrinks from 14 to 2
>> columns, but there is a ton of text ...
>> With 10 estimators and 100 features per word I can't tackle ~900k
>> records ...
>> The training set, about 15% of the data, does perfectly fine, but when
>> the test set comes, that is it.
>>
>> I can split the data and multiprocess it, but I believe that will
>> simply skew the results...
>>
>> Any ideas?
>>
>>
>> --
>> Aleksandar Kacanski
>>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>