[scikit-learn] Memory efficient TfidfVectorizer

Peng Yu pengyu.ut at gmail.com
Tue Jan 28 06:26:34 EST 2020


> Are you concerned about storing the whole corpus text in memory, or the
> whole corpus' statistics? If the text, use input='file' or input='filename'
> (or a generator of texts).

I am not sure which stage takes the most memory, because my program
gets killed once it hits the memory limit. But I suspect it is the
latter (the whole-corpus statistics) that takes the most memory, since
I used n-grams with 1 <= n <= 3.
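
A rough way to check this suspicion would be to stream a sample of the
corpus and count how many distinct n-grams turn up for each n. The
sketch below uses only the standard library. The whitespace tokenizer,
the helper names (ngrams, vocabulary_sizes) and the toy corpus are
assumptions made for illustration. It does not reproduce
TfidfVectorizer's own tokenization.

```python
def ngrams(tokens, n):
    """Yield the n-grams of a token list as space-joined strings."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])


def vocabulary_sizes(texts, min_n=1, max_n=3):
    """Count distinct n-grams per n while streaming over an iterable of texts.

    Only the sets of distinct n-grams are kept in memory, not the texts.
    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    seen = {n: set() for n in range(min_n, max_n + 1)}
    for text in texts:
        tokens = text.lower().split()
        for n, grams in seen.items():
            grams.update(ngrams(tokens, n))
    return {n: len(grams) for n, grams in seen.items()}


def toy_corpus():
    # Hypothetical stand-in for a generator that reads documents from disk.
    yield "the cat sat on the mat"
    yield "the dog sat on the log"


sizes = vocabulary_sizes(toy_corpus())
print(sizes)  # {1: 7, 2: 8, 3: 7}
```

On a real sample, bigram and trigram counts far above the unigram count
would point to the vocabulary, rather than the raw text, as the main
memory cost.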

-- 
Regards,
Peng

