Memory-efficient TfidfVectorizer
Hi, To use TfidfVectorizer, the whole corpus must be loaded into memory. This can be a problem on machines without much memory. Is there a way to use only a small amount of memory by saving most intermediate results to disk? Thanks. -- Regards, Peng
Are you concerned about storing the whole corpus text in memory, or the whole corpus's statistics? If the text, use input='file' or input='filename' (or a generator of texts).

On Tue, 28 Jan 2020 at 18:01, Peng Yu <pengyu.ut@gmail.com> wrote:
> To use TfidfVectorizer, the whole corpus must be loaded into memory. Is there a way to use only a small amount of memory by saving most intermediate results to disk?
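A minimal sketch of the suggestion above: pass TfidfVectorizer a generator so documents are produced lazily instead of materializing the whole corpus as a list of strings (with input='filename' you would instead pass an iterable of paths, and each file is read on demand). The toy documents here are hypothetical stand-ins.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def stream_docs():
    # Hypothetical stand-in: in practice, yield texts read from disk
    # one at a time so only the current document is in memory.
    for text in ("the cat sat", "the dog ran", "cats chase dogs"):
        yield text

vectorizer = TfidfVectorizer()
# fit_transform makes a single pass over the stream
X = vectorizer.fit_transform(stream_docs())
print(X.shape[0])  # 3 documents
```

Note this only avoids holding the raw text; the learned vocabulary and the sparse tf-idf matrix still live in memory.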
> Are you concerned about storing the whole corpus text in memory, or the whole corpus's statistics? If the text, use input='file' or input='filename' (or a generator of texts).
I am not really sure which stage takes the most memory, as my program gets killed due to the memory limit. But I suspect it is the latter (the whole-corpus statistics) that takes the most memory? (I used 1 <= ngram <= 3.) -- Regards, Peng
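If the statistics are the problem, one commonly suggested alternative (not mentioned in this thread) is HashingVectorizer: it stores no vocabulary, so the memory for the feature mapping stays bounded even with ngram_range=(1, 3), and a TfidfTransformer can then add idf weighting. A hedged sketch on a toy corpus:

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

docs = ["the cat sat on the mat", "the dog sat", "cats and dogs play"]  # toy corpus

# Stateless hashing: no fit step, no vocabulary kept in memory.
# alternate_sign=False keeps counts non-negative for tf-idf weighting.
hasher = HashingVectorizer(ngram_range=(1, 3), n_features=2**18,
                           alternate_sign=False)
counts = hasher.transform(docs)

# Learn idf statistics over the hashed counts and apply tf-idf.
X = TfidfTransformer().fit_transform(counts)
print(X.shape)  # (3, 262144)
```

The trade-offs are hash collisions and the loss of a feature-name lookup; n_features controls the collision/memory balance.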
participants (2)
- Joel Nothman
- Peng Yu