[SciPy-User] How to read row_name, col_name, value format TSV into a sparse matrix?

Lingyi Hu lingyihuu at gmail.com
Wed Jan 29 06:11:03 EST 2020


Thomas,

Unless I'm misunderstanding, I think Peng Yu doesn't want to read the zeros
(or empty values) from the tsv file into memory. I'm pretty sure
pandas.read_csv reads all of your data into memory, zeros or not. As far as
I know, there is no option to read it directly into a sparse format (one
that stores only the positions and values of the nonzero entries). So that
doesn't solve the problem.

I think you can also read it in chunks, convert each chunk to a sparse
representation and concatenate the results. Note that df.to_sparse gives a
sparse DataFrame rather than a scipy sparse matrix, and it is deprecated as
of pandas 0.25. I'm not sure if you've seen this:
https://stackoverflow.com/questions/31888856/read-a-large-csv-into-a-sparse-pandas-dataframe-in-a-memory-efficient-way,
but it might also offer some useful insights.
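
In case it helps, here is a rough sketch of the chunked approach that goes
straight to a scipy sparse matrix instead of through a sparse DataFrame. It
assumes a headerless, tab-separated file with exactly three columns
(row_name, col_name, value). The function name read_triplets_to_csr, the
column labels and the chunk size are made up for illustration, and row and
column indices are simply assigned in order of first appearance. I haven't
tried it on a large file, so treat it as a starting point only.

```python
import io

import numpy as np
import pandas as pd
from scipy import sparse


def read_triplets_to_csr(source, chunksize=100_000):
    """Build a CSR matrix from a headerless (row_name, col_name, value) TSV.

    Hypothetical helper: the name, column labels and defaults are
    illustrative assumptions, not part of pandas or scipy.
    """
    row_ids, col_ids = {}, {}  # label -> integer index, by first appearance
    rows, cols, vals = [], [], []

    reader = pd.read_csv(
        source,
        sep="\t",
        header=None,
        names=["row", "col", "value"],
        dtype={"row": str, "col": str, "value": float},
        na_filter=False,      # keep labels such as "NA" as plain strings
        chunksize=chunksize,  # iterate over the file piece by piece
    )
    for chunk in reader:
        # Translate labels to integer indices, growing the maps as needed.
        rows.append(
            chunk["row"].map(lambda k: row_ids.setdefault(k, len(row_ids))).to_numpy()
        )
        cols.append(
            chunk["col"].map(lambda k: col_ids.setdefault(k, len(col_ids))).to_numpy()
        )
        vals.append(chunk["value"].to_numpy())

    # Only the listed entries are stored; duplicate (row, col) pairs are
    # summed when the COO matrix is converted to CSR.
    matrix = sparse.coo_matrix(
        (np.concatenate(vals), (np.concatenate(rows), np.concatenate(cols))),
        shape=(len(row_ids), len(col_ids)),
    ).tocsr()
    return matrix, list(row_ids), list(col_ids)


# Small in-memory example standing in for a real file path.
tsv = "r1\tc1\t1.5\nr1\tc3\t2.0\nr2\tc2\t-1.0\n"
matrix, row_names, col_names = read_triplets_to_csr(io.StringIO(tsv), chunksize=2)
print(matrix.shape, matrix.nnz)
print(row_names, col_names)
```

The label-to-index dictionaries and the per-chunk index arrays still live in
memory, so this scales with the number of listed entries and distinct
labels, not with the size of the dense matrix.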

On Wed, Jan 29, 2020 at 5:57 PM Thomas Kluyver <takowl at gmail.com> wrote:

> On Wed, 29 Jan 2020 at 09:34, Peng Yu <pengyu.ut at gmail.com> wrote:
>
>> Where is it documented that pandas.read_csv doesn't generate the whole
>> matrix? Is the return value either of the two?
>>
>
> It returns a 2D data structure mirroring the rows and columns of your CSV
> file - so the shape will be (n_entries, 3). It doesn't try to interpret them
> as referring to entries in a matrix - you have to do that as a separate step.
>
> It's probably not exactly documented like this, because documentation
> doesn't usually say what a function *doesn't* do, unless it's a very common
> confusion.
>
> Thomas