[Numpy-discussion] Data filtering with np.genfromtxt
Éric Depagne
eric at depagne.org
Tue Sep 24 10:19:08 EDT 2019
Hi all,
I am reading a large CSV file with 8.5 million lines and 216 columns using genfromtxt.
I'm not interested in all 216 columns, so I select the ones I need with the "usecols" and
"converters" parameters.
That works very well, but in my original file not all the columns I extract are filled with
values. As expected, genfromtxt replaces the missing entries with nan, so the final array
contains rows with nans.
I'd like to know if there is a way to filter out, at the genfromtxt level, the lines that
contain these nans, so that they never appear in my final array.
I'd like to have something like:
genfromtxt extracts the line using the parameters I need.
If the extracted line contains a NaN, skip it and process the next line.
If it has no NaNs, add it to the output array as usual.
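Something like the following sketch of what I have in mind. Since genfromtxt accepts any
iterable of lines, a generator could drop the incomplete lines before they are parsed at
all (the sample data, column indices, and helper name are made up for illustration):

```python
import io
import numpy as np

# Hypothetical sample data: 4 comma-delimited columns, one field left empty.
data = io.StringIO(
    "1,2,3,4\n"
    "5,6,,8\n"      # column 2 is empty -> this line should be skipped
    "9,10,11,12\n"
)

usecols = (0, 2)  # the columns I actually want (made-up indices)

def complete_lines(fh, usecols, delimiter=","):
    """Yield only the lines whose needed fields are all non-empty."""
    for line in fh:
        fields = line.rstrip("\n").split(delimiter)
        if all(fields[i].strip() != "" for i in usecols):
            yield line

# The filter runs before parsing, so nan rows never enter the array.
arr = np.genfromtxt(complete_lines(data, usecols),
                    delimiter=",", usecols=usecols)
print(arr)  # [[ 1.  3.]
            #  [ 9. 11.]]
```

This only skips lines where a needed field is empty; a field holding garbage text would
still parse to nan, so it is not a complete solution.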
I could of course remove from the array created by genfromtxt() all the rows that contain
nans (and x[~np.isnan(x).any(axis=1)] does that nicely), but I'd like to control the size
of the output array.
The idea is to get, for instance, the first 10000 (or any number) lines of the input file
for which all the columns I need are filled, not just the first 10000 lines.
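The only workaround I have so far reads the file in chunks via the max_rows parameter,
drops the nan rows from each chunk, and stops once enough complete rows have been
collected. A sketch, with made-up sample data and chunk sizes (and no guard yet for the
file running out before the target is reached):

```python
import io
import numpy as np

# Hypothetical data standing in for the big CSV; two lines have empty fields.
data = io.StringIO(
    "1,2\n"
    ",4\n"
    "5,6\n"
    "7,\n"
    "9,10\n"
)

target = 3       # number of complete rows I want in the output
chunk_size = 2   # lines parsed per genfromtxt call (made-up value)

pieces, collected = [], 0
while collected < target:
    # Repeated calls on the same file handle continue where the last stopped.
    chunk = np.atleast_2d(np.genfromtxt(data, delimiter=",",
                                        max_rows=chunk_size))
    good = chunk[~np.isnan(chunk).any(axis=1)]  # keep complete rows only
    pieces.append(good)
    collected += len(good)

result = np.vstack(pieces)[:target]
print(result)
```

It works, but it still parses and then discards the bad rows instead of filtering them at
the genfromtxt level, which is what I was hoping to avoid.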
I've found a few filtering examples on SO, but the ones I've found do not process the
extracted lines before the array is built.
Any help appreciated.
Éric.
--
Un clavier azerty en vaut deux
----------------------------------------------------------
Éric Depagne