[Numpy-discussion] Data filtering with np.genfromtxt

Éric Depagne eric at depagne.org
Tue Sep 24 10:19:08 EDT 2019


Hi all, 

I am reading a large CSV file, with 8.5 million lines and 216 columns, using genfromtxt.
I'm not interested in all of the 216 columns, so I select the ones I need using the "usecols" and 
"converters" parameters.

That works very well, but in my original large file, not all of the columns I extract are filled 
with values. As expected in these cases, genfromtxt replaces the missing values with nan, and thus, 
in the final array, there are rows that contain these nans. 
I'd like to know if there is a way to filter out, at the genfromtxt level, the lines that contain 
these nans, so that they do not appear in my final array. 

I'd like to have something like:
genfromtxt extracts the line using the parameters I need.
If the extracted line contains a NaN, do nothing and process the next line. 
If it has no NaNs, add it to the output array as usual.
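
To make the idea concrete, here is a minimal sketch of that behaviour done before genfromtxt sees the 
data rather than inside it. It relies on genfromtxt accepting a generator of lines. The helper name 
complete_lines, the sample data and the column indices are all made up for illustration, and it 
assumes that a missing value is simply an empty field, that no field contains a quoted delimiter, 
and that there is no header line:

```python
import io
import numpy as np

def complete_lines(lines, cols, delimiter=","):
    """Yield only the lines whose fields at the indices in `cols` are all non-empty.

    Hypothetical helper: it only detects empty fields, not other missing-value markers.
    """
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        # Skip short lines and lines with a blank field in a wanted column.
        if len(fields) > max(cols) and all(fields[c].strip() for c in cols):
            yield line

# Tiny stand-in for the real file: the second line has no value in column 2.
sample = io.StringIO("1,2,3\n4,5,\n7,,9\n")
cols = (0, 2)

# genfromtxt only ever receives the lines that passed the check.
arr = np.genfromtxt(complete_lines(sample, cols), delimiter=",", usecols=cols)
```

The third line is kept even though its column 1 is empty, because only the columns listed in cols 
are checked.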

I could of course remove from the array created by genfromtxt() all the rows that contain nans 
(x[~np.isnan(x).any(axis=1)] does it nicely), but I'd like to be able to get an output array of 
a given size. 
The idea is that I can get, for instance, the first 10000 (or any number of) lines of the input file 
that contain all the columns I need, not just the first 10000 lines.
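
A small example of the difference, as a sketch only: the data and n are invented, and the pre-filter 
assumes that a missing value shows up as an empty field and, for brevity, checks every field on the 
line rather than only the selected columns. Reading n lines and then masking gives fewer than n rows, 
while filtering the lines first and cutting the stream with itertools.islice gives exactly n (as long 
as the file has that many complete lines):

```python
import io
from itertools import islice
import numpy as np

text = "1,2\n3,\n5,6\n7,8\n9,10\n"   # the second line has an empty second field
n = 3

# Read-then-filter: the first n lines are read, then incomplete rows are dropped.
x = np.genfromtxt(io.StringIO(text), delimiter=",", max_rows=n)
x = x[~np.isnan(x).any(axis=1)]       # fewer than n rows are left

# Filter-then-read: drop incomplete lines first, then stop after n of them.
complete = (ln for ln in io.StringIO(text) if all(f.strip() for f in ln.split(",")))
y = np.genfromtxt(islice(complete, n), delimiter=",")

print(x.shape, y.shape)
```

Because the generator is lazy, the second variant stops reading the input as soon as n complete 
lines have been found.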

I've found a few examples on SO that do some filtering, but the ones I've found filter the raw 
input rather than the extracted lines.

Any help appreciated.

Éric.

-- 
Un clavier azerty en vaut deux
----------------------------------------------------------
Éric Depagne                            
