[Numpy-discussion] fromfile() for reading text (one more time!)

Christopher Barker Chris.Barker at noaa.gov
Thu Jan 7 16:45:41 EST 2010


Bruce Southey wrote:
>> <Chris.Barker at noaa.gov> wrote:

> Using the numpy NaN or similar (noting R's approach to missing values
> which in turn allows it to have the above functionality) is just a
> very bad idea for missing values because you always have to check that
> which NaN is a missing value and which was due to some numerical
> calculation.

Well, this is specific to reading files, so you know where it came from.
And the principle of fromfile() is that it is fast and simple; if you
want masked arrays, use slower but more full-featured methods.

However, in this case:

In [9]: np.fromstring("3, 4, NaN, 5", sep=",")
Out[9]: array([  3.,   4.,  NaN,   5.])


An actual NaN is read from the file, rather than a missing value.
Perhaps the user does want the distinction, so maybe it should really
only fill it in if the user asks for it, by specifying
"missing_value=np.nan" or something.
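
To make that distinction concrete, here is a minimal pure-Python sketch of
the semantics being proposed. It is not how fromfile() or fromstring() is
implemented; the function name parse_floats and the missing_value keyword
are assumptions made up for illustration. An empty field counts as missing
and is only filled in when the caller asks for it, while a literal "NaN" in
the text is always read as a real NaN.

```python
import math


def parse_floats(text, sep=",", missing_value=None):
    """Hypothetical helper: parse sep-delimited floats from a string.

    An empty field is "missing"; it is filled with missing_value if one
    was given, otherwise it is an error. A literal "NaN" is real data.
    """
    values = []
    for field in text.split(sep):
        field = field.strip()  # whitespace around a field is always stripped
        if field:
            values.append(float(field))  # float("NaN") yields an actual NaN
        elif missing_value is not None:
            values.append(missing_value)  # the caller opted in to filling
        else:
            raise ValueError("empty field and no missing_value given")
    return values


explicit = parse_floats("3, 4, NaN, 5")
filled = parse_floats("3, 4, , 5", missing_value=-1.0)
assert math.isnan(explicit[2])
assert filled == [3.0, 4.0, -1.0, 5.0]
```

Passing missing_value=float("nan") gives the "just put a NaN there"
behaviour, at the cost of no longer being able to tell the two cases apart.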

> From what I can see is that you expect that fromfile() should only
> split at the supplied delimiters, optionally(?) strip any whitespace

whitespace stripping is not optional.

> Your output from this string '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
> actually assumes multiple delimiters because there is no comma between
> 4 and 5 and 8 and 9.

Yes, that's the point. I thought about allowing arbitrary multiple 
delimiters, but I think '\n' is a special case - for instance, a comma 
at the end of some numbers might mean missing data, but a '\n' would not.

And I couldn't really think of a useful use-case for arbitrary multiple 
delimiters.

> In Josef's last case how many 'missing values' should there be?

 >> extra newlines at end of file
 >> str =  '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'

none -- exactly why I think \n is a special case.

What about:
 >> extra newlines in the middle of the file
 >> str =  '1, 2, 3, 4\n\n5, 6, 7, 8\n9, 10, 11, 12\n'

I think they should be ignored, but I hope I'm not making something that 
is too specific to my personal needs.
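
A rough pure-Python sketch of that rule, assuming a hypothetical helper
called parse_rows (not an existing numpy function): a newline always ends a
row, blank lines are skipped wherever they occur, and only the user-supplied
separator splits fields within a row.

```python
def parse_rows(text, sep=","):
    """Hypothetical helper: newline ends a row, sep splits fields in a row."""
    rows = []
    for line in text.split("\n"):
        if not line.strip():
            continue  # blank line: ignored, never counted as missing data
        # float() tolerates the whitespace left around each field
        rows.append([float(field) for field in line.split(sep)])
    return rows


plain = '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
at_end = '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'
in_middle = '1, 2, 3, 4\n\n5, 6, 7, 8\n9, 10, 11, 12\n'

# All three inputs give the same 3 x 4 result.
assert parse_rows(plain) == parse_rows(at_end) == parse_rows(in_middle)
```

In this sketch a trailing comma on a line still produces an empty field
(and float("") raises ValueError), so a comma can signal missing data
while a newline never does.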

Travis Oliphant wrote:
> +1 (ignoring new-lines transparently is a nice feature).  You can also  
> use sscanf with weave to read most files.

right -- but that requires weave. In fact, MATLAB has an fscanf function 
that allows you to pass in a C format string and it vectorizes it to use 
the same one over and over again until it's done. It's actually quite 
powerful and flexible. I once started with that in mind, but didn't have 
the C chops to do it. I ended up with a tool that only did doubles (come 
to think of it, MATLAB only does doubles, anyway...)
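
For anyone who hasn't used it, the "same format over and over" idea can be
imitated in a few lines of pure Python. This is only a sketch under stated
assumptions: scan_repeated is a made-up name, it supports just %d and %f
plus literal text, and it is regex-based rather than a real sscanf, so it
says nothing about how MATLAB or a future C/Cython version would do it.

```python
import re

# Regex fragment and converter for each conversion this sketch supports.
_CONVERSIONS = {
    "%d": (r"([-+]?\d+)", int),
    "%f": (r"([-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?)", float),
}


def scan_repeated(text, fmt):
    """Hypothetical fscanf-like helper: reuse fmt until it stops matching."""
    parts, converters = [], []
    for token in re.split(r"(%[df])", fmt):
        if token in _CONVERSIONS:
            regex, convert = _CONVERSIONS[token]
            parts.append(r"\s*" + regex)  # skip whitespace before a number
            converters.append(convert)
        elif token.strip():
            parts.append(r"\s*" + re.escape(token.strip()))  # literal text
    if not converters:
        raise ValueError("format needs at least one %d or %f")
    pattern = re.compile("".join(parts))

    values, pos = [], 0
    while True:
        match = pattern.match(text, pos)
        if match is None:
            break  # like fscanf, stop quietly at the first mismatch
        values.extend(c(g) for c, g in zip(converters, match.groups()))
        pos = match.end()
    return values


assert scan_repeated("1, 2.5\n3, 4.0\n", "%d, %f") == [1, 2.5, 3, 4.0]
```

The result is a flat list; reshaping it into rows is left to the caller.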

I may some day write a whole new C (or, more likely, Cython) function 
that does something like that, but for now, I'm just trying to get 
fromfile to be useful for me.


> +1   (much preferable to insert NaN or other user value than raise  
> ValueError in my opinion)

But raise an error for integer types?

I guess this is still up in the air -- no consensus yet.
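
The reason integer types are the awkward case is that there is no integer
NaN to insert. A small pure-Python illustration, where convert_field and its
parameters are hypothetical names and plain int/float stand in for numpy
dtypes: filling works for floats, works for integers only if the user
supplies an integer sentinel, and otherwise has to raise.

```python
import math


def convert_field(field, dtype=float, missing_value=None):
    """Hypothetical helper: convert one text field, filling it if empty."""
    field = field.strip()
    if field:
        return dtype(field)
    if missing_value is None:
        raise ValueError("empty field and no missing_value given")
    # For dtype=int this raises ValueError when missing_value is NaN,
    # because a NaN cannot be converted to an integer.
    return dtype(missing_value)


assert math.isnan(convert_field("", float, float("nan")))
assert convert_field("", int, -1) == -1
```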

Thanks,

-Chris


-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov


