[Numpy-discussion] fromfile() for reading text (one more time!)
Christopher Barker
Chris.Barker at noaa.gov
Thu Jan 7 16:45:41 EST 2010
Bruce Southey wrote:
>> <Chris.Barker at noaa.gov> wrote:
> Using the numpy NaN or similar (noting R's approach to missing values
> which in turn allows it to have the above functionality) is just a
> very bad idea for missing values because you always have to check
> which NaN is a missing value and which was due to some numerical
> calculation.
well, this is specific to reading files, so you know where it came from.
And the principle of fromfile() is that it is fast and simple; if you
want masked arrays, use the slower but more full-featured methods.
However, in this case:
In [9]: np.fromstring("3, 4, NaN, 5", sep=",")
Out[9]: array([ 3., 4., NaN, 5.])
An actual NaN is read from the file, rather than a missing value.
Perhaps the user does want the distinction, so maybe it should really
only fill it in if the user asks for it, by specifying
"missing_value=np.nan" or something.
> From what I can see, you expect that fromfile() should only
> split at the supplied delimiters, optionally(?) strip any whitespace
whitespace stripping is not optional.
> Your output from this string '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
> actually assumes multiple delimiters because there is no comma between
> 4 and 5 and 8 and 9.
Yes, that's the point. I thought about allowing arbitrary multiple
delimiters, but I think '\n' is a special case - for instance, a comma
at the end of some numbers might mean missing data, but a '\n' would not.
And I couldn't really think of a useful use-case for arbitrary multiple
delimiters.
> In Josef's last case, how many 'missing values' should there be?
>> extra newlines at end of file
>> str = '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'
none -- exactly why I think \n is a special case.
What about:
>> extra newlines in the middle of the file
>> str = '1, 2, 3, 4\n\n5, 6, 7, 8\n9, 10, 11, 12\n'
I think they should be ignored, but I hope I'm not making something that
is too specific to my personal needs.
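To make the newline rule concrete, here is a rough pure-Python sketch of the splitting behavior described above. It is not how fromfile() is implemented, and split_fields is a made-up name used only for illustration. It assumes that '\n' acts as an extra delimiter, that blank lines are skipped wherever they occur, and that an empty field produced by the real separator (such as a trailing comma) is kept as a possible missing value.

```python
def split_fields(text, sep=","):
    """Hypothetical sketch: split on sep, treating a newline as a special delimiter."""
    fields = []
    for line in text.split("\n"):
        if line.strip() == "":
            # Blank lines (in the middle or at the end) contribute nothing.
            continue
        # Whitespace around each field is always stripped.
        fields.extend(field.strip() for field in line.split(sep))
    return fields

# Josef's two cases: extra newlines at the end and in the middle.
at_end = split_fields('1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n')
in_middle = split_fields('1, 2, 3, 4\n\n5, 6, 7, 8\n9, 10, 11, 12\n')

# A trailing comma still yields an empty field, which could mean missing data.
trailing_comma = split_fields('1, 2,\n3')
```

Under these assumptions both of Josef's strings give the same twelve fields, while the trailing comma in the last example leaves one empty field between '2' and '3'.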
Travis Oliphant wrote:
> +1 (ignoring new-lines transparently is a nice feature). You can also
> use sscanf with weave to read most files.
right -- but that requires weave. In fact, MATLAB has an fscanf function
that allows you to pass in a C format string and it vectorizes it to use
the same one over and over again until it's done. It's actually quite
powerful and flexible. I once started with that in mind, but didn't have
the C chops to do it. I ended up with a tool that only did doubles (come
to think of it, MATLAB only does doubles, anyway...)
I may some day write a whole new C (or, more likely, Cython) function
that does something like that, but for now, I'm just trying to get
fromfile to be useful for me.
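For a rough idea of what "use the same format over and over" means, here is a sketch in Python. It swaps the C format string for a regular expression, so it is only an analogy to fscanf, not a port of it. The name scan_repeated and its converters argument are invented for this example.

```python
import re

def scan_repeated(text, pattern, converters):
    """Hypothetical sketch: apply one record pattern repeatedly over the text."""
    rows = []
    for match in re.finditer(pattern, text):
        # Convert each captured group with the matching converter.
        rows.append(tuple(conv(group)
                          for conv, group in zip(converters, match.groups())))
    return rows

# One record is: a name, an integer, and a float, separated by whitespace.
rows = scan_repeated("a 1 2.5\nb 2 3.5\n",
                     r"(\w+)\s+(\d+)\s+(\S+)",
                     (str, int, float))
```

One difference to keep in mind: re.finditer silently skips text that does not match the pattern, so a real tool would want to check for leftover input. NumPy's np.fromregex is built on a similar repeated-pattern idea.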
> +1 (much preferable to insert NaN or other user value than raise
> ValueError in my opinion)
But raise an error for integer types?
I guess this is still up in the air -- no consensus yet.
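To make the options concrete, here is a pure-Python sketch of one possible policy. It is not an existing NumPy feature: fromfile() and fromstring() have no missing_value argument, and parse_fields, convert, and missing_value are names made up for this illustration. The assumption is that an empty field is filled only when the caller supplies a value, and raises ValueError otherwise, which covers integer types because the caller has to pick an integer sentinel explicitly.

```python
import math

def parse_fields(text, sep=",", convert=float, missing_value=None):
    """Hypothetical helper (not a NumPy function): parse sep-delimited numbers."""
    values = []
    for field in text.split(sep):
        field = field.strip()  # whitespace stripping is not optional
        if field:
            values.append(convert(field))
        elif missing_value is None:
            # No fill value requested: treat the empty field as an error.
            raise ValueError("empty field and no missing_value supplied")
        else:
            values.append(missing_value)
    return values

# Empty field filled with NaN because the caller asked for it.
filled = parse_fields("3, 4, , 5", missing_value=math.nan)

# A literal NaN in the text is parsed as a real NaN, with no fill involved.
literal = parse_fields("3, 4, NaN, 5")

# Integers cannot hold NaN, so the caller chooses a sentinel.
ints = parse_fields("3, 4, , 5", convert=int, missing_value=-1)
```

Note that once parsing is done, the filled NaN and the literal NaN cannot be told apart, which is the concern Bruce raised. The resulting lists can be passed to np.array() if an array is wanted.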
Thanks,
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov