[Numpy-discussion] fromfile() for reading text (one more time!)

Bruce Southey bsouthey at gmail.com
Thu Jan 7 23:10:39 EST 2010

On Thu, Jan 7, 2010 at 3:45 PM, Christopher Barker
<Chris.Barker at noaa.gov> wrote:
> Bruce Southey wrote:
>>> <Chris.Barker at noaa.gov> wrote:
>> Using the numpy NaN or similar (noting R's approach to missing values
>> which in turn allows it to have the above functionality) is just a
>> very bad idea for missing values because you always have to check
>> which NaN is a missing value and which was due to some numerical
>> calculation.
> well, this is specific to reading files, so you know where it came from.

You can only know where it came from when you compare the original
array to the transformed one. Also a user has to check for missing
values or numpy has to warn a user that missing values are present
immediately after reading the data so the appropriate action can be
taken (like using functions that handle missing values appropriately).
That is my second problem with using codes (NaN, -99999, etc.) for
missing values.
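
A minimal sketch of that first point, assuming NaN was used as the
missing-value code when the data was read. The array contents and the
variable names are made up for illustration. Once a calculation adds its
own NaNs, the only way to tell the two kinds apart is to compare against
a mask taken from the original array:

```python
import numpy as np

# Hypothetical data as it might look right after reading a file in which
# the third entry was missing and was filled in with NaN.
raw = np.array([3.0, 4.0, np.nan, 0.0])

# Record where the missing values are *before* doing any arithmetic.
missing_at_read = np.isnan(raw)

# A calculation that creates a NaN of its own (0.0 / 0.0).
with np.errstate(invalid="ignore", divide="ignore"):
    transformed = raw / raw

# Without the saved mask, these two NaNs are indistinguishable.
nan_after = np.isnan(transformed)
computed_nan = nan_after & ~missing_at_read

print("NaNs after the calculation:", int(nan_after.sum()))
print("of which produced by the calculation:", int(computed_nan.sum()))
```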

> And the principle of fromfile() is that it is fast and simple, if you
> want masked arrays, use slower, but more full-featured methods.

So in that case it should fail with missing data.

> However, in this case:
> In [9]: np.fromstring("3, 4, NaN, 5", sep=",")
> Out[9]: array([  3.,   4.,  NaN,   5.])
> An actual NaN is read from the file, rather than a missing value.
> Perhaps the user does want the distinction, so maybe it should really
> only fill it in if the user asks for it, by specifying
> "missing_value=np.nan" or something.

Yes, that is my first problem with using predefined codes for missing
values: you do not always know what is going to occur in the data.
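
One way to keep the distinction, sketched here under stated assumptions:
track missing entries in a separate mask instead of a code.
parse_with_mask is a hypothetical helper, not part of numpy, and it is
pure Python, so it has none of the speed fromfile() is meant to offer.
An empty field counts as missing, while a literal "NaN" in the text
stays an ordinary value:

```python
import numpy as np

def parse_with_mask(text, sep=","):
    """Hypothetical helper: empty fields are masked, 'NaN' is kept as data."""
    tokens = [t.strip() for t in text.split(sep)]
    # True where the field was empty, i.e. genuinely missing.
    mask = np.array([t == "" for t in tokens], dtype=bool)
    # Masked slots get a placeholder 0.0 that is hidden by the mask.
    values = np.array([float(t) if t else 0.0 for t in tokens])
    return np.ma.masked_array(values, mask=mask)

arr = parse_with_mask("3, 4, NaN, , 5")
print(arr.mask)     # which entries were missing in the text
print(arr.count())  # number of entries that are not masked
```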

>> From what I can see, you expect that fromfile() should only
>> split at the supplied delimiters, optionally(?) strip any whitespace
> whitespace stripping is not optional.
>> Your output from this string '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
>> actually assumes multiple delimiters because there is no comma between
>> 4 and 5 and 8 and 9.
> Yes, that's the point. I thought about allowing arbitrary multiple
> delimiters, but I think '\n' is a special case - for instance, a comma
> at the end of some numbers might mean missing data, but a '\n' would not.
> And I couldn't really think of a useful use-case for arbitrary multiple
> delimiters.
>> In Josef's last case how many missing values should there be?
>  >> extra newlines at end of file
>  >> str =  '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'
> none -- exactly why I think \n is a special case.

What about '\r' and '\n\r'?

> What about:
>  >> extra newlines in the middle of the file
>  >> str =  '1, 2, 3, 4\n\n5, 6, 7, 8\n9, 10, 11, 12\n'
> I think they should be ignored, but I hope I'm not making something that
> is too specific to my personal needs.

Not really, it is more that I am being somewhat difficult to ensure I
understand what you actually need.

My problem with this is that you are reading one huge 1-D array (that
you can resize later) rather than a 2-D array with rows and columns
(which is what I deal with). But I agree that there could be an option
to treat '\n' or '\r' as a delimiter, though I think it should be
turned off by default.
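
A rough sketch of what such an option could look like. parse_flat and
its newline_is_sep flag are hypothetical names, not an existing numpy
API. With the flag off (the default), a line break inside the data is an
error. With it on, line breaks act as delimiters and blank lines are
skipped, so extra newlines in the middle or at the end add no missing
values:

```python
import numpy as np

def parse_flat(text, sep=",", newline_is_sep=False):
    """Hypothetical reader returning a 1-D float array from delimited text."""
    if newline_is_sep:
        # splitlines() handles '\n', '\r' and '\r\n'; blank lines are dropped.
        lines = [ln for ln in text.splitlines() if ln.strip()]
        text = sep.join(lines)
    # float() strips surrounding whitespace but rejects embedded line breaks.
    return np.array([float(t) for t in text.split(sep)])

text = "1, 2, 3, 4\n\n5, 6, 7, 8\n9, 10, 11, 12\n\n"

flat = parse_flat(text, newline_is_sep=True)  # one long 1-D array
table = flat.reshape(-1, 4)                   # rows and columns, if known

try:
    parse_flat(text)  # default: line breaks are not delimiters
    default_failed = False
except ValueError:
    default_failed = True
```

The reshape step only works when the number of columns is known in
advance, which is the limitation of the 1-D approach.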

> Travis Oliphant wrote:
>> +1 (ignoring new-lines transparently is a nice feature).  You can also
>> use sscanf with weave to read most files.
> right -- but that requires weave. In fact, MATLAB has a fscanf function
> that allows you to pass in a C format string and it vectorizes it to use
> the same one over and over again until it's done. It's actually quite
> powerful and flexible. I once started with that in mind, but didn't have
> the C chops to do it. I ended up with a tool that only did doubles (come
> to think of it, MATLAB only does doubles, anyway...)
> I may some day write a whole new C (or, more likely, Cython) function
> that does something like that, but for now, I'm just trying to get
> fromfile to be useful for me.
>> +1   (much preferable to insert NaN or other user value than raise
>> ValueError in my opinion)
> But raise an error for integer types?
> I guess this is still up in the air -- no consensus yet.
> Thanks,
> -Chris

You should have a corresponding value for ints because raising an
exception would be inconsistent with allowing floats to have a value.
If you must keep the user-defined dtype then, as Josef suggests, just
use some code, be it -999 or the most negative number supported by the
OS for the defined dtype, or just convert the ints into floats if the
user does not define a missing value code. It would be nice to either
return the number of missing values or display a warning indicating
how many occurred.
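
The behaviour suggested above could look roughly like this. parse_ints
and its missing argument are hypothetical, not numpy functions, and
np.iinfo(dtype).min stands in for "the most negative number" of the
dtype. If a code is given, the integer dtype is kept. If not, the data
is converted to floats with NaN. Either way the number of missing values
is returned and a warning is issued:

```python
import warnings
import numpy as np

def parse_ints(text, sep=",", missing=None, dtype=np.int64):
    """Hypothetical reader: returns (array, number_of_missing_values)."""
    tokens = [t.strip() for t in text.split(sep)]
    n_missing = sum(1 for t in tokens if t == "")
    if n_missing == 0:
        return np.array([int(t) for t in tokens], dtype=dtype), 0
    warnings.warn(f"{n_missing} missing value(s) found", stacklevel=2)
    if missing is None:
        # No code supplied: fall back to floats so NaN can mark the gaps.
        values = np.array([float(t) if t else np.nan for t in tokens])
    else:
        # Keep the requested integer dtype and fill gaps with the code.
        values = np.array([int(t) if t else missing for t in tokens],
                          dtype=dtype)
    return values, n_missing

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    sentinel = np.iinfo(np.int64).min
    as_int, n_int = parse_ints("1, 2, , 4", missing=sentinel)
    as_float, n_float = parse_ints("1, 2, , 4")
```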

