[Numpy-discussion] fromfile() for reading text (one more time!)

josef.pktd at gmail.com josef.pktd at gmail.com
Thu Jan 7 18:15:46 EST 2010


On Thu, Jan 7, 2010 at 4:45 PM, Christopher Barker
<Chris.Barker at noaa.gov> wrote:
> Bruce Southey wrote:
>>> <Chris.Barker at noaa.gov> wrote:
>
>> Using the numpy NaN or similar (noting R's approach to missing values
>> which in turn allows it to have the above functionality) is just a
>> very bad idea for missing values because you always have to check that
>> which NaN is a missing value and which was due to some numerical
>> calculation.
>
> well, this is specific to reading files, so you know where it came from.
> And the principle of fromfile() is that it is fast and simple, if you
> want masked arrays, use slower, but more full-featured methods.
>
> However, in this case:
>
> In [9]: np.fromstring("3, 4, NaN, 5", sep=",")
> Out[9]: array([  3.,   4.,  NaN,   5.])
>
>
> An actual NaN is read from the file, rather than a missing value.
> Perhaps the user does want the distinction, so maybe it should really
> only fill it in if the user asks for it, by specifying
> "missing_value=np.nan" or something.
>
>> From what I can see, you expect that fromfile() should only
>> split at the supplied delimiters, optionally(?) strip any whitespace
>
> whitespace stripping is not optional.
>
>> Your output from this string '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
>> actually assumes multiple delimiters because there is no comma between
>> 4 and 5 and 8 and 9.
>
> Yes, that's the point. I thought about allowing arbitrary multiple
> delimiters, but I think '\n' is a special case - for instance, a comma
> at the end of some numbers might mean missing data, but a '\n' would not.
>
> And I couldn't really think of a useful use-case for arbitrary multiple
> delimiters.
>
>> In Josef's last case, how many missing values should there be?
>
>  >> extra newlines at end of file
>  >> str =  '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'
>
> none -- exactly why I think \n is a special case.
>
> What about:
>  >> extra newlines in the middle of the file
>  >> str =  '1, 2, 3, 4\n\n5, 6, 7, 8\n9, 10, 11, 12\n'
>
> I think they should be ignored, but I hope I'm not making something that
> is too specific to my personal needs.
>
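
A minimal sketch of the newline handling described above, in plain Python rather than in fromfile() itself. The name split_numbers is hypothetical and only illustrates the rule: a newline ends a row but never implies a missing value, and blank lines are skipped wherever they occur.

```python
def split_numbers(text, sep=","):
    """Flatten delimited text into floats (hypothetical helper, not NumPy API)."""
    values = []
    for line in text.splitlines():
        if not line.strip():
            continue  # blank line: carries no data, so ignore it
        # float() tolerates surrounding whitespace; an empty field still
        # raises ValueError, so a trailing comma is not silently accepted.
        values.extend(float(field) for field in line.split(sep))
    return values


plain = '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
trailing = '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'
middle = '1, 2, 3, 4\n\n5, 6, 7, 8\n9, 10, 11, 12\n'

# All three inputs yield the same twelve values under this rule.
print(split_numbers(plain) == split_numbers(trailing) == split_numbers(middle))
```

Under this rule a comma followed directly by a newline is still an empty field, which keeps the "comma might mean missing data" reading intact.
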
> Travis Oliphant wrote:
>> +1 (ignoring new-lines transparently is a nice feature).  You can also
>> use sscanf with weave to read most files.
>
> right -- but that requires weave. In fact, MATLAB has a fscanf function
> that allows you to pass in a C format string and it vectorizes it to use
> the same one over and over again until it's done. It's actually quite
> powerful and flexible. I once started with that in mind, but didn't have
> the C chops to do it. I ended up with a tool that only did doubles (come
> to think of it, MATLAB only does doubles, anyway...)
>
> I may some day write a whole new C (or, more likely, Cython) function
> that does something like that, but for now, I'm just trying to get
> fromfile to be useful for me.
>
>
>> +1   (much preferable to insert NaN or other user value than raise
>> ValueError in my opinion)
>
> But raise an error for integer types?
>
> I guess this is still up in the air -- no consensus yet.

Raise an exception. I hate the silent cast of nan to integer zero: too
much debugging, and useless if there are real zeros.
(Or use some -999 kind of thing if user-defined nan codes are allowed,
but I just work with float if I expect nans/missing values.)
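
A minimal sketch of that behaviour, written in plain Python on top of NumPy rather than inside fromfile(). The function parse_row and its missing_value argument are hypothetical names, not an existing NumPy API: an empty field raises unless the caller supplies a fill value, and NaN is refused as a fill value for integer arrays.

```python
import numpy as np


def parse_row(text, sep=",", dtype=float, missing_value=None):
    """Parse one row of delimited numbers (hypothetical helper, not NumPy API)."""
    dtype = np.dtype(dtype)
    is_int = np.issubdtype(dtype, np.integer)
    # A float NaN cannot be stored in an integer array, so refuse it up front
    # instead of letting it turn into some arbitrary integer.
    if is_int and isinstance(missing_value, float) and np.isnan(missing_value):
        raise ValueError("NaN cannot mark missing values in an integer array")
    convert = int if is_int else float
    values = []
    for field in text.split(sep):
        field = field.strip()
        if field:
            values.append(convert(field))  # int("NaN") raises ValueError
        elif missing_value is None:
            raise ValueError("empty field and no missing_value given")
        else:
            values.append(missing_value)
    return np.array(values, dtype=dtype)


# Float data: the empty field is filled with NaN on request.
print(parse_row("3, 4, , 5", missing_value=np.nan))
# Integer data: only an explicit integer code such as -999 is accepted.
print(parse_row("3, 4, , 5", dtype=int, missing_value=-999))
```

With the default missing_value=None, the same input raises ValueError for both float and integer dtypes, so nothing is filled in silently.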

Josef

>
> Thanks,
>
> -Chris
>
> --
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> Chris.Barker at noaa.gov