[Numpy-discussion] fromfile() for reading text (one more time!)

Fri Jan 8 20:15:43 EST 2010

On Fri, Jan 8, 2010 at 5:12 PM, Christopher Barker
<Chris.Barker at noaa.gov> wrote:
> Bruce Southey wrote:
>> Also a user has to check for missing
>> values or numpy has to warn a user
>
> I think warnings are next to useless for all but interactive work -- so
> I don't want to rely on them
>
>> that missing values are present
>> immediately after reading the data so the appropriate action can be
>> taken (like using functions that handle missing values appropriately).
>> That is my second problem with using codes (NaN, -99999 etc)  for
>> missing values.
>
> But I think you're right -- if someone writes code, tests with good
> input, then later runs it with input that has missing values, they are
> likely never to have bothered to test for missing values.
>
> So I think missing values should only be replaced by something if the
> user specifically asks for it.
>
>>> And the principle of fromfile() is that it is fast and simple, if you
>>> want masked arrays, use slower, but more full-featured methods.
>>
>> So in that case it should fail with missing data.
>
> Well, I'm not so sure -- the point is performance, no reason not to have
> high performing code that handles missing data.
>
>> What about '\r' and '\n\r'?
>
> I have thought about that -- I'm hoping that python's text file reading
> will just take care of it, but as we're working with C file handles here
> (I think), I guess not. '\n\r' is easy -- the '\r' is just extra
> whitespace. A lone '\r' is another case to handle.
>
>
>> My problem with this is that you are reading one huge 1-D array  (that
>> you can resize later) rather than a 2-D array with rows and columns
>> (which is what I deal with).
>
> That's because fromfile() is not designed to be row-oriented at all,
> and the binary read certainly isn't. I'm just trying to make this easy
> -- though it's not turning out that way!
>
>> But I agree that you can have an option
>> to say treat '\n' or '\r' as a delimiter but I think it should be
>> turned off by default.
>
> that's what I've done.
>
>> You should have a corresponding value for ints because raising an
>> exception would be inconsistent with allowing floats to have a value.
>
> I'm not sure I care, really -- but I think having the user specify the
> fill value is the best option, anyway.
>
> josef.pktd at gmail.com wrote:
>>>> none -- exactly why I think \n is a special case.
>>> What about '\r' and '\n\r'?
>>
>> Yes, I forgot about this, and it will be the most common case for
>> Windows users like myself.
>>
>> I think \r should be stripped automatically, like in non-binary
>> reading of files in python.
>
> except for folks like me that have old mac files lying around...so I
> want this to work like "Universal newlines" support.
>
>> A warning would be good, but doing np.any(np.isnan(x)) or
>> np.isnan(x).sum() on the result is always a good idea for a user when
>> missing values are a possibility.
>
> right, but the issue is the user has to know that they are possible, and
> we all know how carefully we all read docs!
>
> Thanks for your input -- I think I know what I'd like to do, but it's
> proving less than trivial to do it, so we'll see.
>
> In short:
>
> 1) optionally allow newlines to serve as a delimiter, so large tables
> can be read.
>
> 2) raise an exception for missing values, unless:
>   3) the user specifies a fill value of their choice (compatible with
> the chosen data type).
>
>
> -Chris
>
>

I fully agree with your approach!
Thanks for considering my thoughts!
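
For what it's worth, here is a rough pure-Python sketch of the three points
in the summary above, with '\r\n' and lone '\r' folded into '\n' first. It is
only an illustration of the intended behaviour, not the C code behind
fromfile(); the helper name parse_text and its arguments (newline_is_sep,
fill) are made up for this example and are not part of numpy.

```python
import numpy as np


def parse_text(text, sep=",", newline_is_sep=False, fill=None, dtype=float):
    """Hypothetical helper (NOT np.fromfile): parse delimited text to a 1-D array.

    dtype is assumed to be a plain Python type such as float or int.
    """
    # "Universal newlines": treat '\r\n' and a lone '\r' like '\n'.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    if newline_is_sep:
        # (1) optionally let a newline act as one more delimiter
        text = text.strip("\n").replace("\n", sep)
    values = []
    for i, field in enumerate(text.split(sep)):
        field = field.strip()
        if field == "":
            if fill is None:
                # (2) missing value and the user did not ask for a fill value
                raise ValueError("missing value in field %d" % i)
            # (3) user-chosen fill value; dtype() below raises if it is
            # not compatible with the chosen type (e.g. int(nan))
            field = fill
        values.append(dtype(field))
    return np.array(values, dtype=dtype)


# A 2 x 3 table with Windows line endings and one empty field.
x = parse_text("1,2,3\r\n4,,6\r\n", newline_is_sep=True, fill=float("nan"))
table = x.reshape(-1, 3)          # the result is 1-D; the caller reshapes
has_missing = np.isnan(x).any()   # the check suggested earlier in the thread
```

Without fill, the same input raises ValueError instead of quietly inserting
a code, and with newline_is_sep left off a newline between two numbers is
simply a parse error.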

Bruce