[Numpy-discussion] fromfile() for reading text (one more time!)

Thu Jan 7 16:11:12 EST 2010

On Jan 7, 2010, at 2:32 PM, josef.pktd at gmail.com wrote:

> On Thu, Jan 7, 2010 at 3:08 PM, Christopher Barker
> <Chris.Barker at noaa.gov> wrote:
>> Pauli Virtanen wrote:
>>> ma, 2010-01-04 kello 17:05 -0800, Christopher Barker kirjoitti:
>>>> it also does odd things with spaces
>>>> embedded in the separator:
>>>>
>>>> ", $ #" matches all of:  ",$#"   ", $#"  ",$ #"
>>
>>> That's a documented feature:
>>
>> Fair enough.
>>
>> OK, I've written a patch that allows newlines to be interpreted as
>> separators in addition to whatever is specified in sep.
>>
>> In the process of testing, I again found these issues, which are still
>> marked as "needs decision".
>>
>> http://projects.scipy.org/numpy/ticket/883
>>
>> In short: what to do with missing values?
>>
>> I'd like to address this bug, but I need a decision to do so.
>>
>>
>> My proposal:
>>
>> Raise a ValueError on missing values.
>>
>>
>> Justification:
>>
>> No function should EVER return data that is not there. Period. It is
>> simply asking for hard-to-find bugs. Therefore:
>>
>> fromstring("3, 4,,5", sep=",")
>>
>> Should never, ever, return:
>>
>> array([ 3.,  4.,  0.,  5.])
>>
>> Which is what it does now. bad. bad. bad.
>>
>>
>>
>>
>> Alternatives:
>>
>>   A) Raising a ValueError is the easiest way to get "proper" behavior.
>> Folks can use a more sophisticated file reader if they want missing
>> values handled. I'm willing to contribute this patch.
>>
>>   B) If the dtype is a floating point type, NaN could fill in the
>> missing values -- a fine idea, but you can't use it for integers, and
>> zero is a really bad replacement!
>>
>>   C) The user could specify what they want filled in for missing
>> values. This is a fine idea, though I'm not sure I want to take the time
>> to implement it.
>>
>> Oh, and this is a bug too, with probably the same solution:
>>
>> In [20]: np.fromstring("hjba", sep=',')
>> Out[20]: array([ 0.])
>>
>> In [26]: np.fromstring("34gytf39", sep=',')
>> Out[26]: array([ 34.])
>>
>>
>> One more unresolved question:
>>
>> what should:
>>
>> np.fromstring("3, 4, 5,", sep=",")
>>
>> return?
>>
>> it currently returns:
>>
>> array([ 3.,  4.,  5.])
>>
>> which seems a bit inconsistent with missing-value handling. I also found
>> a bug:
>>
>> In [6]: np.fromstring("3, 4, 5 , ", sep=",")
>> Out[6]: array([ 3.,  4.,  5.,  0.])
>>
>> so if there is some extra whitespace in there, it does return a missing
>> value. With my proposal, that wouldn't happen, but you might get an
>> exception. I think you should, but it'll be easier to implement my
>> "allow newlines" code if not.
>>
>>
>> so, should I do (A) ?
>>
>>
>> Another question:
>>
>> I've got a patch mostly working (except for the above issues) that will
>> allow fromfile/string to read multiline non-whitespace separated data in
>> one shot:
>>
>>
>> In [15]: str
>> Out[15]: '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'
>>
>> In [16]: np.fromstring(str, sep=',', allow_newlines=True)
>> Out[16]:
>> array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,
>>         12.])
>>
>>
>> I think this is a very helpful enhancement, and, as it is a new kwarg,
>> backward compatible:
>>
>> 1) Might it be accepted for inclusion?
>>
>> 2) Is the name for the flag OK: "allow_newlines"? It's pretty explicit,
>> but also long -- I used it for the flag name in the C code, too.
>>
>> 3) What C datatype should I use for a boolean flag? I used a char, but I
>> don't know what the numpy standard is.
>>
>>
>> -Chris
>>
>>
>
> I don't know much about this, just a few more test cases
>
> comma and newline
> str =  '1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12'
>
> extra comma at end of file
> str =  '1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12,'
>
> extra newlines at end of file
> str =  '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n'
>
> It would be nice if these cases would go through without missing
> values or exception, but I don't often have files that are clean
> enough for fromfile().

+1 (ignoring newlines transparently is a nice feature). You can also
use sscanf with weave to read most files.
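
For illustration only, here is a rough pure-Python sketch of the behavior
josef's test cases ask for. It is not the proposed C patch and not how
fromfile()/fromstring() work internally; the name parse_multiline is
hypothetical, and it assumes sep is a non-whitespace string. Newlines act as
extra separators, one trailing separator per line is tolerated, and blank
lines are skipped:

```python
def parse_multiline(text, sep=","):
    """Hypothetical helper: parse sep-separated floats spread over several lines."""
    values = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines, e.g. extra newlines at end of file
        if line.endswith(sep):
            line = line[: -len(sep)]  # tolerate a single trailing separator
        # float() ignores surrounding whitespace and raises ValueError
        # on an empty or malformed field, so interior gaps are not hidden.
        values.extend(float(field) for field in line.split(sep))
    return values


cases = [
    "1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12",     # comma and newline
    "1, 2, 3, 4,\n5, 6, 7, 8,\n9, 10, 11, 12,",    # extra comma at end of file
    "1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12\n\n\n",  # extra newlines at end of file
]
results = [parse_multiline(case) for case in cases]
```

Under these assumptions all three inputs give the same twelve values, while
an interior gap such as "3, 4,,5" still raises ValueError.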

>
> I'm in favor of nan for missing values with floating point numbers. It
> would make it easy to read correctly formatted csv files, even if the
> data is not complete.

+1   (much preferable to insert NaN or another user-specified value than to
raise ValueError, in my opinion)
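
To make the options concrete, here is a small pure-Python sketch of the
policies discussed above: raise on a missing value (A), or fill with NaN or
any other user-supplied value (B/C). The function parse_fields and its
missing argument are hypothetical names for illustration, not an existing or
proposed numpy API:

```python
import math

_RAISE = object()  # sentinel meaning "no fill value given, so raise"


def parse_fields(text, sep=",", missing=_RAISE):
    """Hypothetical helper: parse sep-separated floats with a missing-value policy."""
    out = []
    for field in text.split(sep):
        field = field.strip()
        if field:
            # float() raises ValueError on garbage such as "hjba" or "34gytf39"
            out.append(float(field))
        elif missing is _RAISE:
            raise ValueError("missing value in input: %r" % text)
        else:
            out.append(missing)  # user-chosen fill, e.g. NaN
    return out


filled = parse_fields("3, 4,,5", missing=float("nan"))
```

With a NaN fill, filled has four entries and the third is NaN; with no fill
value the same call raises ValueError. Note that in this sketch a trailing
separator, as in "3, 4, 5,", also counts as a missing value, which is exactly
the open question about trailing separators raised earlier.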

-Travis