[Numpy-discussion] fromfile() for reading text (one more time!)

Thu Jan 7 15:08:23 EST 2010

Pauli Virtanen wrote:
> On Mon, 2010-01-04 at 17:05 -0800, Christopher Barker wrote:
>> it also does odd things with spaces
>> embedded in the separator:
>>
>> ", $ #" matches all of:  ",$#"   ", $#"  ",$ #"

> That's a documented feature:

Fair enough.

OK, I've written a patch that allows newlines to be interpreted as 
separators in addition to whatever is specified in sep.

While testing, I ran into these issues again; they are still 
marked as "needs decision".

http://projects.scipy.org/numpy/ticket/883

In short: what to do with missing values?

I'd like to address this bug, but I need a decision to do so.

My proposal:

Raise a ValueError when values are missing.

Justification:

No function should EVER return data that is not there. Period. It is 
simply asking for hard-to-find bugs. Therefore:

fromstring("3, 4,,5", sep=",")

Should never, ever, return:

array([ 3.,  4.,  0.,  5.])

which is what it does now. Bad. Bad. Bad.

Alternatives:

   A) Raising a ValueError is the easiest way to get "proper" behavior. 
Folks can use a more sophisticated file reader if they want missing 
values handled. I'm willing to contribute this patch.

   B) If the dtype is a floating point type, NaN could fill in the 
missing values -- a fine idea, but you can't use it for integers, and 
zero is a really bad replacement!

   C) The user could specify what they want filled in for missing 
values. This is a fine idea, though I'm not sure I want to take the time 
to implement it.
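
To make the three alternatives concrete, here is a minimal pure-Python 
sketch of the semantics, not the numpy C code. The name parse_strict and 
its "missing" parameter are hypothetical and exist only for illustration: 
"raise" corresponds to (A), "nan" to (B), and any other value is used as 
the fill value as in (C).

```python
import math


def parse_strict(text, sep=",", missing="raise"):
    """Parse sep-delimited floats.

    missing="raise" -> ValueError on an empty field   (alternative A)
    missing="nan"   -> fill empty fields with NaN     (alternative B)
    missing=<value> -> fill empty fields with <value> (alternative C)
    """
    values = []
    for index, token in enumerate(text.split(sep)):
        token = token.strip()
        if token:
            values.append(float(token))
        elif missing == "raise":
            raise ValueError(f"missing value in field {index}")
        elif missing == "nan":
            values.append(math.nan)
        else:
            values.append(float(missing))
    return values
```

In this sketch a trailing separator also produces an empty final field, 
so "3, 4, 5," raises under (A) as well.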

Oh, and this is a bug too, with probably the same solution:

In [20]: np.fromstring("hjba", sep=',')
Out[20]: array([ 0.])

In [26]: np.fromstring("34gytf39", sep=',')
Out[26]: array([ 34.])
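
The two results above look like the parser converts whatever numeric 
prefix it finds and then quietly stops. A rough pure-Python sketch of the 
behavior I would expect instead, where a field must convert as a whole or 
the call fails (parse_whole_tokens is a hypothetical name, not a numpy 
function):

```python
def parse_whole_tokens(text, sep=","):
    """Convert every sep-delimited field, rejecting partially numeric ones."""
    values = []
    for token in text.split(sep):
        try:
            # float() tolerates surrounding whitespace but rejects a field
            # that is only partly numeric, such as "34gytf39".
            values.append(float(token))
        except ValueError:
            raise ValueError(f"not a number: {token.strip()!r}") from None
    return values
```

With this, both "hjba" and "34gytf39" raise a ValueError rather than 
returning an array that looks valid.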

One more unresolved question:

what should:

np.fromstring("3, 4, 5,", sep=",")

return?

it currently returns:

array([ 3.,  4.,  5.])

which seems a bit inconsistent with missing value handling. I also found 
a bug:

In [6]: np.fromstring("3, 4, 5 , ", sep=",")
Out[6]: array([ 3.,  4.,  5.,  0.])

so if there is some extra whitespace in there, it does return a missing 
value. With my proposal you would never get that spurious 0., but you 
might get an exception instead. I think you should, but my "allow 
newlines" code would be easier to implement if a trailing separator 
were not an error.

so, should I do (A) ?
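
One possible policy for the trailing-separator question, sketched in 
pure Python: tolerate a single trailing separator (with or without 
whitespace after it) and treat every other empty field as an error. The 
function name parse_fields and the allow_trailing_sep flag are 
hypothetical, chosen only for this example.

```python
def parse_fields(text, sep=",", allow_trailing_sep=True):
    """Parse floats, optionally ignoring one trailing separator."""
    fields = [field.strip() for field in text.split(sep)]
    # A trailing separator shows up as one empty final field; drop it
    # if the caller allows that.
    if allow_trailing_sep and len(fields) > 1 and fields[-1] == "":
        fields.pop()
    for index, field in enumerate(fields):
        if field == "":
            raise ValueError(f"missing value in field {index}")
    return [float(field) for field in fields]
```

Under this policy "3, 4, 5," and "3, 4, 5 , " give the same three 
values, while "3, 4,,5" still raises.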

Another question:

I've got a patch mostly working (except for the above issues) that will 
allow fromfile/string to read multiline non-whitespace separated data in 
one shot:

In [15]: str
Out[15]: '1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12'

In [16]: np.fromstring(str, sep=',', allow_newlines=True)
Out[16]:
array([  1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.,  11.,
         12.])
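
For anyone who wants the intended semantics spelled out, here is a rough 
pure-Python equivalent. It is only a sketch of the idea, not the patch 
itself; parse_multiline is a hypothetical name, and only the 
allow_newlines keyword comes from the patch.

```python
import re


def parse_multiline(text, sep=",", allow_newlines=False):
    """Parse floats, optionally treating newlines as extra separators."""
    pattern = re.escape(sep)
    if allow_newlines:
        # Split on the separator OR on a line ending.
        pattern = f"(?:{pattern}|\\r?\\n)"
    return [float(token) for token in re.split(pattern, text)]


data = "1, 2, 3, 4\n5, 6, 7, 8\n9, 10, 11, 12"
values = parse_multiline(data, sep=",", allow_newlines=True)
```

Note that in this sketch a line ending in a separator followed by a 
newline produces an empty field, which is where the flag runs into the 
missing-value question above.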

I think this is a very helpful enhancement, and, as it is a new kwarg, 
backward compatible:

1) Might it be accepted for inclusion?

2) Is the name for the flag OK: "allow_newlines"? It's pretty explicit, 
but also long -- I used it for the flag name in the C code, too.

3) What C datatype should I use for a boolean flag? I used a char, but I 
don't know what the numpy standard is.

-Chris

-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov