[Numpy-discussion] parsing text strings/files in fromfile, fromstring

Sun May 24 19:28:05 EDT 2009

Sun, 24 May 2009 14:29:42 -0600, Charles R Harris wrote:
> I am trying to put together some rule for parsing text strings/files in
> fromfile, fromstring so that the two are consistent. Tickets relevant to
> this are #1116 <http://projects.scipy.org/numpy/ticket/1116> and
> #883<http://projects.scipy.org/numpy/ticket/883>. The question here is
> the interpretation of the separators, not the parsing of the numbers
> themselves. Below is the current behavior of fromstring, fromfile, and
> python split for content of "", "1", "1 1", " " respectively.

It should return only the data that's in the file, no extra elements. The 
current behavior is a bug, IMHO, especially so since the default value is 
uninitialized IIRC.

So,

	fromstring("", sep=" ") -> []
	fromstring(" ", sep=" ") -> []
	fromstring("1 ", sep=" ") -> [1]

fromfile should behave identically.
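
For reference, a rough pure-Python model of the rule above for a whitespace separator. This is only a sketch, not the C implementation, and the name `parse_whitespace_separated` is made up for illustration:

```python
def parse_whitespace_separated(text):
    """Hypothetical model of sep=" ": return only the numbers present."""
    # str.split() with no argument splits on runs of whitespace and drops
    # leading/trailing whitespace, so no empty fields are produced.
    return [float(field) for field in text.split()]


print(parse_whitespace_separated(""))    # []
print(parse_whitespace_separated(" "))   # []
print(parse_whitespace_separated("1 "))  # [1.0]
```

The point is just that empty or whitespace-only input produces no elements, and a trailing separator does not add one.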

Another question is what to do with malformed input: whether to attempt
best-effort parsing or to bail out. I'd suggest bailing out when
encountering bad data rather than guessing:

	fromstring("1,2,,3", sep=",") -> [1,2] or ValueError

Currently, something horrible happens:

	>>> np.fromstring('1,2,,3,,,6', sep=',')
	array([ 1.,  2., -1.,  3., -1., -1.,  6.])

Also, on second thoughts, the idea about raising a warning on malformed 
input seems more repulsive the more I think about it. Warnings are a bit 
nasty to catch, spam stderr if uncaught, and IMHO should not be a part of 
"business as usual" code paths. Having malformed input is business as 
usual :)
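
To illustrate what I mean by "nasty to catch": a caller who wants to detect malformed input through a warning needs roughly the following. `parse_with_warning` is a stand-in written for this example, not a NumPy function, and the use of UserWarning is an assumption:

```python
import warnings


def parse_with_warning(text, sep=","):
    """Stand-in parser (not NumPy): warn and stop at the first bad field."""
    values = []
    for field in text.split(sep):
        try:
            values.append(float(field))
        except ValueError:
            warnings.warn("malformed input: %r" % field, UserWarning)
            break
    return values


# Detecting the problem programmatically means recording warnings and
# overriding the filters, instead of a plain try/except.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = parse_with_warning("1,2,x,4")

had_bad_input = any(issubclass(w.category, UserWarning) for w in caught)
```

If the caller does none of this, the message just goes to stderr.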

In some sense, it would be simpler if `fromfile` and `fromstring` would 
be defined so that they read *at most* `count` entries, and return what 
they got by parsing the leftmost valid part. This could be implemented by 
fixing the current bugs and removing the fprintf that currently prints to 
stderr there.
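
A minimal sketch of that "at most `count`, leftmost valid part" rule in pure Python. The function name `parse_prefix` is hypothetical, and `count=-1` meaning "no limit" is assumed to follow the existing convention:

```python
def parse_prefix(text, sep=",", count=-1):
    """Hypothetical sketch: parse the leftmost valid run, at most `count` items."""
    values = []
    for field in text.split(sep):
        if count >= 0 and len(values) >= count:
            break
        try:
            values.append(float(field))
        except ValueError:
            # Stop silently at the first field that is not a number
            # (an empty field counts as malformed too).
            break
    return values
```

With this, `parse_prefix("1,2,,3")` gives `[1.0, 2.0]`, and nothing is printed or raised along the way.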

In addition, a flag could be added that forces them to raise a
ValueError on malformed input (e.g. EOF reached before `count` entries
were read, or a bad separator encountered). Ideally, the exceptions flag
would be True by default for both fromfile and fromstring, but I guess
some legacy applications might rely on the current behavior...

Also, one could envision a "default" value that would denote a batch of 
malformed input...

   ***

So, I see a couple of alternatives (some already suggested):

a) fromstring("1,2,x,4", sep=",") -> [1,2]
   fromstring("1,2,x,4", sep=",", strict=True) -> ValueError
   fromstring("1,2,x,4", sep=",", count=5) -> [1,2]
   fromstring("1,2,x,4", sep=",", count=5, strict=True) -> ValueError

b) fromstring("1,2,x,4", sep=",") -> [1,2]
   fromstring("1,2,x,4", sep=",", strict=True) -> ValueError
   fromstring("1,2,x,4", sep=",", default=3) -> [1,2,3,4]
   fromstring("1,2,x,4", sep=",", count=5) -> [1,2]
   fromstring("1,2,x,4", sep=",", count=5, strict=True) -> ValueError

c) fromstring("1,2,x,4", sep=",") -> [1,2] + SomeWarning
   fromstring("1,2,x,4", sep=",", count=5) -> [1,2] + SomeWarning

d) fromstring("1,2,x,4", sep=",") -> [1,2] + SomeWarning
   fromstring("1,2,x,4", sep=",", default=3) -> [1,2,3,4]
   fromstring("1,2,x,4", sep=",", default=3, count=5) -> [1,2,3,4] + SomeWarning

e) fromstring("1,2,x,4", sep=",") -> ValueError
   fromstring("1,2,x,4", sep=",", strict=False) -> [1,2]
   fromstring("1,2,x,4", sep=",", count=5) -> ValueError
   fromstring("1,2,x,4", sep=",", count=5, strict=False) -> [1,2]

`fromfile` would always behave the same way as `fromstring(file.read())`.
In the above, " " in sep would equal the regexp \s+, and binary data
implied by sep='' would be interpreted in the same way it would if first
converted to comma-separated text.
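
For concreteness, here is a rough pure-Python model of alternative (b). It is only a sketch: `fromstring_sketch` is a made-up name, `strict` and `default` are the placeholder keywords from the list above, the separator is treated as a literal string (no \s+ handling), and letting `strict` take priority over `default` is my assumption since the list does not cover that combination:

```python
def fromstring_sketch(text, sep=",", count=-1, strict=False, default=None):
    """Hypothetical model of alternative (b); not NumPy's fromstring."""
    values = []
    # Empty or whitespace-only input contains no fields at all.
    fields = text.split(sep) if text.strip() else []
    for field in fields:
        if count >= 0 and len(values) == count:
            break
        try:
            values.append(float(field))
        except ValueError:
            if strict:
                # Assumption: strict wins over default when both are given.
                raise ValueError("malformed field: %r" % field)
            if default is None:
                break  # keep only the leftmost valid part
            values.append(float(default))
    if strict and count >= 0 and len(values) < count:
        raise ValueError("expected %d items, got %d" % (count, len(values)))
    return values
```

Dropping the `default` keyword from this gives (a); flipping the default of `strict` to True gives (e).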

Can you think of any other alternatives? (Let's forget the names of
the new keyword arguments for the present, and assume they have
perfectly fitting names.)

I'd vote for (e) if the slate was clean, but since it's not:

+1 for (a) or (b)

-- 
Pauli Virtanen