[Numpy-discussion] parsing text strings/files in fromfile, fromstring
Pauli Virtanen
pav at iki.fi
Sun May 24 19:28:05 EDT 2009
Sun, 24 May 2009 14:29:42 -0600, Charles R Harris wrote:
> I am trying to put together some rule for parsing text strings/files in
> fromfile, fromstring so that the two are consistent. Tickets relevant to
> this are #1116 <http://projects.scipy.org/numpy/ticket/1116> and
> #883<http://projects.scipy.org/numpy/ticket/883>. The question here is
> the interpretation of the separators, not the parsing of the numbers
> themselves. Below is the current behavior of fromstring, fromfile, and
> python split for content of "", "1", "1 1", " " respectively.
It should return only the data that's in the file, no extra elements. The
current behavior is a bug, IMHO, especially so since the default value is
uninitialized IIRC.
So,
fromstring("", sep=" ") -> []
fromstring(" ", sep=" ") -> []
fromstring("1 ", sep=" ") -> [1]
fromfile should behave identically.
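A minimal pure-Python sketch of that rule for a whitespace separator, mirroring what `str.split()` with no argument already does. The name `parse_whitespace_separated` is made up for illustration and is not a NumPy function; it returns a plain list rather than an array to keep the example self-contained:

```python
def parse_whitespace_separated(text):
    """Hypothetical helper, not part of NumPy.

    str.split() with no argument treats any run of whitespace as a single
    separator and ignores leading/trailing whitespace, so empty or blank
    input yields no fields at all.
    """
    return [float(token) for token in text.split()]


print(parse_whitespace_separated(""))     # []
print(parse_whitespace_separated(" "))    # []
print(parse_whitespace_separated("1 "))   # [1.0]
print(parse_whitespace_separated("1 1"))  # [1.0, 1.0]
```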
Another question is perhaps what to do with malformed input: whether
to try best-efforts parsing, or bail out. I'd suggest bailing out
when encountering bad data rather than guessing:
fromstring("1,2,,3", sep=",") -> [1,2] or ValueError
Currently, something horrible happens:
>>> np.fromstring('1,2,,3,,,6', sep=',')
array([ 1., 2., -1., 3., -1., -1., 6.])
Also, on second thoughts, the idea about raising a warning on malformed
input seems more repulsive the more I think about it. Warnings are a bit
nasty to catch, spam stderr if uncaught, and IMHO should not be a part of
"business as usual" code paths. Having malformed input is business as
usual :)
In some sense, it would be simpler if `fromfile` and `fromstring` were
defined so that they read *at most* `count` entries, and return what
they got by parsing the leftmost valid part. This could be implemented by
fixing the current bugs and removing the fprintf that currently prints to
stderr there.
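The "at most `count` entries, leftmost valid part" rule could be sketched roughly as follows. This is pure Python under stated assumptions, not the actual C implementation: `parse_leftmost` is a hypothetical name, `sep` is assumed to be a non-whitespace separator, and `count=-1` is assumed to mean "no limit":

```python
def parse_leftmost(text, sep=",", count=-1):
    """Hypothetical helper: parse fields left to right, stop quietly at the
    first field that is not a number or once `count` values were read."""
    values = []
    for field in text.split(sep):
        if count >= 0 and len(values) >= count:
            break  # never return more than `count` entries
        try:
            values.append(float(field))
        except ValueError:
            break  # empty or malformed field: keep only what came before
    return values


print(parse_leftmost("1,2,,3,,,6"))        # [1.0, 2.0]
print(parse_leftmost("1,2,x,4", count=5))  # [1.0, 2.0]
print(parse_leftmost("1,2,3", count=2))    # [1.0, 2.0]
print(parse_leftmost(""))                  # []
```

Nothing is printed to stderr and nothing is raised; the caller can compare the length of the result with `count` if it cares.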
As an addition, a flag could be added that forces them to raise a
ValueError on malformed input (e.g. EOF when `count` was given, or bad
separator encountered). Ideally, the exceptions flag would be True by
default both for fromfile and fromstring, but I guess some legacy
applications might rely on the current behavior...
Also, one could envision a "default" value that would be substituted for
each malformed entry...
***
So, I see a couple of alternatives (some already suggested):
a) fromstring("1,2,x,4", sep=",") -> [1,2]
   fromstring("1,2,x,4", sep=",", strict=True) -> ValueError
   fromstring("1,2,x,4", sep=",", count=5) -> [1,2]
   fromstring("1,2,x,4", sep=",", count=5, strict=True) -> ValueError

b) fromstring("1,2,x,4", sep=",") -> [1,2]
   fromstring("1,2,x,4", sep=",", strict=True) -> ValueError
   fromstring("1,2,x,4", sep=",", default=3) -> [1,2,3,4]
   fromstring("1,2,x,4", sep=",", count=5) -> [1,2]
   fromstring("1,2,x,4", sep=",", count=5, strict=True) -> ValueError

c) fromstring("1,2,x,4", sep=",") -> [1,2] + SomeWarning
   fromstring("1,2,x,4", sep=",", count=5) -> [1,2] + SomeWarning

d) fromstring("1,2,x,4", sep=",") -> [1,2] + SomeWarning
   fromstring("1,2,x,4", sep=",", default=3) -> [1,2,3,4]
   fromstring("1,2,x,4", sep=",", default=3, count=5) -> [1,2,3,4] + SomeWarning

e) fromstring("1,2,x,4", sep=",") -> ValueError
   fromstring("1,2,x,4", sep=",", strict=False) -> [1,2]
   fromstring("1,2,x,4", sep=",", count=5) -> ValueError
   fromstring("1,2,x,4", sep=",", count=5, strict=False) -> [1,2]
Fromfile would always behave the same way as `fromstring(file.read())`.
In the above, " " in sep would equal the regexp \s+, and binary data
implied by sep='' would be interpreted in the same way it would if first
converted to comma-separated text.
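To make the differences between the alternatives concrete, here is a rough pure-Python sketch of (b); (a) is the same without `default`, and (e) is the same with `strict=True` as the default. The function name `parse_fields` is hypothetical, the keyword names are just the placeholders used above, and only non-whitespace text separators are handled (no sep=" " and no binary sep=''):

```python
def parse_fields(text, sep=",", count=-1, strict=False, default=None):
    """Hypothetical sketch of alternative (b), not a NumPy API."""
    # Blank input contains no data at all, so it yields no fields.
    fields = text.split(sep) if text.strip() else []
    values = []
    for field in fields:
        if count >= 0 and len(values) >= count:
            break
        try:
            values.append(float(field))
        except ValueError:
            if default is not None:
                values.append(float(default))  # substitute and carry on
            elif strict:
                raise ValueError("malformed field %r" % field)
            else:
                break  # keep the leftmost valid part only
    if strict and count >= 0 and len(values) < count:
        raise ValueError("expected %d entries, got %d" % (count, len(values)))
    return values


print(parse_fields("1,2,x,4"))             # [1.0, 2.0]
print(parse_fields("1,2,x,4", default=3))  # [1.0, 2.0, 3.0, 4.0]
print(parse_fields("1,2,x,4", count=5))    # [1.0, 2.0]
try:
    parse_fields("1,2,x,4", count=5, strict=True)
except ValueError as exc:
    print("ValueError:", exc)
```

How `default` and `strict` should interact is not spelled out in the alternatives; in this sketch a given `default` simply takes precedence over `strict` for malformed fields, which is an assumption.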
Can you think of any other alternatives? (Let's forget the names of
the new keyword arguments for the present, and assume they have
perfectly fitting names.)
I'd vote for (e) if the slate were clean, but since it's not:
+1 for (a) or (b)
--
Pauli Virtanen