[Numpy-discussion] parsing text strings/files in fromfile, fromstring

Sun May 24 23:07:11 EDT 2009

On Sun, May 24, 2009 at 5:28 PM, Pauli Virtanen <pav at iki.fi> wrote:

> Sun, 24 May 2009 14:29:42 -0600, Charles R Harris wrote:
> > I am trying to put together some rules for parsing text strings/files in
> > fromfile, fromstring so that the two are consistent. Tickets relevant to
> > this are #1116 <http://projects.scipy.org/numpy/ticket/1116> and
> > #883 <http://projects.scipy.org/numpy/ticket/883>. The question here is
> > the interpretation of the separators, not the parsing of the numbers
> > themselves. Below is the current behavior of fromstring, fromfile, and
> > python split for content of "", "1", "1 1", " " respectively.
>
> It should return only the data that's in the file, no extra elements. The
> current behavior is a bug, IMHO, especially so since the default value is
> uninitialized IIRC.
>
> So,
>
>        fromstring("", sep=" ") -> []
>        fromstring(" ", sep=" ") -> []
>        fromstring("1 ", sep=" ") -> [1]
>
> fromfile should behave identically.
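
For comparison, here is how plain Python `str.split` treats the four inputs mentioned above ("", "1", "1 1", " "). This is only a reference point, not a description of what fromstring or fromfile do: calling `split()` with no argument already yields the empty results proposed here, whereas `split(" ")` keeps empty fields.

```python
inputs = ["", "1", "1 1", " "]

# str.split() with no argument splits on runs of whitespace and
# discards empty fields, so "" and " " both give an empty list.
no_arg = [s.split() for s in inputs]

# str.split(" ") treats every single space as a delimiter and keeps
# the empty fields on either side of it.
one_space = [s.split(" ") for s in inputs]

for s, a, b in zip(inputs, no_arg, one_space):
    print(repr(s), a, b)
```

The no-argument form is the one that lines up with "return only the data that's in the file".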
>
> Another question is perhaps what to do with malformed input: whether
> to try best-effort parsing, or bail out. I'd suggest bailing out
> when encountering bad data rather than guessing:
>
>        fromstring("1,2,,3", sep=",") -> [1,2] or ValueError
>
> Currently, something horrible happens:
>
>        >>> np.fromstring('1,2,,3,,,6', sep=',')
>        array([ 1.,  2., -1.,  3., -1., -1.,  6.])
>
>
> Also, on second thoughts, the idea about raising a warning on malformed
> input seems more repulsive the more I think about it. Warnings are a bit
> nasty to catch, spam stderr if uncaught, and IMHO should not be a part of
> "business as usual" code paths. Having malformed input is business as
> usual :)
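
To make the point about warnings concrete, here is a small pure-Python sketch of what a caller has to do to detect malformed input when it is reported through a warning. The function `parse_with_warning` and its choice of `RuntimeWarning` are hypothetical and exist only for this illustration; they are not part of NumPy.

```python
import warnings

def parse_with_warning(text, sep=","):
    """Hypothetical parser: returns the leftmost valid part and warns."""
    values = []
    for tok in text.split(sep):
        try:
            values.append(float(tok))
        except ValueError:
            warnings.warn("malformed input; returning leftmost valid part",
                          RuntimeWarning)
            break
    return values

# The caller must install a recording filter just to learn that
# something went wrong; otherwise the message only goes to stderr.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = parse_with_warning("1,2,x,4")

malformed = any(issubclass(w.category, RuntimeWarning) for w in caught)
print(result, malformed)
```

With an exception, the same check would be a plain `try`/`except ValueError` around the call.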
>
> In some sense, it would be simpler if `fromfile` and `fromstring` were
> defined so that they read *at most* `count` entries, and return what
> they got by parsing the leftmost valid part. This could be implemented by
> fixing the current bugs and removing the fprintf that currently prints to
> stderr there.
>
> As an addition, a flag could be added that forces them to raise a
> ValueError on malformed input (e.g., EOF when `count` was given, or bad
> separator encountered). Ideally, the exceptions flag would be True by
> default both for fromfile and fromstring, but I guess some legacy
> applications might rely on the current behavior...
>
> Also, one could envision a "default" value that would stand in for each
> piece of malformed input...
>
>   ***
>
> So, I see a couple of alternatives (some already suggested):
>
> a) fromstring("1,2,x,4", sep=",") -> [1,2]
>   fromstring("1,2,x,4", sep=",", strict=True) -> ValueError
>   fromstring("1,2,x,4", sep=",", count=5) -> [1,2]
>   fromstring("1,2,x,4", sep=",", count=5, strict=True) -> ValueError
>
> b) fromstring("1,2,x,4", sep=",") -> [1,2]
>   fromstring("1,2,x,4", sep=",", strict=True) -> ValueError
>   fromstring("1,2,x,4", sep=",", default=3) -> [1,2,3,4]
>   fromstring("1,2,x,4", sep=",", count=5) -> [1,2]
>   fromstring("1,2,x,4", sep=",", count=5, strict=True) -> ValueError
>
> c) fromstring("1,2,x,4", sep=",") -> [1,2] + SomeWarning
>   fromstring("1,2,x,4", sep=",", count=5) -> [1,2] + SomeWarning
>
> d) fromstring("1,2,x,4", sep=",") -> [1,2] + SomeWarning
>   fromstring("1,2,x,4", sep=",", default=3) -> [1,2,3,4]
>   fromstring("1,2,x,4", sep=",", default=3, count=5) -> [1,2,3,4] + SomeWarning
>
> e) fromstring("1,2,x,4", sep=",") -> ValueError
>   fromstring("1,2,x,4", sep=",", strict=False) -> [1,2]
>   fromstring("1,2,x,4", sep=",", count=5) -> ValueError
>   fromstring("1,2,x,4", sep=",", count=5, strict=False) -> [1,2]
>
> Fromfile would always behave the same way as `fromstring(file.read())`.
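
As a way to compare the alternatives, here is a minimal pure-Python reference sketch of alternative (b), which also covers (a) if `default` is left unset and (e) if the default of `strict` is flipped. The names `parse_text`, `parse_stream`, `strict` and `default` are placeholders taken from the examples above, not an existing NumPy API, and only text mode is covered (binary `sep=''` is left out).

```python
import io
import re

def parse_text(text, sep=",", count=-1, strict=False, default=None):
    """Hypothetical reference parser for alternative (b)."""
    if sep == "":
        raise ValueError("binary mode (sep='') is not covered by this sketch")
    core = sep.strip()
    # A whitespace-only separator matches any run of whitespace; any
    # other separator may be surrounded by optional whitespace.
    pattern = r"\s*" + re.escape(core) + r"\s*" if core else r"\s+"
    stripped = text.strip()
    tokens = re.split(pattern, stripped) if stripped else []

    values = []
    for tok in tokens:
        if 0 <= count <= len(values):
            break  # read at most `count` entries
        try:
            values.append(float(tok))
        except ValueError:
            if default is not None:
                values.append(float(default))  # substitute the default
            elif strict:
                raise ValueError("malformed entry %r" % tok)
            else:
                break  # keep only the leftmost valid part
    if strict and count >= 0 and len(values) < count:
        raise ValueError("expected %d entries, got %d" % (count, len(values)))
    return values

def parse_stream(fh, **kwargs):
    # File parsing defined as text parsing of the full contents.
    return parse_text(fh.read(), **kwargs)

print(parse_text("1,2,x,4"))                      # leftmost valid part
print(parse_text("1,2,x,4", default=3))           # bad entry replaced
print(parse_stream(io.StringIO("1 1"), sep=" "))  # same rules for files
```

Note that Python's `float()` accepts a few spellings (such as "nan" or "1_000") that a real implementation might reject, so this sketch is about the separator and error-handling rules only.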

I think a common behavior is basic to whatever we end up with.

>
> In the above, " " in sep would equal the regexp \s+, and binary data
> implied by sep='' would be interpreted in the same way it would if first
> converted to comma-separated text.
>
> Can you think of any other alternatives? (Let's forget the names of
> the new keyword arguments for the present, and assume they have
> perfectly fitting names.)
>
>
> I'd vote for (e) if the slate was clean, but since it's not:
>
> +1 for (a) or (b)
>

(a) and (e) are the simplest and just differ in the default, so that would
be the shortest path. OTOH, (b) is the most general and the default is a
nice idea. Hmm...

Chuck