Re: [Numpy-discussion] genfromtxt - the return

7 Oct 2009

      On 10/07/2009 02:14 PM, Christopher Barker wrote:
...
Pierre GM wrote:
...
On Oct 6, 2009, at 10:08 PM, Bruce Southey wrote:
...
option to merge delimiters - actually in SAS it is default
Wow! that sure strikes me as a bad choice.
...
Ahah! I get it. Well, I remember that we discussed something like that a
few months ago when I started working on np.genfromtxt, and the
default of *not* merging whitespace was requested. I'm going to check
whether we can't put this option somewhere now...
I'd think you might want to have two options: either "whitespace" which
would be any type or amount of whitespace, or a specific delimiter: say
"\t" or " " or "  " (two spaces), etc. In that case, it would mean "one
and only one of these".
Of course, this would fail in Bruce's example:
...
A B C D
 1 2 3 4
 1     4 5
as there is a space for the delimiter, and one for the data! This looks
like fixed-format to me. If it were single-space delimited, it would
look more like:
when the delimiter is whitespace.
A B C D E
1 2 3 4 5
1   4 5
which is the same as:
A, B, C, D, E
1, 2, 3, 4, 5
1,  ,  , 4, 5
If something like SAS actually does merge delimiters, which I interpret
to mean that if there are a few empty fields and you call for
tab-delimited, you only get one tab, then information has simply been
lost -- there is no way to recover it!
-Chris
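
The difference between merging whitespace and treating every space as its own delimiter can be sketched with plain Python string splitting. This is only an illustration on the third row of the single-space-delimited example above, not a description of how genfromtxt is implemented: `str.split()` with no argument merges runs of whitespace, while `str.split(" ")` splits on each single space and keeps the empty fields.

```python
# Third row of the single-space-delimited example: two empty fields in the middle.
row = "1   4 5"

merged = row.split()       # runs of whitespace collapse into one separator
strict = row.split(" ")    # every single space is a separator

print(merged)  # ['1', '4', '5']
print(strict)  # ['1', '', '', '4', '5']
```

With merging, the row comes back with three fields and nothing says which columns were empty; with one space per delimiter, the five columns survive.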
To use fixed-length fields you really need nicely formatted data, and I
usually do not have that. As a default, merging delimiters does not always work for non-whitespace delimiters, such as:
A,B,C
,,1
1,2,3

There is an option to override that behavior. But merging is very useful when you have
extra whitespace, especially when reading in text strings that have different
lengths or different levels of whitespace padding.

The following is correct in that Python does merge whitespace delimiters by default. This is also what SAS does by default for any delimiter. But it is incorrect if each whitespace character is a delimiter:

from StringIO import StringIO  # Python 2, as in the traceback below
import numpy as np

s = StringIO('''
  1 10 100\r\n
10  1 1000''')
np.genfromtxt(s)
array([[    1.,    10.,   100.],
        [   10.,     1.,  1000.]])

np.genfromtxt(s, delimiter=' ')
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/usr/lib64/python2.6/site-packages/numpy/lib/io.py", line 1048, in genfromtxt
     raise IOError('End-of-file reached before encountering data.')
IOError: End-of-file reached before encountering data.
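
A possible explanation for the traceback, offered as an assumption and not checked against the genfromtxt source: the first call reads `s` to the end, so the second call starts at end-of-file and finds no data, whatever the delimiter. The stdlib-only sketch below shows the same effect with a plain `read()` and how `seek(0)` rewinds the buffer. It also counts the fields each row would have if every single space were a delimiter. The data is a simplified copy of the two rows above, without the blank lines and the `\r`.

```python
from io import StringIO  # Python 3 location; the session above used Python 2

s = StringIO("  1 10 100\n10  1 1000")

first = s.read()    # consumes the whole buffer
second = s.read()   # nothing left: the position is at end-of-file
s.seek(0)           # rewind before reading again
third = s.read()

# Field counts per row if each single space is its own delimiter.
counts = [len(line.split(" ")) for line in third.splitlines()]

print(repr(second))    # ''
print(third == first)  # True
print(counts)          # [5, 4]
```

Since the two rows yield different field counts, a single-space delimiter would probably still be a poor fit for this input even after rewinding.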

Anyhow, I do like what genfromtxt is doing so merging multiple delimiters of the same type is not really needed.

Bruce