On 10/07/2009 02:14 PM, Christopher Barker wrote:
Pierre GM wrote:
On Oct 6, 2009, at 10:08 PM, Bruce Southey wrote:
option to merge delimiters - actually in SAS it is default
Wow! that sure strikes me as a bad choice.
Ahah! I get it. Well, I remember that we discussed something like that a few months ago when I started working on np.genfromtxt, and the default of *not* merging whitespaces was requested. I'm gonna check whether we can't put this option somewhere now...
I'd think you might want to have two options: either "whitespace", which would be any type or amount of whitespace, or a specific delimiter: say "\t" or " " or "  " (two spaces), etc. In that case, it would mean "one and only one of these".
Of course, this would fail in Bruce's example:
A B C D 1 2 3 4 1 4 5
as there is a space for the delimiter, and one for the data! This looks like fixed-format to me. If it were single-space delimited, it would look more like:

    A B C D E
    1 2 3 4 5
    1   4 5

which is the same as:

    A, B, C, D, E
    1, 2, 3, 4, 5
    1, , , 4, 5

If something like SAS actually does merge delimiters, which I interpret to mean that if there are a few empty fields and you call for tab-delimited, you only get one tab, then information has simply been lost -- there is no way to recover it!
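
To make the information-loss point concrete, here is a minimal sketch using only the Python standard library (not genfromtxt itself). The two sample rows and the variable names are made up for illustration; "merging" is emulated with a regular expression that treats any run of commas as one delimiter.

```python
import re

# Two different rows: the empty fields sit in different positions.
row_a = "1,,,4,5"
row_b = "1,4,,,5"

# One delimiter per field boundary: empty fields survive as empty strings.
strict_a = row_a.split(",")
strict_b = row_b.split(",")

# Merged delimiters (emulated): a run of commas counts as a single boundary.
merged_a = re.split(r",+", row_a)
merged_b = re.split(r",+", row_b)

print(strict_a)  # five fields, positions preserved
print(strict_b)  # five fields, positions preserved
print(merged_a)  # three fields, positions gone
print(merged_b)  # three fields, indistinguishable from merged_a
```

Once the delimiters are merged, both rows reduce to the same three values, so there is no way to tell which columns were empty.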
-Chris
To use fixed length fields you really need nicely formatted data and I usually do not have that. As a default it does not always work for non-whitespace delimiters such as:

    A,B,C
    ,,1
    1,2,3

There is an option to override that behavior. But it is very useful when you have extra whitespace especially reading in text strings that have different lengths or different levels of whitespace padding.

The following is correct in that Python does merge whitespace delimiters by default. This is also what SAS does by default for any delimiter. But it is incorrect if each whitespace character is a delimiter:

    >>> s = StringIO(''' 1 10 100\r\n 10 1 1000''')
    >>> np.genfromtxt(s)
    array([[ 1., 10., 100.],
           [ 10., 1., 1000.]])
    >>> np.genfromtxt(s, delimiter=' ')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib64/python2.6/site-packages/numpy/lib/io.py", line 1048, in genfromtxt
        raise IOError('End-of-file reached before encountering data.')
    IOError: End-of-file reached before encountering data.

Anyhow, I do like what genfromtxt is doing so merging multiple delimiters of the same type is not really needed.

Bruce
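
For anyone wanting to try the whitespace behaviour without numpy, the following is a rough standard-library sketch, not a description of how genfromtxt is implemented. The sample text and its padding are invented for illustration. It contrasts str.split() with no argument (runs of whitespace merged) against str.split(" ") (every space is a delimiter). It also shows that a StringIO object is at end-of-file after one full read, which is probably why the second call in the session above reports end-of-file instead of a parsing result; that reading of the traceback is an assumption.

```python
import io

# Hypothetical sample with uneven padding between the values.
text = "  1  10 100\n 10   1 1000"

# No argument: runs of whitespace are merged and leading whitespace is ignored.
merged_rows = [line.split() for line in text.splitlines()]

# Explicit " ": each space is its own delimiter, so empty fields appear.
strict_rows = [line.split(" ") for line in text.splitlines()]

print(merged_rows)
print(strict_rows)

# A file-like object is exhausted after it has been read once.
s = io.StringIO(text)
first_read = s.read()    # the whole text
second_read = s.read()   # empty string: the stream is already at EOF
s.seek(0)                # rewind before handing the stream to a second reader
third_read = s.read()    # the whole text again
```

Creating a fresh StringIO, or calling s.seek(0), before the second read avoids mixing up the end-of-file condition with the effect of the delimiter.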