All, Could you try r7449 ? I introduced some mechanisms to keep track of invalid lines (where the number of columns doesn't match what's expected). By default, a warning is emitted and these lines are skipped, but an optional argument gives the possibility to raise an exception instead. Now, I need more tests about wrong converters. I'm trying to optimize the upgrade mechanism (there are too many intertwined loops for my taste now), I'll keep you posted. Meanwhile, if you could come up with more cases of failure, please send them my way. Cheers P.
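For readers following along, the two behaviours described above (skip with a warning, or raise) can be sketched as follows. This is only an illustration, assuming a recent NumPy in which the optional argument is spelled invalid_raise (the name used later in this thread); the sample data is made up.

```python
import warnings
from io import StringIO

import numpy as np

# Made-up sample: the second line has 2 columns instead of 3.
DATA = "1,2,3\n4,5\n6,7,8\n"

# invalid_raise=False: the bad line is skipped and a warning is emitted.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept = np.genfromtxt(StringIO(DATA), delimiter=",", invalid_raise=False)

# invalid_raise=True: the same input raises instead of silently dropping data.
try:
    np.genfromtxt(StringIO(DATA), delimiter=",", invalid_raise=True)
    raised = False
except ValueError:
    raised = True

print(kept)
print("warning emitted:", len(caught) > 0, "| exception raised:", raised)
```

Passing invalid_raise explicitly in both calls keeps the sketch independent of whichever default a given NumPy version uses.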
On 10/05/2009 02:13 PM, Pierre GM wrote:
All, Could you try r7449 ? I introduced some mechanisms to keep track of invalid lines (where the number of columns don't match what's expected). By default, a warning is emitted and these lines are skipped, but an optional argument gives the possibility to raise an exception instead. Now, I need more tests about wrong converters. I'm trying to optimize the upgrade mechanism (there are too many intertwined loops for my taste now), I'll keep you posted. Meanwhile, if you could come with more cases of failure, please send them my way. Cheers P.
Hi, Excellent, as the changes appear to address an incorrect number of delimiters. I think that the default invalid_raise should be True. One 'feature' is that there is no way to indicate multiple delimiters when the delimiter is whitespace.
A B C D
1 2 3 4
1 4 5
I consider this a user-beware issue when using whitespace as the delimiter, especially in Python. Bruce
On Oct 6, 2009, at 2:42 PM, Bruce Southey wrote:
Hi, Excellent as the changes appear to address incorrect number of delimiters.
They should also give some extra info if there's a problem w/ the converters.
I think that the default invalid_raise should be True.
Mmh, OK, that's a +1/) for invalid_raise=true. Anybody else ?
One 'feature' is that there is no way to indicate multiple delimiters when the delimiter is whitespace. A B C D 1 2 3 4 1 4 5
Have you tried using a sequence of integers for the delimiter ? Would you mind sending me some tests ?
Pierre GM wrote:
I think that the default invalid_raise should be True.
Mmh, OK, that's a +1/) for invalid_raise=true. Anybody else ?
yup -- make it +2 -- ignoring errors and losing data by default is a "bad idea"!
One 'feature' is that there is no way to indicate multiple delimiters when the delimiter is whitespace. A B C D 1 2 3 4 1 4 5
I'd say someone has made a very poor choice of file formats! Unless this is a fixed width file, in which case it should be processed as such, rather than as a delimited one. I suppose it wouldn't hurt to add that feature to genfromtxt... or is it there already? Perhaps that's what this means:
Have you tried using a sequence of integers for the delimiter ?
-Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On Oct 6, 2009, at 4:43 PM, Christopher Barker wrote:
Pierre GM wrote:
I think that the default invalid_raise should be True.
Mmh, OK, that's a +1/) for invalid_raise=true. Anybody else ?
yup -- make it +2 -- ignoring errors and losing data by default is a "bad idea"!
OK then, that's enough for me: I'll put invalid_raise as True by default. Note that a warning was emitted no matter what.
One 'feature' is that there is no way to indicate multiple delimiters when the delimiter is whitespace. A B C D 1 2 3 4 1 4 5
I'd say someone has made a very poor choice of file formats!
Unless this is a fixed width file, in which case it should be processed as such, rather than as a delimited one. I suppose it wouldn't hurt to add that feature to genfromtxt... or is it there already? Perhaps that's what this means:
Have you tried using a sequence of integers for the delimiter ?
Yes, if you give a sequence of integers as delimiter, it is interpreted as the length of each field. At least, should be.
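A small sketch of the fixed-width reading described above, for anyone who wants to try it. It assumes a recent NumPy; the sample data and field widths are made up, and the blank field is expected to come back as nan with the default float dtype.

```python
from io import StringIO

import numpy as np

# Made-up fixed-width sample: three fields, each exactly 3 characters wide.
# The middle field of the second row is blank (three spaces).
rows = [
    "  1" + "  2" + "  3",
    " 44" + "   " + " 66",
]
DATA = "\n".join(rows) + "\n"

# A sequence of integers is interpreted as the width of each field.
arr = np.genfromtxt(StringIO(DATA), delimiter=(3, 3, 3))
print(arr)
```

Because the fields are cut by position rather than by a separator character, the empty middle field keeps its place in the row.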
On Tue, Oct 6, 2009 at 4:04 PM, Pierre GM
On Oct 6, 2009, at 4:43 PM, Christopher Barker wrote:
Pierre GM wrote:
I think that the default invalid_raise should be True.
Mmh, OK, that's a +1/) for invalid_raise=true. Anybody else ?
yup -- make it +2 -- ignoring errors and losing data by default is a "bad idea"!
OK then, that's enough for me: I'll put invalid_raise as True by default. Note that a warning was emitted no matter what.
One 'feature' is that there is no way to indicate multiple delimiters when the delimiter is whitespace. A B C D 1 2 3 4 1 4 5
I'd say someone has made a very poor choice of file formats!
No, just seeing what sort of problems I can create. This case is partly based on the fact that if someone is using a tab-delimited file, they need to set delimiter='\t', otherwise it gives an error. Also I often parse text files so, yes, you have to be careful of the delimiters. It also arises because in certain programs like spreadsheets there is the option to merge delimiters - actually in SAS it is the default (you need to specify the DSD option).
Unless this is a fixed width file, in which case it should be processed as such, rather than as a delimited one. I suppose it wouldn't hurt to add that feature to genfromtxt... or is it there already? Perhaps that's what this means:
Have you tried using a sequence of integers for the delimiter ?
Yes, if you give a sequence of integers as delimiter, it is interpreted as the length of each field. At least, should be.
More to learn and test. Anyhow, I am really impressed on how this function works. Bruce
On Oct 6, 2009, at 10:08 PM, Bruce Southey wrote:
No, just seeing what sort of problems I can create. This case is partly based on the fact that if someone is using a tab-delimited file, they need to set delimiter='\t', otherwise it gives an error. Also I often parse text files so, yes, you have to be careful of the delimiters. It also arises because in certain programs like spreadsheets there is the option to merge delimiters - actually in SAS it is the default (you need to specify the DSD option).
Ahah! I get it. Well, I remember that we discussed something like that a few months ago when I started working on np.genfromtxt, and the default of *not* merging whitespaces was requested. I gonna check whether we can't put this option somewhere now...
Anyhow, I am really impressed on how this function works.
Thx. I hope things haven't been slowed down too much.
On Tue, Oct 6, 2009 at 10:27 PM, Pierre GM
Anyhow, I am really impressed on how this function works.
Thx. I hope things haven't been slowed down too much.
In keeping with the making some work for you theme, I filed an enhancement ticket for one change that we discussed and another IMO useful addition. http://projects.scipy.org/numpy/ticket/1238
I think it would be nice if we could do
data = np.genfromtxt(SomeFile, dtype=float, names = ['var1', 'var2', 'var3' ...])
So that float is paired with each variable name. Also, the one that came up earlier of
data = np.genfromtxt(SomeFile, dtype=(int, int, float), names = ['var1','var2','var3'])
I'm not completely convinced on this one though, since dtype = "i8,i8,f8" works. I don't know how much confusion it would add to have the dtype argument accept a non-valid dtype construction.
Skipper
PS. Is it bad form for me to go ahead and assign these kinds of tickets to you if you're going to be working on them, or do you get pinged when any ticket is filed?
On Oct 6, 2009, at 11:01 PM, Skipper Seabold wrote:
In keeping with the making some work for you theme, I filed an enhancement ticket for one change that we discussed and another IMO useful addition. http://projects.scipy.org/numpy/ticket/1238
I think it would be nice if we could do
data = np.genfromtxt(SomeFile, dtype=float, names = ['var1', 'var2', 'var3' ...])
So that float is paired with each variable name. Also, the one that came up earlier of
data = np.genfromtxt(SomeFile, dtype=(int, int, float), names = ['var1','var2','var3'])
I'm not completely convinced on this one though, since dtype = "i8,i8,f8" works. I don't know how much confusion it would add to have the dtype argument accept a non-valid dtype construction.
Actually, it's rather straightforward. I already have something that supports dtype=(int,int,float) (far easier to handle than "i4,i4,f8"), I need to tweak a couple of things when the names don't match before posting. Pairing the names with the dtype is pretty neat, that would be quite easy to implement.
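The two call forms discussed in the ticket can be tried with the sketch below. It assumes a current NumPy in which both forms are accepted (it says nothing about the state of r7449); the sample data is made up and the field names are the ones from the ticket.

```python
from io import StringIO

import numpy as np

# Made-up sample with three whitespace-separated columns.
DATA = "1 2 3.5\n4 5 6.5\n"
NAMES = ["var1", "var2", "var3"]

# A single type paired with a list of names: every field gets that type.
all_float = np.genfromtxt(StringIO(DATA), dtype=float, names=NAMES)

# A sequence of types, one per name.
mixed = np.genfromtxt(StringIO(DATA), dtype=(int, int, float), names=NAMES)

print(all_float.dtype)
print(mixed.dtype)
```

In both cases the result is a structured array, so columns can be accessed by name, for example mixed["var3"].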
PS. Is it bad form for me to go ahead and assign these kinds of tickets to you if you're going to be working on them, or do you get pinged when any ticket is filed?
Go for it. I'm only notified when a ticket is assigned to me directly.
Pierre GM wrote:
On Oct 6, 2009, at 10:08 PM, Bruce Southey wrote:
option to merge delimiters - actually in SAS it is default
Wow! that sure strikes me as a bad choice.
Ahah! I get it. Well, I remember that we discussed something like that a few months ago when I started working on np.genfromtxt, and the default of *not* merging whitespaces was requested. I gonna check whether we can't put this option somewhere now...
I'd think you might want to have two options: either "whitespace" which would be any type or amount of whitespace, or a specific delimiter: say "\t" or " " or "  " (two spaces), etc. In that case, it would mean "one and only one of these". Of course, this would fail in Bruce's example:
A B C D 1 2 3 4 1 4 5
as there is a space for the delimiter, and one for the data! This looks like fixed-format to me. If it were single-space delimited, it would look more like:
A B C D E
1 2 3 4 5
1   4 5
which is the same as:
A, B, C, D, E
1, 2, 3, 4, 5
1, , , 4, 5
If something like SAS actually does merge delimiters, which I interpret to mean that if there are a few empty fields and you call for tab-delimited, you only get one tab, then information has simply been lost -- there is no way to recover it!
-Chris
On 10/07/2009 02:14 PM, Christopher Barker wrote:
Pierre GM wrote:
On Oct 6, 2009, at 10:08 PM, Bruce Southey wrote:
option to merge delimiters - actually in SAS it is default
Wow! that sure strikes me as a bad choice.
Ahah! I get it. Well, I remember that we discussed something like that a few months ago when I started working on np.genfromtxt, and the default of *not* merging whitespaces was requested. I gonna check whether we can't put this option somewhere now...
I'd think you might want to have two options: either "whitespace" which would be any type or amount of whitespace, or a specific delimiter: say "\t" or " " or "  " (two spaces), etc. In that case, it would mean "one and only one of these".
Of course, this would fail in Bruce's example:
A B C D 1 2 3 4 1 4 5
as there is a space for the delimiter, and one for the data! This looks like fixed-format to me. If it were single-space delimited, it would look more like:
A B C D E
1 2 3 4 5
1   4 5
which is the same as:
A, B, C, D, E
1, 2, 3, 4, 5
1, , , 4, 5
If something like SAS actually does merge delimiters, which I interpret to mean that if there are a few empty fields and you call for tab-delimited, you only get one tab, then information has simply been lost -- there is no way to recover it!
-Chris
To use fixed length fields you really need nicely formatted data and I usually do not have that. As a default it does not always work for non-whitespace delimiters such as:
A,B,C
,,1
1,2,3
There is an option to override that behavior. But it is very useful when you have extra whitespace especially reading in text strings that have different lengths or different levels of whitespace padding. The following is correct in that Python does merge whitespace delimiters by default. This is also what SAS does by default for any delimiter. But it is incorrect if each whitespace character is a delimiter:
>>> s = StringIO(''' 1 10 100\r\n 10 1 1000''')
>>> np.genfromtxt(s)
array([[ 1., 10., 100.], [ 10., 1., 1000.]])
>>> np.genfromtxt(s, delimiter=' ')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.6/site-packages/numpy/lib/io.py", line 1048, in genfromtxt
    raise IOError('End-of-file reached before encountering data.')
IOError: End-of-file reached before encountering data.
Anyhow, I do like what genfromtxt is doing so merging multiple delimiters of the same type is not really needed. Bruce
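One caveat when reproducing the session above: the end-of-file error appears to come from passing the same, already consumed StringIO object to the second call, rather than from the delimiter itself. The sketch below, assuming a recent NumPy and made-up data, uses a fresh buffer for each call so that the two delimiter settings can be compared directly.

```python
from io import StringIO

import numpy as np

# Made-up sample: the second line has two spaces between 4 and 5.
TEXT = "1 2 3\n" + "4" + "  " + "5 6\n"

# Default delimiter: any run of whitespace counts as one separator.
merged = np.genfromtxt(StringIO(TEXT))

# delimiter=" ": every single space is a separator, so the second line
# has an extra (empty) field and no longer matches the first line.
try:
    np.genfromtxt(StringIO(TEXT), delimiter=" ", invalid_raise=True)
    strict_failed = False
except ValueError:
    strict_failed = True

print(merged)
print("single-space delimiter rejected the file:", strict_failed)
```

With a fresh buffer, the failure in the second call is the column-count mismatch that invalid_raise reports, not an end-of-file condition.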
On Oct 7, 2009, at 3:54 PM, Bruce Southey wrote:
Anyhow, I do like what genfromtxt is doing so merging multiple delimiters of the same type is not really needed.
Thinking about it, merging multiple delimiters of the same type can be tricky: how do you distinguish between, say, "AAA\t\tCCC" where you expect 2 fields and "AAA\t\tCCC" where you expect 3 fields but the second one is missing ? I think 'genfromtxt' works consistently right now (but of course, as soon as I say that we'll find some counter-examples), so let's not break it. Yet.
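The ambiguity is easy to see with plain Python string splitting, independent of genfromtxt. The sketch below only illustrates the two readings of the same line; it is not how genfromtxt is implemented.

```python
line = "AAA\t\tCCC"

# One tab = one separator: three fields, the middle one is empty (missing).
strict = line.split("\t")

# Runs of whitespace merged into one separator: only two fields remain.
merged = line.split()

print(strict)  # ['AAA', '', 'CCC']
print(merged)  # ['AAA', 'CCC']
```

Once the separators are merged, nothing in the line says whether a field was missing, which is the information loss Chris describes above.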
On Tue, Oct 6, 2009 at 10:08 PM, Bruce Southey
On Tue, Oct 6, 2009 at 4:04 PM, Pierre GM wrote:
On Oct 6, 2009, at 4:43 PM, Christopher Barker wrote:
Pierre GM wrote:
I think that the default invalid_raise should be True.
Mmh, OK, that's a +1/) for invalid_raise=true. Anybody else ?
yup -- make it +2 -- ignoring errors and losing data by default is a "bad idea"!
OK then, that's enough for me: I'll put invalid_raise as True by default. Note that a warning was emitted no matter what.
One 'feature' is that there is no way to indicate multiple delimiters when the delimiter is whitespace. A B C D 1 2 3 4 1 4 5
I'd say someone has made a very poor choice of file formats!
No, just seeing what sort of problems I can create. This case is partly based on the fact that if someone is using a tab-delimited file, they need to set delimiter='\t', otherwise it gives an error. Also I often parse text files so, yes, you have to be careful of the delimiters. It also arises because in certain programs like spreadsheets there is the option to merge delimiters - actually in SAS it is the default (you need to specify the DSD option).
Unless this is a fixed width file, in which case it should be processed as such, rather than as a delimited one. I suppose it wouldn't hurt to add that feature to genfromtxt... or is it there already? Perhaps that's what this means:
Have you tried using a sequence of integers for the delimiter ?
Yes, if you give a sequence of integers as delimiter, it is interpreted as the length of each field. At least, should be.
More to learn and test.
There's an example on using the fixed-width delimiter here: http://docs.scipy.org/numpy/docs/numpy.lib.io.genfromtxt/ As far as I know, it works fine.
Anyhow, I am really impressed on how this function works.
Agreed. Genfromtxt and the derived are very useful. Skipper
participants (4)
- Bruce Southey
- Christopher Barker
- Pierre GM
- Skipper Seabold