Re: [Numpy-discussion] np.loadtxt : yet a new implementation...

Well, looks like the attachment is too big, so here's the implementation. The tests will come in another message.

On Mon, Dec 1, 2008 at 12:21 PM, Pierre GM <pgmdevlist@gmail.com> wrote:
Well, looks like the attachment is too big, so here's the implementation. The tests will come in another message.
It looks like I am doing something wrong -- trying to parse a CSV file with dates formatted like '2008-10-14', with::

    import datetime, sys
    import dateutil.parser
    StringConverter.upgrade_mapper(dateutil.parser.parse,
                                   default=datetime.date(1900,1,1))
    r = loadtxt(sys.argv[1], delimiter=',', names=True)
    print r.dtype

I get the following::

    Traceback (most recent call last):
      File "genload_proposal.py", line 734, in ?
        r = loadtxt(sys.argv[1], delimiter=',', names=True)
      File "genload_proposal.py", line 711, in loadtxt
        (output, _) = genloadtxt(fname, **kwargs)
      File "genload_proposal.py", line 646, in genloadtxt
        rows[i] = tuple([conv(val) for (conv, val) in zip(converters, vals)])
      File "genload_proposal.py", line 385, in __call__
        raise ValueError("Cannot convert string '%s'" % value)
    ValueError: Cannot convert string '2008-10-14'

In debug mode, I see the following where the error occurs::

    ipdb> vals
    ('2008-10-14', '116.26', '116.40', '103.14', '104.08', '70749800', '104.08')
    ipdb> converters
    [<__main__.StringConverter instance at 0xa35fa6c>,
     <__main__.StringConverter instance at 0xa35ff2c>,
     <__main__.StringConverter instance at 0xa35ff8c>,
     <__main__.StringConverter instance at 0xa35ffec>,
     <__main__.StringConverter instance at 0xa15406c>,
     <__main__.StringConverter instance at 0xa1540cc>,
     <__main__.StringConverter instance at 0xa15412c>]

It looks like my registration of a custom converter isn't working. Here is what the _mapper looks like::

    In [23]: StringConverter._mapper
    Out[23]:
    [(<type 'numpy.bool_'>, <function str2bool at 0xa2b8bc4>, None),
     (<type 'numpy.integer'>, <type 'int'>, -1),
     (<type 'numpy.floating'>, <type 'float'>, -NaN),
     (<type 'complex'>, <type 'complex'>, (-NaN+0j)),
     (<type 'numpy.object_'>, <function parse at 0x8cf1534>, datetime.date(1900, 1, 1)),
     (<type 'numpy.string_'>, <type 'str'>, '???')]
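[Editorial sketch] The _mapper dump above implies a try-each-entry-in-order lookup; the snippet below is a minimal stand-alone model of that idea (not the actual genloadtxt code), showing why an upgraded date entry should be reached once the numeric conversions fail:

```python
import datetime

def str2bool(value):
    # Mimics the str2bool in the mapper: only "True"/"False" convert.
    if value.upper() == "TRUE":
        return True
    if value.upper() == "FALSE":
        return False
    raise ValueError("invalid boolean")

def parse_date(value):
    return datetime.datetime.strptime(value, "%Y-%m-%d").date()

# A toy version of StringConverter._mapper: (type, conversion, default).
_mapper = [
    (bool, str2bool, None),
    (int, int, -1),
    (float, float, float("nan")),
    (complex, complex, complex(float("nan"))),
    (object, parse_date, datetime.date(1900, 1, 1)),
    (str, str, "???"),
]

def convert(value):
    # Try each entry in order; the first conversion that succeeds wins.
    for _dtype, func, _default in _mapper:
        try:
            return func(value)
        except ValueError:
            continue
    raise ValueError("Cannot convert string '%s'" % value)

print(convert("2008-10-14"))  # 2008-10-14 (a datetime.date)
```

With this lookup order, '2008-10-14' falls through bool/int/float/complex and lands on the date parser, which is what the upgraded mapper was expected to do.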

Hi Pierre,

I've tested the new loadtxt briefly. Looks good, except that there's a minor bug when trying to use a specific whitespace delimiter (e.g. '\t') while still allowing other whitespace to appear within fields (e.g. spaces).

Specifically, on line 115 in LineSplitter, we have:

    self.delimiter = delimiter.strip() or None

so if I pass in, say, '\t' as the delimiter, self.delimiter gets set to None, which then causes the default any-whitespace-is-a-delimiter behavior to be used. This makes lines like "Gene Name\tPubMed ID\tStarting Position" get split wrong, even when I explicitly pass in '\t' as the delimiter!

Similarly, I believe that some of the tests are formulated wrong:

    def test_nodelimiter(self):
        "Test LineSplitter w/o delimiter"
        strg = " 1 2 3 4  5 # test"
        test = LineSplitter(' ')(strg)
        assert_equal(test, ['1', '2', '3', '4', '5'])

I think that treating an explicitly-passed-in ' ' delimiter as identical to 'no delimiter' is a bad idea. If I say that ' ' is the delimiter, or '\t' is the delimiter, this should be treated *just* like ',' being the delimiter, where the expected output is:

    ['1', '2', '3', '4', '', '5']

At least, that's what I would expect. Treating contiguous blocks of whitespace as single delimiters is perfectly reasonable when None is provided as the delimiter, but when an explicit delimiter has been provided, it strikes me that the code shouldn't try to further interpret it...

Does anyone else have any opinion here?

Zach

On Dec 1, 2008, at 1:21 PM, Pierre GM wrote:
Well, looks like the attachment is too big, so here's the implementation. The tests will come in another message.
<genload_proposal.py>
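[Editorial sketch] The distinction Zach draws maps directly onto plain str.split, which may make the two behaviors easier to see side by side:

```python
# An explicit single-space delimiter preserves empty fields, while
# None (the "no delimiter" case) collapses runs of whitespace.
line = "1 2 3 4  5"  # note the double space before the 5

print(line.split(' '))  # ['1', '2', '3', '4', '', '5']
print(line.split())     # ['1', '2', '3', '4', '5']

# A tab-delimited header with embedded spaces should split on tabs only,
# which is the behavior being asked for:
header = "Gene Name\tPubMed ID\tStarting Position"
print(header.split('\t'))  # ['Gene Name', 'PubMed ID', 'Starting Position']
```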

Zachary Pincus wrote:
Specifically, on line 115 in LineSplitter, we have: self.delimiter = delimiter.strip() or None so if I pass in, say, '\t' as the delimiter, self.delimiter gets set to None, which then causes the default behavior of any-whitespace-is- delimiter to be used. This makes lines like "Gene Name\tPubMed ID \tStarting Position" get split wrong, even when I explicitly pass in '\t' as the delimiter!
Similarly, I believe that some of the tests are formulated wrong: def test_nodelimiter(self): "Test LineSplitter w/o delimiter" strg = " 1 2 3 4 5 # test" test = LineSplitter(' ')(strg) assert_equal(test, ['1', '2', '3', '4', '5'])
I think that treating an explicitly-passed-in ' ' delimiter as identical to 'no delimiter' is a bad idea. If I say that ' ' is the delimiter, or '\t' is the delimiter, this should be treated *just* like ',' being the delimiter, where the expected output is: ['1', '2', '3', '4', '', '5']
At least, that's what I would expect. Treating contiguous blocks of whitespace as single delimiters is perfectly reasonable when None is provided as the delimiter, but when an explicit delimiter has been provided, it strikes me that the code shouldn't try to further- interpret it...
Does anyone else have any opinion here?
I agree. If the user explicitly passes something as a delimiter, we should use it and not try to be too smart.

+1

Ryan

-- 
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma

Pierre GM wrote:
Well, looks like the attachment is too big, so here's the implementation. The tests will come in another message.
A couple of quick nitpicks:

1) On line 186 (in the NameValidator class), you use excludelist.append() to append a list to the end of a list. I think you meant to use excludelist.extend()

2) When validating a list of names, why do you insist on lower casing them? (I'm referring to the call to lower() on line 207.) On one hand, this would seem nicer than all upper case, but on the other hand this can cause confusion for someone who sees certain casing of names in the file and expects the data to be laid out the same.

Other than those, it's working fine for me here.

Ryan

-- 
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma

On Dec 2, 2008, at 3:12 PM, Ryan May wrote:
Pierre GM wrote:
Well, looks like the attachment is too big, so here's the implementation. The tests will come in another message.
A couple of quick nitpicks:
1) On line 186 (in the NameValidator class), you use excludelist.append() to append a list to the end of a list. I think you meant to use excludelist.extend()
Good call.
2) When validating a list of names, why do you insist on lower casing them? (I'm referring to the call to lower() on line 207). On one hand, this would seem nicer than all upper case, but on the other hand this can cause confusion for someone who sees certain casing of names in the file and expects that data to be laid out the same.
I recall a life where names were case-insensitive, so 'dates' and 'Dates' and 'DATES' were the same field. It should be easy enough to get rid of that limitation, or add a parameter for case-sensitivity.

On Dec 2, 2008, at 2:47 PM, Zachary Pincus wrote:
Specifically, on line 115 in LineSplitter, we have: self.delimiter = delimiter.strip() or None so if I pass in, say, '\t' as the delimiter, self.delimiter gets set to None, which then causes the default behavior of any-whitespace-is- delimiter to be used. This makes lines like "Gene Name\tPubMed ID \tStarting Position" get split wrong, even when I explicitly pass in '\t' as the delimiter!
OK, I'll check that.
I think that treating an explicitly-passed-in ' ' delimiter as identical to 'no delimiter' is a bad idea. If I say that ' ' is the delimiter, or '\t' is the delimiter, this should be treated *just* like ',' being the delimiter, where the expected output is: ['1', '2', '3', '4', '', '5']
Valid point. Well, all, stay tuned for yet another "yet another implementation..."
Other than those, it's working fine for me here.
Ryan

Pierre GM wrote:
I think that treating an explicitly-passed-in ' ' delimiter as identical to 'no delimiter' is a bad idea. If I say that ' ' is the delimiter, or '\t' is the delimiter, this should be treated *just* like ',' being the delimiter, where the expected output is: ['1', '2', '3', '4', '', '5']
Valid point. Well, all, stay tuned for yet another "yet another implementation..."
While we're at it, it might be nice to be able to pass in more than one delimiter: ('\t', ' '). Though maybe the only combination that I'd really want would be something and '\n', which I think is being treated specially already.

-Chris

-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA 98115        (206) 526-6317   main reception

Chris.Barker@noaa.gov
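[Editorial sketch] One way multiple delimiters could be handled internally is a regular-expression split over a character class built from the delimiters; the pattern below is illustrative only, not part of any proposed API:

```python
import re

# Split on either a tab or a single space, as in the ('\t', ' ') example.
line = "1\t2 3\t4"
print(re.split(r"[\t ]", line))  # ['1', '2', '3', '4']
```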

Chris,
I can try, but in that case, please write me a unittest, so that I have a clear and unambiguous idea of what you expect. ANFSCD, have you tried the missing_values option?

On Dec 2, 2008, at 5:36 PM, Christopher Barker wrote:
Pierre GM wrote:
I think that treating an explicitly-passed-in ' ' delimiter as identical to 'no delimiter' is a bad idea. If I say that ' ' is the delimiter, or '\t' is the delimiter, this should be treated *just* like ',' being the delimiter, where the expected output is: ['1', '2', '3', '4', '', '5']
Valid point. Well, all, stay tuned for yet another "yet another implementation..."
While we're at it, it might be nice to be able to pass in more than one delimiter: ('\t',' '). though maybe that only combination that I'd really want would be something and '\n', which I think is being treated specially already.
-Chris
-- Christopher Barker, Ph.D. Oceanographer
Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov

Pierre GM wrote:
I can try, but in that case, please write me a unittest, so that I have a clear and unambiguous idea of what you expect.
fair enough, though I'm not sure when I'll have time to do it.

I do wonder if anyone else thinks it would be useful to have multiple delimiters as an option. I got the idea because with fromfile(), if you specify, say, ',' as the delimiter, it won't use '\n', only a comma, so there is no way to quickly read a whole bunch of comma-delimited data like:

    1,2,3,4
    5,6,7,8
    ....

so I'd like to be able to say to use either ',' or '\n' as the delimiter.

However, if I understand loadtxt() correctly, it's handling the new lines separately anyway (to get a 2-d array), so this use case isn't an issue. So how likely is it that someone would have:

    1 2 3, 4, 5
    6 7 8, 8, 9

and want to read that into a single 2-d array? I'm not sure I've seen it.

-Chris

-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA 98115        (206) 526-6317   main reception

Chris.Barker@noaa.gov

On Dec 3, 2008, at 12:48 PM, Christopher Barker wrote:
Pierre GM wrote:
I can try, but in that case, please write me a unittest, so that I have a clear and unambiguous idea of what you expect.
fair enough, though I'm not sure when I'll have time to do it.
Oh, don't worry, nothing too fancy: give me a couple lines of input data and a line with what you expect. Using Ryan's recent example:

    f = StringIO('stid stnm relh tair\nnrmn 121 45 9.1')
    test = loadtxt(f, usecols=('stid', 'relh', 'tair'), names=True,
                   dtype=None)
    control = array(('nrmn', 45, 9.0999999999999996),
                    dtype=[('stid', '|S4'), ('relh', '<i8'), ('tair', '<f8')])

That's quite enough for a test.
I do wonder if anyone else thinks it would be useful to have multiple delimiters as an option. I got the idea because with fromfile(), if you specify, say ',' as the delimiter, it won't use '\n', only a comma, so there is no way to quickly read a whole bunch of comma delimited data like:
1,2,3,4
5,6,7,8
....
so I'd like to be able to say to use either ',' or '\n' as the delimiter.
I'm not quite sure I follow you. Do you want two delimiters, one for the fields of a record (','), one for the records ('\n')?
However, if I understand loadtxt() correctly, it's handling the new lines separately anyway (to get a 2-d array), so this use case isn't an issue. So how likely is it that someone would have:
1 2 3, 4, 5
6 7 8, 8, 9
and want to read that into a single 2-d array?
With the current behaviour, you're gonna have [("1 2 3", 4, 5), ("6 7 8", 8, 9)] if you use "," as a delimiter, and [(1, 2, "3,", "4,", 5), (6, 7, "8,", "8,", 9)] if you use " " as a delimiter. Mixing delimiters is doable, but I don't think it's that good an idea. I'm in favor of sticking to one and only one field delimiter, and the default line separator for record delimiter. In other terms, not changing anything.
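[Editorial sketch] The two outcomes Pierre describes can be reproduced with plain str.split on the sample line:

```python
line = "1 2 3, 4, 5"

# Field delimiter ',': the spaces stay inside the first field.
print(line.split(','))  # ['1 2 3', ' 4', ' 5']

# Field delimiter ' ': the commas stay attached to the fields.
print(line.split(' '))  # ['1', '2', '3,', '4,', '5']
```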

Pierre GM wrote:
Oh, don't worry, nothing too fancy: give me a couple lines of input data and a line with what you expect.
I just went and looked at the existing tests, and you're right, it's very easy -- my first foray into the new nose tests -- very nice!
specify, say ',' as the delimiter, it won't use '\n', only a comma, so there is no way to quickly read a whole bunch of comma delimited data like:
1,2,3,4
5,6,7,8
....
so I'd like to be able to say to use either ',' or '\n' as the delimiter.
I'm not quite sure I follow you. Do you want two delimiters, one for the fields of a record (','), one for the records ('\n')?
well, in the case of fromfile(), it doesn't "do" records -- it will only give you a 1-d array, so I want it all as a flat array, and you can re-size it yourself later. Clearly this is more work (and requires more knowledge of your data) than using loadtxt, but sometimes I really want FAST data reading of simple formats. However, this isn't fromfile() we are talking about now, it's loadtxt()...
So how likely is it that someone would have:
1 2 3, 4, 5
6 7 8, 8, 9
and want to read that into a single 2-d array?
With the current behaviour, you're gonna have [("1 2 3", 4, 5), ("6 7 8", 8, 9)] if you use "," as a delimiter, and [(1, 2, "3,", "4,", 5), (6, 7, "8,", "8,", 9)] if you use " " as a delimiter.
right.
Mixing delimiter is doable, but I don't think it's that a good idea.
I can't come up with a use case at this point, so..
I'm in favor of sticking to one and only one field delimiter, and the default line separator for record delimiter. In other terms, not changing anything.
I agree -- sorry for the noise!

-Chris

-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA 98115        (206) 526-6317   main reception

Chris.Barker@noaa.gov

by the way, should this work:

    io.loadtxt('junk.dat', delimiter=' ')

for more than one space between numbers, like:

    1  2  3  4  5
    6  7  8  9  10

I get:

    >>> io.loadtxt('junk.dat', delimiter=' ')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/numpy/lib/io.py", line 403, in loadtxt
        X.append(tuple([conv(val) for (conv, val) in zip(converters, vals)]))
    ValueError: empty string for float()

with the current version.
    >>> io.loadtxt('junk.dat', delimiter=None)
    array([[  1.,   2.,   3.,   4.,   5.],
           [  6.,   7.,   8.,   9.,  10.]])
does work.

-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA 98115        (206) 526-6317   main reception

Chris.Barker@noaa.gov

On Dec 3, 2008, at 1:00 PM, Christopher Barker wrote:
by the way, should this work:
io.loadtxt('junk.dat', delimiter=' ')
for more than one space between numbers, like:
1  2  3  4  5
6  7  8  9  10
On the version I'm working on, both delimiter='' and delimiter=None (default) would give you the expected output. delimiter=' ' (a single space) would fail, delimiter='  ' (two spaces) would work.

Pierre GM wrote:
On Dec 3, 2008, at 1:00 PM, Christopher Barker wrote:
for more than one space between numbers, like:
1  2  3  4  5
6  7  8  9  10
On the version I'm working on, both delimiter='' and delimiter=None (default) would give you the expected output.
so empty string and None both mean "any white space"? also tabs, etc?
delimiter=' ' would fail,
So ' ' means only exactly that delimiter. Is that so things like '\t' will work right? But what about:

    4, 5, 34,123, ....

In that case, ',' is the delimiter, but whitespace is ignored. Or:

    4\t 5\t 34\t 123.

We're ignoring extra whitespace there, too, so I'm not sure why we shouldn't ignore it in the ' ' case also.

delimiter='  ' would work.

but in my example, there were sometimes two spaces, sometimes three -- so I think it would fail, no?
"1 2 3 4 5".split(' ') ['1', '2', '3', '4', ' 5']
actually, that would work, but four spaces wouldn't.
"1 2 3 4 5".split(' ') ['1', '2', '3', '4', '', '5']
I guess the solution is to use delimiter=None in that case, and it does make sense that you can't have ' ' mean "one or more spaces" while '\t' means "only one tab".

-Chris

-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA 98115        (206) 526-6317   main reception

Chris.Barker@noaa.gov

Pierre GM wrote:
I think that treating an explicitly-passed-in ' ' delimiter as identical to 'no delimiter' is a bad idea. If I say that ' ' is the delimiter, or '\t' is the delimiter, this should be treated *just* like ',' being the delimiter, where the expected output is: ['1', '2', '3', '4', '', '5']
Valid point. Well, all, stay tuned for yet another "yet another implementation..."
Found a problem. If you read the names from the file and specify usecols, you end up with the first N names read from the file as the fields in your output (where N is the number of entries in usecols), instead of having the names of the columns you asked for. For instance:
    >>> from StringIO import StringIO
    >>> from genload_proposal import loadtxt
    >>> f = StringIO('stid stnm relh tair\nnrmn 121 45 9.1')
    >>> loadtxt(f, usecols=('stid', 'relh', 'tair'), names=True, dtype=None)
    array(('nrmn', 45, 9.0999999999999996),
          dtype=[('stid', '|S4'), ('stnm', '<i8'), ('relh', '<f8')])
What I want to come out is:

    array(('nrmn', 45, 9.0999999999999996),
          dtype=[('stid', '|S4'), ('relh', '<i8'), ('tair', '<f8')])

I've attached a version that fixes this by setting a flag internally if the names are read from the file. If this flag is true, at the end the names are filtered down to only the ones that are given in usecols.

I also have one other thought. Is there any way we can make this handle object arrays, or rather, a field containing objects, specifically datetime objects? Right now, this does not work because calling view does not work for object arrays. I'm just looking for a simple way to store date/time in my record array (currently a string field).

Ryan

-- 
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma
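[Editorial sketch] The name-filtering step Ryan describes amounts to selecting, from the names read off the header line, only those requested in usecols; the helper below is illustrative only, not the attached implementation:

```python
def filter_names(names_from_file, usecols):
    # Keep only the requested names, in file (column) order.
    indices = sorted(names_from_file.index(name) for name in usecols)
    return [names_from_file[i] for i in indices]

names = ['stid', 'stnm', 'relh', 'tair']
print(filter_names(names, ('stid', 'relh', 'tair')))  # ['stid', 'relh', 'tair']
```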

On Dec 3, 2008, at 11:41 AM, Ryan May wrote:
Found a problem. If you read the names from the file and specify usecols, you end up with the first N names read from the file as the fields in your output (where N is the number of entries in usecols), instead of having the names of the columns you asked for.
<..>
I've attached a version that fixes this by setting a flag internally if the names are read from the file. If this flag is true, at the end the names are filtered down to only the ones that are given in usecols.
OK, thx. I'll take that into account and post a new version by the end of the day.
I also have one other thought. Is there any way we can make this handle object arrays, or rather, a field containing objects, specifically datetime objects? Right now, this does not work because calling view does not work for object arrays. I'm just looking for a simple way to store date/time in my record array (currently a string field).
It does already: you can upgrade the mapper of StringConverter to support datetime objects. Check an earlier post by JDH and my answer. I'll add an example in the test suite.
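[Editorial sketch] The earlier post shows the registration call (StringConverter.upgrade_mapper(dateutil.parser.parse, default=...)); the toy class below models what such an upgrade achieves, so that converters created afterwards can fall through to the date parser. This is an illustrative stand-in, not the real StringConverter:

```python
import datetime

class StringConverter:
    # Class-level mapper: (type, conversion, default) tried in order.
    _mapper = [(int, int, -1), (float, float, float("nan"))]

    @classmethod
    def upgrade_mapper(cls, func, default=None):
        # Register an extra conversion for all future converters.
        cls._mapper = cls._mapper + [(object, func, default)]

    def __call__(self, value):
        for _dtype, func, _default in self._mapper:
            try:
                return func(value)
            except ValueError:
                continue
        raise ValueError("Cannot convert string '%s'" % value)

def parse(value):
    return datetime.datetime.strptime(value, "%Y-%m-%d").date()

StringConverter.upgrade_mapper(parse, default=datetime.date(1900, 1, 1))
print(StringConverter()("2008-10-14"))  # 2008-10-14
```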

If I know my data is already clean and is handled nicely by the old loadtxt, will I be able to turn off the special handling in order to retain the old load speed?

Alan Isaac

On Dec 3, 2008, at 12:32 PM, Alan G Isaac wrote:
If I know my data is already clean and is handled nicely by the old loadtxt, will I be able to turn off the special handling in order to retain the old load speed?
Hopefully. I'm looking for the best way to do it. Do you have an example you could send me off-list so that I can play with timers? Thx in advance.
P.

Alan G Isaac wrote:
If I know my data is already clean and is handled nicely by the old loadtxt, will I be able to turn off the special handling in order to retain the old load speed?
what I'd like to see is a version of loadtxt built on a slightly enhanced fromfile() -- that would be blazingly fast for the easy cases (simple tabular data of one dtype). I don't know if the special-casing should be automatic, or just have it be a separate function. Also, fromfile() needs some work, and it needs to be done in C, which is less fun, so who knows when it will get done.

As I think about it, maybe what I really want is a simple version of loadtxt written in C:

It would only handle one data type at a time.
It would support simple comment lines.
It would only support one delimiter (plus newline).
It would create a 2-d array from normal, tabular data.
You could specify: how many numbers you wanted, or how many rows, or read 'till EOF.

Actually, this is a lot like matlab's fscanf() ... someday.

-Chris

-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA 98115        (206) 526-6317   main reception

Chris.Barker@noaa.gov
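[Editorial sketch] For reference, the behavior Chris lists (one dtype, one delimiter, comment lines, rows of a 2-d result) can be stated in a few lines of pure Python; the real version would of course be C, and this helper name is made up:

```python
def simple_loadtxt(lines, dtype=float, delimiter=None, comments='#'):
    # One dtype, one delimiter (None = any whitespace), '#' comments.
    rows = []
    for line in lines:
        line = line.split(comments, 1)[0].strip()
        if not line:
            continue  # skip blank and comment-only lines
        rows.append([dtype(field) for field in line.split(delimiter)])
    return rows  # row-major 2-d data

data = ["# a comment", "1 2 3", "4 5 6"]
print(simple_loadtxt(data))  # [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
```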

Alan G Isaac wrote:
If I know my data is already clean and is handled nicely by the old loadtxt, will I be able to turn off the special handling in order to retain the old load speed?
Alan Isaac
Hi all,
that's going in the same direction I was thinking about. When I thought about an improved version of loadtxt, I wished it was fault tolerant without losing too much performance. So my solution was much simpler than the very nice genloadtxt function -- and it works for me.

My ansatz is to leave the existing loadtxt function unchanged. I only replaced the default converter calls by a fault-tolerant converter class. I attached a patch against io.py in numpy 1.2.1.

The nice thing is that it not only handles missing values, but for example also columns/fields with non-number characters. It just returns nan in these cases. This is of practical importance for many datafiles of astronomical catalogues, for example the Hipparcos catalogue data.

Regarding the performance, it is a little bit slower than the original loadtxt, but not much: on my machine, reading a clean testfile with 3 columns and 20000 rows 10x, I get the following results:

    original loadtxt: ~1.3s
    modified loadtxt: ~1.7s
    new genloadtxt  : ~2.7s

So you see, there is some loss of performance, but not as much as with the new converter class.

I hope this solution is of interest ...

Manuel

Manuel Metz wrote:
Alan G Isaac wrote:
If I know my data is already clean and is handled nicely by the old loadtxt, will I be able to turn off the special handling in order to retain the old load speed?
Alan Isaac
Hi all, that's going in the same direction I was thinking about. When I thought about an improved version of loadtxt, I wished it was fault tolerant without losing too much performance. So my solution was much simpler than the very nice genloadtxt function -- and it works for me.
My ansatz is to leave the existing loadtxt function unchanged. I only replaced the default converter calls by a fault tolerant converter class. I attached a patch against io.py in numpy 1.2.1
The nice thing is that it not only handles missing values, but for example also columns/fields with non-number characters. It just returns nan in these cases. This is of practical importance for many datafiles of astronomical catalogues, for example the Hipparcos catalogue data.
Regarding the performance, it is a little bit slower than the original loadtxt, but not much: on my machine, 10x reading in a clean testfile with 3 columns and 20000 rows I get the following results:
original loadtxt: ~1.3s
modified loadtxt: ~1.7s
new genloadtxt  : ~2.7s
So you see, there is some loss of performance, but not as much as with the new converter class.
I hope this solution is of interest ...
Manuel
Oops, wrong version of the diff file. Wanted to name the class "_faulttolerantconv" ...

Manuel,
Looks nice, I'm gonna try to see how I can incorporate yours. Note that returning np.nan by default will not work w/ Python 2.6 if you want an int...
participants (7)
- Alan G Isaac
- Christopher Barker
- John Hunter
- Manuel Metz
- Pierre GM
- Ryan May
- Zachary Pincus