loadtxt/savetxt tickets
Hi All, Could someone with an interest in loadtxt/savetxt look through the associated tickets? A search on the tickets using either of those keys will return fairly lengthy lists. Chuck
Hi! I have had a look at the list of numpy.loadtxt tickets. I have never contributed to numpy before, so I may be doing stupid things - don't be afraid to let me know! My opinions are my own, and in detail, they are:

1752: I attach a possible patch. FWIW, I agree with the request. The patch is written to be compatible with the fix in ticket #1562, but I did not test that yet.

1731: This seems like a rather trivial feature enhancement. I attach a possible patch.

1616: The suggested patch seems reasonable to me, but I do not have a full list of what objects loadtxt supports today as opposed to what this patch will support.

1562: I attach a possible patch. This could also be the default behavior to my mind, since the function caller can simply call numpy.squeeze if needed. Changing the default behavior would probably break old code, however.

1458: The fix suggested in the ticket seems reasonable, but I have never used record arrays, so I am not sure of this.

1445: Adding this functionality could break old code, as some old datafiles may have empty lines which are now simply ignored. I do not think the feature is a good idea. It could rather be implemented as a separate function.

1107: I do not see the need for this enhancement. In my eyes, the usecols kwarg does this and more. Perhaps I am misunderstanding something here.

1071: It is not clear to me whether loadtxt is supposed to support missing values in the fashion indicated in the ticket.

1163, 1565: These tickets seem to stem from the same underlying problem. I attach one possible patch. The previously suggested patches that I've seen will not correctly convert floats to ints, which I believe my patch will.

I hope you find this useful! Is there some way of submitting the patches for review in a more convenient fashion than e-mail?

Cheers, Paul.

On 25. mars 2011, at 16.06, Charles R Harris wrote:
Hi All,
Could someone with an interest in loadtxt/savetxt look through the associated tickets? A search on the tickets using either of those keys will return fairly lengthy lists.
Chuck
Hi, Thanks! On Sat, 26 Mar 2011 13:11:46 +0100, Paul Anton Letnes wrote: [clip]
I hope you find this useful! Is there some way of submitting the patches for review in a more convenient fashion than e-mail?
You can attach them on the trac to each ticket. That way they'll be easy to find later on. Pauli
Hi, On 26 Mar 2011, at 14:36, Pauli Virtanen wrote:
On Sat, 26 Mar 2011 13:11:46 +0100, Paul Anton Letnes wrote: [clip]
I hope you find this useful! Is there some way of submitting the patches for review in a more convenient fashion than e-mail?
You can attach them on the trac to each ticket. That way they'll be easy to find later on.
I've got some comments on 1562, and I'd attach a revised patch then - just a general question: should I then change "Milestone" to 1.6.0 and "Version" to 'devel'?
1562: I attach a possible patch. This could also be the default behavior to my mind, since the function caller can simply call numpy.squeeze if needed. Changing default behavior would probably break old code,
Seems the fastest solution unless someone wants to change numpy.squeeze as well. But the present patch does not call np.squeeze any more at all, so I propose to restore that behaviour for X.ndim > ndmin to remain really backwards compatible. It also seems easier to code when making the default ndmin=0. Cheers, Derek
Hi again, On 26 Mar 2011, at 15:20, Derek Homeier wrote:
1562: I attach a possible patch. This could also be the default behavior to my mind, since the function caller can simply call numpy.squeeze if needed. Changing default behavior would probably break old code,
Seems the fastest solution unless someone wants to change numpy.squeeze as well. But the present patch does not call np.squeeze any more at all, so I propose to restore that behaviour for X.ndim > ndmin to remain really backwards compatible. It also seems easier to code when making the default ndmin=0.
I've got another somewhat general question: since it would probably be nice to have a test for this, I found one could simply add something along the lines of assert_equal(a.shape, x.shape) to test_io.py - test_shaped_dtype(self). Or should one generally create a new test for such things? (That might still be better in this case, since test_shaped_dtype does not really test different ndim.) Cheers, Derek
Hi Derek! On 26. mars 2011, at 15.48, Derek Homeier wrote:
[snip]
I've got another somewhat general question: since it would probably be nice to have a test for this, I found one could simply add something along the lines of
assert_equal(a.shape, x.shape)
to test_io.py - test_shaped_dtype(self) or should one generally create a new test for such things (might still be better in this case, since test_shaped_dtype does not really test different ndim)?
Cheers, Derek
It would be nice to see your patch. I uploaded all of mine as mentioned. I'm no testing expert, but I am sure someone else will comment on it. Paul.
On Sat, Mar 26, 2011 at 8:53 AM, Paul Anton Letnes <paul.anton.letnes@gmail.com> wrote:
[snip]
It would be nice to see your patch. I uploaded all of mine as mentioned. I'm no testing expert, but I am sure someone else will comment on it.
I put all these patches together at https://github.com/charris/numpy/tree/loadtxt-savetxt. Please pull from there to continue work on loadtxt/savetxt so as to avoid conflicts in the patches. One of the numpy tests is failing, I assume from patch conflicts, and more tests for the tickets are needed in any case. Also, new keywords should be added to the end, not put in the middle of existing keywords. I haven't reviewed the patches, just tried to get them organized. Also, I have Derek as the author on all of them, that can be changed if it is decided the credit should go elsewhere ;) Thanks for the work you all have been doing on these tickets. Chuck
Hi Paul, having had a look at the other tickets you dug up,
My opinions are my own, and in detail, they are: 1752: I attach a possible patch. FWIW, I agree with the request. The patch is written to be compatible with the fix in ticket #1562, but I did not test that yet.
Tested, see also my comments on Trac.
1731: This seems like a rather trivial feature enhancement. I attach a possible patch.
Agreed. Haven't tested it though.
1616: The suggested patch seems reasonable to me, but I do not have a full list of what objects loadtxt supports today as opposed to what this patch will support.
1562: I attach a possible patch. This could also be the default behavior to my mind, since the function caller can simply call numpy.squeeze if needed. Changing default behavior would probably break old code, however.
See comments on Trac as well.
1458: The fix suggested in the ticket seems reasonable, but I have never used record arrays, so I am not sure of this.
There were some issues with Python3, and I also had some general reservations as noted on Trac - basically, it makes 'unpack' equivalent to transposing for 2D-arrays, but to splitting into fields for 1D-recarrays. My question was, what's going to happen when you get to 2D-recarrays? Currently this is not an issue since loadtxt can only read 2D regular or 1D structured arrays. But this might change if the data block functionality (see below) were to be implemented - data could then be returned as 3D arrays or 2D structured arrays... Still, it would probably make most sense (or at least give the widest functionality) to have 'unpack=True' always return a list or iterator over columns.
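To make the two behaviours concrete, a minimal sketch (the structured-dtype branch shows the behaviour proposed in #1458, not what current loadtxt necessarily does):

```python
import numpy as np
from io import StringIO

# Regular 2D array: unpack=True is equivalent to transposing,
# so each target variable receives one column.
x, y = np.loadtxt(StringIO("1 2\n3 4\n5 6"), unpack=True)
# x -> [1. 3. 5.],  y -> [2. 4. 6.]

# Structured dtype: loadtxt returns a 1D array of records; the
# proposal would make unpack=True split it into its fields instead.
rec = np.loadtxt(StringIO("1 2\n3 4"), dtype=[('a', float), ('b', float)])
a, b = rec['a'], rec['b']   # what the patched unpack=True would hand back
```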
1445: Adding this functionality could break old code, as some old datafiles may have empty lines which are now simply ignored. I do not think the feature is a good idea. It could rather be implemented as a separate function. 1107: I do not see the need for this enhancement. In my eyes, the usecols kwarg does this and more. Perhaps I am misunderstanding something here.
Agree about #1445, and the bit about 'usecols' - 'numcols' would just provide a shorter call to e.g. read the first 20 columns of a file (well, not even that much over 'usecols=range(20)'...); I don't think that justifies an extra argument. But the 'datablocks' idea provides something new that a number of people seem to miss from e.g. gnuplot (including me, actually ;-). And it would also satisfy the request from #1445 without breaking backwards compatibility. I've been wondering if one could instead specify the separator lines through the parameter, e.g. "blocksep=['None', 'blank', 'invalid']" - not sure if that would make it more useful...
1071: It is not clear to me whether loadtxt is supposed to support missing values in the fashion indicated in the ticket.
In principle it should at least allow you to, by the use of converters as described there. The problem is, the default delimiter is described as 'any whitespace', which in the present implementation obviously includes any number of blanks or tabs. These are therefore treated differently from delimiters like ',' or '&'. I'd reckon there are too many people actually relying on this behaviour to silently change it (e.g. I know plenty of tables with columns separated by either one or several tabs depending on the length of the previous entry). But the tab is apparently also treated differently if explicitly specified with "delimiter='\t'" - and in that case a converter à la {2: lambda s: float(s or 'Nan')} works for fields in the middle of the line, but not at the end - which clearly warrants improvement. I've prepared a patch working for Python3 as well.
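The converter workaround mentioned there looks roughly like this; a sketch using an explicit ',' delimiter, where it does work for fields in the middle of a line:

```python
import numpy as np
from io import StringIO

# With an explicit delimiter the empty field reaches the converter
# as an empty string, so it can substitute NaN for the missing entry:
conv = {1: lambda s: float(s or 'nan')}
a = np.loadtxt(StringIO("1,2,3\n4,,6\n7,8,9"),
               delimiter=',', converters=conv)
# a[1, 1] is nan instead of the load raising a ValueError
```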
1163: 1565: These tickets seem to have the same origin of the problem. I attach one possible patch. The previously suggested patches that I've seen will not correctly convert floats to ints, which I believe my patch will.
+1, though I am a bit concerned that prompting to raise a ValueError for every element could impede performance. I'd probably still enclose it into an if issubclass(typ, np.uint64) or issubclass(typ, np.int64): just like in npio.patch. I also thought one might switch to int(float128(x)) in that case, but at least for the given examples float128 cannot convert with more accuracy than float64 (even on PowerPC ;-). There were some dissenting opinions that trying to read a float into an int should generally throw an exception though... And Chuck just beat me... On 26 Mar 2011, at 21:25, Charles R Harris wrote:
I put all these patches together at https://github.com/charris/numpy/tree/loadtxt-savetxt . Please pull from there to continue work on loadtxt/savetxt so as to avoid conflicts in the patches. One of the numpy tests is failing, I assume from patch conflicts, and more tests for the tickets are needed in any case. Also, new keywords should be added to the end, not put in the middle of existing keywords.
I haven't reviewed the patches, just tried to get them organized. Also, I have Derek as the author on all of them, that can be changed if it is decided the credit should go elsewhere ;) Thanks for the work you all have been doing on these tickets.
Thanks, I'll have a look at the new ticket and try to get that organized! Cheers, Derek
On 26. mars 2011, at 21.44, Derek Homeier wrote:
Hi Paul,
having had a look at the other tickets you dug up,
My opinions are my own, and in detail, they are: 1752: I attach a possible patch. FWIW, I agree with the request. The patch is written to be compatible with the fix in ticket #1562, but I did not test that yet.
Tested, see also my comments on Trac.
Great!
1731: This seems like a rather trivial feature enhancement. I attach a possible patch.
Agreed. Haven't tested it though.
Great!
1616: The suggested patch seems reasonable to me, but I do not have a full list of what objects loadtxt supports today as opposed to what this patch will support.
Looks like you got this one. Just remember to make it compatible with #1752. Should be easy.
1562: I attach a possible patch. This could also be the default behavior to my mind, since the function caller can simply call numpy.squeeze if needed. Changing default behavior would probably break old code, however.
See comments on Trac as well.
Your patch is better, but there is one thing I disagree with.

    808         if X.ndim < ndmin:
    809             if ndmin == 1:
    810                 X.shape = (X.size, )
    811             elif ndmin == 2:
    812                 X.shape = (X.size, 1)

The last line should be:

    812             X.shape = (1, X.size)

If someone wants a 2D array out, they would most likely expect a one-row file to come out as a one-row array, not the other way around. IMHO.
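To make the disagreement concrete, a tiny shape demonstration (illustrative only, not the patch itself):

```python
import numpy as np

X = np.array([1., 2., 3.])         # parsed contents of a one-row file
print(X.reshape(X.size, 1).shape)  # (3, 1): the patch as written, a column
print(X.reshape(1, X.size).shape)  # (1, 3): Paul's suggestion, a row
```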
1458: The fix suggested in the ticket seems reasonable, but I have never used record arrays, so I am not sure of this.
There were some issues with Python3, and I also had some general reservations as noted on Trac - basically, it makes 'unpack' equivalent to transposing for 2D-arrays, but to splitting into fields for 1D-recarrays. My question was, what's going to happen when you get to 2D-recarrays? Currently this is not an issue since loadtxt can only read 2D regular or 1D structured arrays. But this might change if the data block functionality (see below) were to be implemented - data could then be returned as 3D arrays or 2D structured arrays... Still, it would probably make most sense (or at least give the widest functionality) to have 'unpack=True' always return a list or iterator over columns.
OK, I don't know recarrays, as I said.
1445: Adding this functionality could break old code, as some old datafiles may have empty lines which are now simply ignored. I do not think the feature is a good idea. It could rather be implemented as a separate function. 1107: I do not see the need for this enhancement. In my eyes, the usecols kwarg does this and more. Perhaps I am misunderstanding something here.
Agree about #1445, and the bit about 'usecols' - 'numcols' would just provide a shorter call to e.g. read the first 20 columns of a file (well, not even that much over 'usecols=range(20)'...), don't think that justifies an extra argument. But the 'datablocks' provides something new, that a number of people seem to miss from e.g. gnuplot (including me, actually ;-). And it would also satisfy the request from #1445 without breaking backwards compatibility. I've been wondering if could instead specify the separator lines through the parameter, e.g. "blocksep=['None', 'blank','invalid']", not sure if that would make it more useful...
What about writing a separate function, e.g. loadblocktxt, and have it separate the chunks and call loadtxt for each chunk? Just a thought. Another possibility would be to write a function that would let you load a set of text files in a directory, and return a dict of datasets, one per file. One could write a similar save-function, too. They would just need to call loadtxt/savetxt on a per-file basis.
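A rough sketch of that idea (the name loadblocktxt and the blank-line block separator are assumptions here, not an agreed interface):

```python
import numpy as np
from io import StringIO

def loadblocktxt(fname, **kwargs):
    """Hypothetical helper: split a file into blocks separated by
    blank lines and parse each block separately with np.loadtxt."""
    with open(fname) as f:
        blocks = f.read().split('\n\n')
    return [np.loadtxt(StringIO(block), **kwargs)
            for block in blocks if block.strip()]
```

Each returned element is an ordinary array, so all existing loadtxt keywords keep working per block.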
1071: It is not clear to me whether loadtxt is supposed to support missing values in the fashion indicated in the ticket.
In principle it should at least allow you to, by the use of converters as described there. The problem is, the default delimiter is described as 'any whitespace', which in the present implementation obviously includes any number of blanks or tabs. These are therefore treated differently from delimiters like ',' or '&'. I'd reckon there are too many people actually relying on this behaviour to silently change it (e.g. I know plenty of tables with columns separated by either one or several tabs depending on the length of the previous entry). But the tab is apparently also treated differently if explicitly specified with "delimiter='\t'" - and in that case using a converter à la {2: lambda s: float(s or 'Nan')} is working for fields in the middle of the line, but not at the end - clearly warrants improvement. I've prepared a patch working for Python3 as well.
Great!
1163: 1565: These tickets seem to have the same origin of the problem. I attach one possible patch. The previously suggested patches that I've seen will not correctly convert floats to ints, which I believe my patch will.
+1, though I am a bit concerned that prompting to raise a ValueError for every element could impede performance. I'd probably still enclose it into an if issubclass(typ, np.uint64) or issubclass(typ, np.int64): just like in npio.patch. I also thought one might switch to int(float128(x)) in that case, but at least for the given examples float128 cannot convert with more accuracy than float64 (even on PowerPC ;-). There were some dissenting opinions that trying to read a float into an int should generally throw an exception though...
And Chuck just beat me...
I am sure someone has been using this functionality to convert floats to ints. Changing it will break their code. I am not sure how big a deal that would be. Also, I am of the opinion that one should _first_ write a program that works _correctly_, and only afterwards worry about performance. Even so, using numpy.int64 would be better, because it would give a sensible error message.
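A quick illustration of both points (values chosen for the example; none of this is from the attached patches):

```python
import numpy as np

s = '9007199254740993'   # 2**53 + 1, not exactly representable in float64
print(int(float(s)))     # 9007199254740992 -- silently off by one
print(np.int64(s))       # 9007199254740993 -- exact
np.int64('2.5')          # raises ValueError with a sensible message
```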
On 26 Mar 2011, at 21:25, Charles R Harris wrote:
I put all these patches together at https://github.com/charris/numpy/tree/loadtxt-savetxt . Please pull from there to continue work on loadtxt/savetxt so as to avoid conflicts in the patches. One of the numpy tests is failing, I assume from patch conflicts, and more tests for the tickets are needed in any case. Also, new keywords should be added to the end, not put in the middle of existing keywords.
I haven't reviewed the patches, just tried to get them organized. Also, I have Derek as the author on all of them, that can be changed if it is decided the credit should go elsewhere ;) Thanks for the work you all have been doing on these tickets.
Thanks, I'll have a look at the new ticket and try to get that organized!
With some luck, all the loadtxt tickets should be closed in short time :-) Best, Paul.
On Sun, Mar 27, 2011 at 12:09 PM, Paul Anton Letnes <paul.anton.letnes@gmail.com> wrote:
I am sure someone has been using this functionality to convert floats to ints. Changing will break their code. I am not sure how big a deal that would be. Also, I am of the opinion that one should _first_ write a program that works _correctly_, and only afterwards worry about performance.
While I'd agree in most cases, keep in mind that np.loadtxt is supposed to be a fast but simpler alternative to np.genfromtxt. If np.loadtxt becomes much slower, there's not much need to keep these separate any longer. Regards Stéfan
On Sun, Mar 27, 2011 at 4:09 AM, Paul Anton Letnes <paul.anton.letnes@gmail.com> wrote:
[snip]
With some luck, all the loadtxt tickets should be closed in short time :-)
Can someone bring me up to date on which tickets have been dealt with? Also, how do they divide into bug fixes and enhancements? Chuck
On Thu, Mar 31, 2011 at 4:53 AM, Charles R Harris <charlesr.harris@gmail.com> wrote:
[snip]
Can someone bring me up to date on which tickets have been dealt with? Also, how do they divide into bug fixes and enhancements?
If you look in Trac under "All Tickets by Milestone" you'll find all nine tickets together under 1.6.0. Five are bug fixes, four are enhancements. There are some missing tests, but all tickets have proposed patches. Cheers, Ralf
On Thu, Mar 31, 2011 at 8:42 AM, Ralf Gommers <ralf.gommers@googlemail.com> wrote:
On Thu, Mar 31, 2011 at 4:53 AM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Sun, Mar 27, 2011 at 4:09 AM, Paul Anton Letnes <paul.anton.letnes@gmail.com> wrote:
<snip>

If you look in Trac under "All Tickets by Milestone" you'll find all
nine tickets together under 1.6.0. Five are bug fixes, four are enhancements. There are some missing tests, but all tickets have proposed patches.
OK. I changed 1562 to enhancement because it adds a keyword. With that change the current status looks like this.

Bug fixes:

1163 -- convert int64 correctly
1458 -- make loadtxt unpack structured arrays
1071 -- loadtxt fails if the last column contains an empty value; under discussion
1565 -- duplicate of 1163

Enhancements:

1107 -- support for blocks of data, adds two keywords
1562 -- add ndmin keyword to aid in getting correct dimensions; doesn't apply on top of the previous
1616 -- remove use of readline so input isn't restricted to files
1731 -- allow loadtxt to read a given number of rows, adds a keyword
1752 -- return an empty array when an empty file is encountered; conflicts with 1616

Some of this might instead belong in genfromtxt. None of the patches have tests.

Chuck
On Wed, Mar 30, 2011 at 9:53 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Sun, Mar 27, 2011 at 4:09 AM, Paul Anton Letnes <paul.anton.letnes@gmail.com> wrote:
On 26. mars 2011, at 21.44, Derek Homeier wrote:
Hi Paul,
having had a look at the other tickets you dug up,
[snip]
1071: It is not clear to me whether loadtxt is supposed to support missing values in the fashion indicated in the ticket.
In principle it should at least allow you to, by the use of converters as described there. The problem is, the default delimiter is described as 'any whitespace', which in the present implementation obviously includes any number of blanks or tabs. These are therefore treated differently from delimiters like ',' or '&'. I'd reckon there are too many people actually relying on this behaviour to silently change it (e.g. I know plenty of tables with columns separated by either one or several tabs depending on the length of the previous entry). But the tab is apparently also treated differently if explicitly specified with "delimiter='\t'" - and in that case using a converter à la {2: lambda s: float(s or 'Nan')} is working for fields in the middle of the line, but not at the end - clearly warrants improvement. I've prepared a patch working for Python3 as well.
Great!
This is an invalid ticket, because the docstring clearly states, in 3 different yet critical places, that missing values are not handled here:
"Each row in the text file must have the same number of values." "genfromtxt : Load data with missing values handled as specified." " This function aims to be a fast reader for simply formatted files. The `genfromtxt` function provides more sophisticated handling of, e.g., lines with missing values." Really I am trying to separate the usage of loadtxt and genfromtxt to avoid unnecessary duplication and confusion. Part of this is historical because loadtxt was added in 2007 and genfromtxt was added in 2009. So really certain features of loadtxt have been 'kept' for backwards compatibility purposes yet these features can be 'abused' to handle missing data. But I really consider that any missing values should cause loadtxt to fail. The patch is incorrect because it should not include a space in the split() as indicated in the comment by the original reporter. Of course a corrected patch alone still is not sufficient to address the problem without the user providing the correct converter. Also you start to run into problems with multiple delimiters (such as one space versus two spaces) so you start down the path to add all the features that duplicate genfromtxt. Bruce
On Thu, Mar 31, 2011 at 5:03 PM, Bruce Southey <bsouthey@gmail.com> wrote:
[snip]
Really I am trying to separate the usage of loadtxt and genfromtxt to avoid unnecessary duplication and confusion. Part of this is historical because loadtxt was added in 2007 and genfromtxt was added in 2009. So really certain features of loadtxt have been 'kept' for backwards compatibility purposes yet these features can be 'abused' to handle missing data. But I really consider that any missing values should cause loadtxt to fail.
I agree with you Bruce, but it would be easier to discuss this on the tickets instead of here. Could you add your comments there please? Ralf
On 03/31/2011 10:08 AM, Ralf Gommers wrote:
On Thu, Mar 31, 2011 at 5:03 PM, Bruce Southey <bsouthey@gmail.com> wrote:

[snip]

I agree with you Bruce, but it would be easier to discuss this on the tickets instead of here. Could you add your comments there please?
Ralf
'Easier' seems a contradiction when you have to use a captcha... Sure, I will add more comments there. Bruce
On 31 Mar 2011, at 17:03, Bruce Southey wrote:
This is an invalid ticket because the docstring clearly states that in 3 different, yet critical places, that missing values are not handled here:
"Each row in the text file must have the same number of values." "genfromtxt : Load data with missing values handled as specified." " This function aims to be a fast reader for simply formatted files. The `genfromtxt` function provides more sophisticated handling of, e.g., lines with missing values."
Really I am trying to separate the usage of loadtxt and genfromtxt to avoid unnecessary duplication and confusion. Part of this is historical because loadtxt was added in 2007 and genfromtxt was added in 2009. So really certain features of loadtxt have been 'kept' for backwards compatibility purposes yet these features can be 'abused' to handle missing data. But I really consider that any missing values should cause loadtxt to fail.
OK, I was not aware of the design issues of loadtxt vs. genfromtxt - you could probably say also for historical reasons, since I have not used genfromtxt much so far. Anyway, the docstring statement "Converters can also be used to provide a default value for missing data:" then appears quite misleading, or an invitation to abuse, if you will. It had better be removed from the documentation then, or users explicitly discouraged from using converters instead of genfromtxt (I don't see how you could completely prevent converters being used in this way).
The patch is incorrect because it should not include a space in the split() as indicated in the comment by the original reporter. Of
The split('\r\n') alone caused test_dtype_with_object(self) to fail, probably because it relies on stripping the blanks. But maybe the test is ill-formed?
course a corrected patch alone still is not sufficient to address the problem without the user providing the correct converter. Also you start to run into problems with multiple delimiters (such as one space versus two spaces) so you start down the path to add all the features that duplicate genfromtxt.
Given that genfromtxt provides that functionality more conveniently, I agree again that users should be encouraged to use it instead of converters. But the tab problem in fact causes an issue not related to missing values at all (well, depending on what you call a missing value). I am describing an example on the ticket. Cheers, Derek
On 03/31/2011 12:02 PM, Derek Homeier wrote:
[snip]
I am really not disagreeing that much with you. Rather, as you have shown, it is very easy to increase the complexity of examples that loadtxt does not handle. By missing value I mean "when no data value is stored for the variable in the current observation" (via Wikipedia), since encoded missing values (such as '.', 'NA' and 'NaN') can be recovered.
Bruce
On 03/31/2011 12:02 PM, Derek Homeier wrote:
[snip]
Okay I see that 1071 got closed, which I am fine with.
I think that your following example should be a test, because the two spaces should not be removed with a tab delimiter:

    np.loadtxt(StringIO("aa\tbb\n \t \ncc\t"), delimiter='\t',
               dtype=np.dtype([('label', 'S4'), ('comment', 'S4')]))

Thanks very much for fixing this! Bruce
On Mon, Apr 4, 2011 at 9:59 AM, Bruce Southey <bsouthey@gmail.com> wrote:
[snip]
I think that your following example should be a test, because the two spaces should not be removed with a tab delimiter:

    np.loadtxt(StringIO("aa\tbb\n \t \ncc\t"), delimiter='\t',
               dtype=np.dtype([('label', 'S4'), ('comment', 'S4')]))
Make a test and we'll put it in. Chuck
I know! Trying to write one made me realize that loadtxt is not handling string arrays correctly. So I have to check more on this, as I think loadtxt is giving a 1-d array instead of a 2-d array.

I do agree with you, Pierre, but this is a nice corner case that Derek raised, where a space does not necessarily mean a missing value when there is a tab delimiter:

```python
data = StringIO("aa\tbb\n \t \ncc\tdd")
dt = np.dtype([('label', 'S2'), ('comment', 'S2')])
test = np.loadtxt(data, delimiter="\t", dtype=dt)
# structured-array rows are built from tuples, not nested lists
control = np.array([('aa', 'bb'), (' ', ' '), ('cc', 'dd')], dtype=dt)
```

So 'test' and 'control' should give the same array. Bruce
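For reference, a minimal sketch of what such a test could look like - an illustration along the lines Bruce proposes, not the test that was eventually committed, and with a made-up test name:

```python
import numpy as np
from io import StringIO  # Python 2 at the time used StringIO.StringIO
from numpy.testing import assert_array_equal

def test_tab_delimiter_keeps_blank_fields():
    # the single-space fields must survive when splitting on '\t'
    # rather than on generic whitespace
    data = StringIO("aa\tbb\n \t \ncc\tdd")
    dt = np.dtype([('label', 'S2'), ('comment', 'S2')])
    test = np.loadtxt(data, delimiter="\t", dtype=dt)
    control = np.array([('aa', 'bb'), (' ', ' '), ('cc', 'dd')], dtype=dt)
    assert_array_equal(test, control)
```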
![](https://secure.gravatar.com/avatar/96dd777e397ab128fedab46af97a3a4a.jpg?s=120&d=mm&r=g)
Tests often have that side effect. Chuck
![](https://secure.gravatar.com/avatar/38d5ac232150013cbf1a4639538204c0.jpg?s=120&d=mm&r=g)
Okay, my confusion aside (sorry for that), I added ticket 1794 with a possible test that should work with ticket 1071: http://projects.scipy.org/numpy/ticket/1794 Bruce
![](https://secure.gravatar.com/avatar/56b215661867f3b4f4a3b28077de66b3.jpg?s=120&d=mm&r=g)
Y'all, just a quick word before disappearing again: genfromtxt was designed to support files that don't have a well-defined structure (e.g., the number and format of the columns is unknown, some entries may be missing, and so forth). It was initially just an improved rec2csv from matplotlib back then, which eventually got rewritten a lot. On the other hand, loadtxt was designed to support the simpler case of well-structured files (you know the number of columns beforehand, you know their type...). It's simpler, but also far faster than genfromtxt, and it should stay like that.

Missing values (and I'm talking about empty "cells") should not be covered by loadtxt; that's a job for genfromtxt: just raise an exception asking the user to try with another function.

Anyhow, sorry for not being responsive, I don't have much time to work on numpy these days (which saddens me but pays my bills), but don't hesitate to contact me if there's a real issue somewhere. P.
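The division of labor Pierre describes can be illustrated with a small example (a sketch of the intended behavior, not code from the thread):

```python
import numpy as np
from io import StringIO

clean = "1,2,3\n4,5,6"
holes = "1,,3\n4,5,6"   # an empty "cell" in the first row

np.loadtxt(StringIO(clean), delimiter=",")     # fine: well-structured input
np.genfromtxt(StringIO(holes), delimiter=",")  # fills the empty cell with nan
# np.loadtxt(StringIO(holes), delimiter=",")   # raises ValueError, which is
#                                              # what Pierre argues it should do
```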
![](https://secure.gravatar.com/avatar/7431612bd9d490e3497f94740b7da0db.jpg?s=120&d=mm&r=g)
Hi Paul, I've got back to your suggestion re. the ndmin flag for loadtxt from a few weeks ago... On 27.03.2011, at 12:09PM, Paul Anton Letnes wrote:
1562: I attach a possible patch. This could also be the default behavior to my mind, since the function caller can simply call numpy.squeeze if needed. Changing default behavior would probably break old code, however.
See comments on Trac as well.
Your patch is better, but there is one thing I disagree with.

```
808         if X.ndim < ndmin:
809             if ndmin == 1:
810                 X.shape = (X.size, )
811             elif ndmin == 2:
812                 X.shape = (X.size, 1)
```

The last line should be:

```
812                 X.shape = (1, X.size)
```

If someone wants a 2D array out, they would most likely expect a one-row file to come out as a one-row array, not the other way around. IMHO.
I think you are completely right for the test case with one row. More generally though, since a file of N rows and M columns is read into an array of shape (N, M), ndmin=2 should enforce X.shape = (1, X.size) for single-row input, and X.shape = (X.size, 1) for single-column input. I thought this would be handled automatically by preserving the original 2 dimensions, but apparently with single-row/multi-column input an extra dimension 1 is prepended when the array is returned from the parser. I've put up a fix for this at https://github.com/dhomeier/numpy/compare/master...ndmin-cols and also tested the patch against 1.6.0.rc2. Cheers, Derek
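For readers following along, the orientation rule Derek describes boils down to this (a sketch of the intended behavior with the fix applied, not output from the thread):

```python
import numpy as np
from io import StringIO

# one row, three columns should keep row orientation;
# three rows, one column should keep column orientation
row = np.loadtxt(StringIO("1 2 3"), ndmin=2)    # expected shape (1, 3)
col = np.loadtxt(StringIO("1\n2\n3"), ndmin=2)  # expected shape (3, 1)
```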
![](https://secure.gravatar.com/avatar/d98bd91ed2cea43056594b7ce5369b17.jpg?s=120&d=mm&r=g)
On 4. mai 2011, at 17.34, Derek Homeier wrote:

I've put up a fix for this at https://github.com/dhomeier/numpy/compare/master...ndmin-cols and also tested the patch against 1.6.0.rc2.
Looks sensible to me at least! But: aren't the numpy.atleast_2d and numpy.atleast_1d functions written for this? Shouldn't we reuse them? Perhaps it's overkill, and perhaps it will reintroduce the 'transposed' problem? Paul
![](https://secure.gravatar.com/avatar/7431612bd9d490e3497f94740b7da0db.jpg?s=120&d=mm&r=g)
On 05.05.2011, at 2:40AM, Paul Anton Letnes wrote:
But: Isn't the numpy.atleast_2d and numpy.atleast_1d functions written for this? Shouldn't we reuse them? Perhaps it's overkill, and perhaps it will reintroduce the 'transposed' problem?
Yes, good point, one could replace the X.shape = (X.size, ) with X = np.atleast_1d(X), but for the ndmin=2 case, we'd need to replace X.shape = (X.size, 1) with X = np.atleast_2d(X).T - not sure which solution is more efficient in terms of memory access etc... Cheers, Derek
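Spelled out, the substitution Derek is weighing looks like this (a sketch, not the committed code):

```python
import numpy as np

X = np.array([1., 2., 3.])  # e.g. a single-column file parsed to 1-D

# ndmin == 1:  X.shape = (X.size, )   becomes
X1 = np.atleast_1d(X)       # shape (3,), a no-op for 1-D input

# ndmin == 2:  X.shape = (X.size, 1)  becomes
X2 = np.atleast_2d(X).T     # atleast_2d gives (1, 3); .T restores (3, 1)
```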
![](https://secure.gravatar.com/avatar/09939f25b639512a537ce2c90f77f958.jpg?s=120&d=mm&r=g)
I can confirm that the current behavior is not sufficient for all of the original corner cases that ndmin was supposed to address. Keep in mind that np.loadtxt takes a one-column data file and a one-row data file down to the same shape. I don't see how the current code is able to produce the correct array shape when ndmin=2. Do we have some sort of counter in loadtxt for counting the number of rows and columns read? Could we use those to help guide the ndmin=2 case?

I think that using atleast_1d(X) might be a bit overkill, but it would be very clear as to the code's intent. I don't think we have to worry about memory usage if we limit its use to only situations where ndmin is greater than the number of dimensions of the array. In those cases, the array is either an empty result, a scalar value (in which memory access is trivial), or 1-d (in which a transpose is cheap).

Ben Root
![](https://secure.gravatar.com/avatar/d98bd91ed2cea43056594b7ce5369b17.jpg?s=120&d=mm&r=g)
What if one does things the other way around - avoid calling squeeze until _after_ doing the atleast_Nd() magic? That way the row/column information should be conserved, right? Also, we avoid transposing, memory use, ...

Oh, and someone could conceivably have a _looong_ 1D file, but would want it read as a 2D array.

Paul
![](https://secure.gravatar.com/avatar/09939f25b639512a537ce2c90f77f958.jpg?s=120&d=mm&r=g)
@Derek, good catch with noticing the error in the tests. We do still need to handle the case I mentioned, however. I have attached an example script to demonstrate the issue. In this script, I would expect the second-to-last array to be a shape of (1, 5). I believe that the single-row, multi-column case would actually be the more common type of edge-case encountered by users than the others. Therefore, I believe that this ndmin fix is not adequate until this is addressed.

@Paul, we can't call squeeze after doing the atleast_Nd() magic. That would just undo whatever we had just done. Also, wrt the transpose, a (1, 100000) array looks the same in memory as a (100000, 1) array, right?

Ben Root
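Ben's attached script is not reproduced in the archive; the corner case he describes can, however, be sketched in a few lines (expected shapes per the discussion, not verified output from his script):

```python
import numpy as np
from io import StringIO

# without ndmin, a one-row and a one-column file collapse to the same shape
print(np.loadtxt(StringIO("1 2 3 4 5")).shape)      # (5,)
print(np.loadtxt(StringIO("1\n2\n3\n4\n5")).shape)  # (5,)
# with ndmin=2 they should come back as (1, 5) and (5, 1), respectively
```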
![](https://secure.gravatar.com/avatar/09939f25b639512a537ce2c90f77f958.jpg?s=120&d=mm&r=g)
Apologies Derek, your patch does address the issue I raised. Ben Root
![](https://secure.gravatar.com/avatar/d98bd91ed2cea43056594b7ce5369b17.jpg?s=120&d=mm&r=g)
On 5. mai 2011, at 08.49, Benjamin Root wrote:
@Paul, we can't call squeeze after doing the atleast_Nd() magic. That would just undo whatever we had just done. Also, wrt the transpose, a (1, 100000) array looks the same in memory as a (100000, 1) array, right?
Agree. I thought more along the lines of (pseudocode-ish):

```
if ndmin == 0:
    squeeze()
if ndmin == 1:
    atleast_1D()
elif ndmin == 2:
    atleast_2D()
else:
    # I don't rightly know what would go here, maybe raise ValueError?
```

That would avoid the squeeze call before the atleast_Nd magic. But the code was changed, so I think my comment doesn't make sense anymore. It's probably fine the way it is! Paul
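Rendered as runnable code, Paul's sketch would read roughly as follows (a lightly regularized translation with a made-up function name; Ben's reply below explains why this alone is not sufficient):

```python
import numpy as np

def finish_array(X, ndmin=0):
    # hypothetical post-processing step at the end of loadtxt
    if ndmin == 0:
        X = np.squeeze(X)
    elif ndmin == 1:
        X = np.atleast_1d(X)
    elif ndmin == 2:
        X = np.atleast_2d(X)
    else:
        raise ValueError("ndmin must be 0, 1 or 2")
    return X
```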
![](https://secure.gravatar.com/avatar/09939f25b639512a537ce2c90f77f958.jpg?s=120&d=mm&r=g)
I have thought of that too, but the problem with that approach is that after reading the file, X will have 2 or 3 dimensions, regardless of how many singleton dims were in the file. A squeeze will always be needed. Also, the purpose of squeeze is opposite that of the atleast_*d() functions: squeeze reduces dimensions, while atleast_*d will add dimensions.

Therefore, I re-iterate... the patch by Derek gets the job done. I have tested it for a wide variety of inputs for both regular arrays and record arrays. Is there room for improvements? Yes, but I think that can wait for later. Derek's patch however fixes an important bug in the ndmin implementation and should be included for the release.

Ben Root
![](https://secure.gravatar.com/avatar/3d3176cf99cae23d0ac119d1ea6c4d11.jpg?s=120&d=mm&r=g)
Two questions: can you point me to the patch/ticket, and is this a regression? Thanks, Ralf
![](https://secure.gravatar.com/avatar/09939f25b639512a537ce2c90f77f958.jpg?s=120&d=mm&r=g)
I don't know if he did a pull-request or not, but here is the link he provided earlier in the thread: https://github.com/dhomeier/numpy/compare/master...ndmin-cols

Technically, this is not a "regression", as the ndmin feature is new in this release. However, the problem that ndmin is supposed to address is not fixed by the current implementation for the rc. Essentially, a single-row, multi-column file with ndmin=2 comes out as an Nx1 array, which is the same result as for a multi-row, single-column file. My feeling is that if we let the current implementation stand as is, and developers use it in their code, then fixing it in a later release would introduce more problems (maybe the devels would transpose the result themselves or something). Better to fix it now in the rc with the two lines of code (and the correction to the tests) than to introduce a buggy feature that will be hard to fix in future releases, IMHO.

Ben Root
![](https://secure.gravatar.com/avatar/3d3176cf99cae23d0ac119d1ea6c4d11.jpg?s=120&d=mm&r=g)
Yes right, I forgot this was a recent change.
Looks okay, and I agree that it's better to fix it now. The timing is a bit unfortunate though, just after RC2. I'll have closer look tomorrow and if it can go in, probably tag RC3.
If in the meantime a few more people could test this, that would be helpful. Ralf
![](https://secure.gravatar.com/avatar/7431612bd9d490e3497f94740b7da0db.jpg?s=120&d=mm&r=g)
Thanks for the additional tests and discussion, Paul and Ben! Yes, I think the main issue is that multi-column input first comes back as 3-dimensional, with a "singleton" dimension in front, so I was getting rid of that first - maybe I could add a comment on that. Single-column, multi-row input apparently already comes back as 1-dimensional, so we only need the expansion there.
Two questions: can you point me to the patch/ticket, and is this a regression?
Thanks, Ralf
This was ticket #1562; it had been closed as "resolved" in the meantime, though.
Looks okay, and I agree that it's better to fix it now. The timing is a bit unfortunate though, just after RC2. I'll have closer look tomorrow and if it can go in, probably tag RC3.
If in the meantime a few more people could test this, that would be helpful.
Ralf
I agree, wish I had time to push this before rc2. I could add the explanatory comments mentioned above and switch to use the atleast_[12]d() solution, test that and push it in a couple of minutes, or should I better leave it as is now for testing? Cheers, Derek
![](https://secure.gravatar.com/avatar/7431612bd9d490e3497f94740b7da0db.jpg?s=120&d=mm&r=g)
Quick follow-up: I just applied the above changes, added some tests to cover Ben's test cases and tested this with 1.6.0rc2 on OS X 10.5 i386+ppc + 10.6 x86_64 (Python2.7+3.2). So I'd be ready to push it to my repo and do my (first) pull request... Cheers, Derek
![](https://secure.gravatar.com/avatar/3d3176cf99cae23d0ac119d1ea6c4d11.jpg?s=120&d=mm&r=g)
Go ahead, I'll have a look at it tonight. Thanks for testing on several Pythons, that definitely helps. Ralf
![](https://secure.gravatar.com/avatar/7431612bd9d490e3497f94740b7da0db.jpg?s=120&d=mm&r=g)
Done, the request only appears on my repo https://github.com/dhomeier/numpy/ - is that correct? If someone could test it on Linux and Windows as well... Cheers, Derek
On Fri, May 6, 2011 at 12:57 PM, Derek Homeier <derek@astro.physik.uni-goettingen.de> wrote:
Done, the request only appears on my repo https://github.com/dhomeier/numpy/ - is that correct? If someone could test it on Linux and Windows as well...
Committed, thanks for all the work. The pull request ended up in the wrong place; that's a minor flaw in the GitHub UI. After you press "Pull Request" you need to read the small print to see where it's going. Cheers, Ralf
Ralf Gommers-2 wrote:
Committed, thanks for all the work. [...]
Dear all, I've downloaded the latest numpy (1.6.0), and loadtxt has the ndmin option; however, neither genfromtxt nor recfromtxt, which use loadtxt, have it. Should they have inherited the option? Who can make it happen? Best, Chris
Hi Chris, On 31 May 2011, at 13:56, cgraves wrote:
[...] neither genfromtxt nor recfromtxt, which use loadtxt, have [the ndmin option]. Should they have inherited the option? Who can make it happen?
you are mistaken: genfromtxt is not using loadtxt (and could not, since it needs its more complex parser to handle missing data), so ndmin could not be inherited automatically. It would certainly make sense to provide the same functionality for genfromtxt (from which it would then be inherited by [nd,ma,rec]fromtxt), so I'd go ahead and file a feature (enhancement) request. I can't promise I can take care of it myself, as I am less familiar with genfromtxt, but I'd certainly have a look at it. Does anyone have an opinion on whether this is a case for reopening (yet again...) http://projects.scipy.org/numpy/ticket/1562, or for creating a new ticket? Cheers, Derek
On May 31, 2011, at 4:53 PM, Derek Homeier wrote:
[...] I can't promise I can take care of it myself, as I am less familiar with genfromtxt, but I'd certainly have a look at it.
Oh, that shouldn't be too difficult: 'ndmin' tells whether the array must be squeezed before being returned, right? You can add a check at the very end of genfromtxt to decide what to do with the output (whether to squeeze it or not, whether to transpose it or not)... If you don't mind doing it, I'd be quite grateful (I don't have time to work on numpy these days, much to my regret). Don't forget to change the user manual as well...
On 05/31/2011 10:18 AM, Pierre GM wrote:
[...] You can add a check at the very end of genfromtxt to decide what to do with the output (whether to squeeze it or not, whether to transpose it or not)...
(Different function, so different ticket.)
Sure, you can change the end of the code, but that may hide various problems. Unlike loadtxt, genfromtxt has a lot of flexibility, especially in handling missing values and using converter functions. So I think some examples must be provided that cannot be handled by providing a suitable converter, or that require multiple assumptions about the input file (such as having more than one delimiter). Bruce
On Tue, May 31, 2011 at 10:33 AM, Bruce Southey <bsouthey@gmail.com> wrote:
[...] So I think some examples must be provided that cannot be handled by providing a suitable converter, or that require multiple assumptions about the input file (such as having more than one delimiter).
At this point, I wonder if it might be smarter to create an atleast_Nd() function and use that everywhere it is needed. Having similar logic tailored to each loading function might be a little dangerous if bug fixes are made to one but not the others. Ben Root
On 31 May 2011, at 17:45, Benjamin Root wrote:
At this point, I wonder if it might be smarter to create an atleast_Nd() function and use that everywhere it is needed. [...]
Like a generalised version of .atleast_1d / .atleast_2d? It would also have to include an .atmost_Nd functionality of some sort, to replace the .squeeze() - generally a good idea (e.g. something like np.atleast_Nd(X, ndmin=0, ndmax=-1), where the default is not to reduce the maximum number of dimensions...). But for the io routines the situation is a bit more complex, since different shapes are expected to be returned depending on the text input (e.g. (1, N) for a single row vs. (N, 1) for a single column of data). Cheers, Derek
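A rough sketch of what such a generalised helper could look like (hypothetical name and signature, not an existing NumPy function):

```python
import numpy as np

def atleast_nd(X, ndmin=0, ndmax=-1):
    """Pad X with leading unit axes up to ndmin dimensions; if
    ndmax >= 0, squeeze unit axes away until at most ndmax remain."""
    X = np.asanyarray(X)
    while X.ndim < ndmin:
        X = X[np.newaxis, ...]
    if ndmax >= 0:
        # drop size-1 axes one at a time until within the limit
        while X.ndim > ndmax and 1 in X.shape:
            X = X.squeeze(axis=X.shape.index(1))
    return X
```

As noted above, this alone would not settle the io case, since whether a 1-D result should be promoted to (1, N) or (N, 1) depends on the shape of the text input.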
On 31 May 2011, at 17:33, Bruce Southey wrote:
Sure, you can change the end of the code, but that may hide various problems. [...] some examples must be provided that cannot be handled by providing a suitable converter, or that require multiple assumptions about the input file (such as having more than one delimiter).
I think stuff like multiple delimiters should have been dealt with before; the right place to insert the ndmin code (which includes the decision to squeeze or not to squeeze, as well as to add additional dimensions if required) would be right at the end, before the 'unpack' switch - or rather, replacing the bit:

```python
if usemask:
    output = output.view(MaskedArray)
    output._mask = outputmask
if unpack:
    return output.squeeze().T
return output.squeeze()
```

But there it's already not clear to me how to deal with the MaskedArray case... Cheers, Derek
On May 31, 2011, at 5:52 PM, Derek Homeier wrote:
[...] But there it's already not clear to me how to deal with the MaskedArray case...
Oh, easy. You need to replace only the last three lines of genfromtxt with the ones from loadtxt (lines 808-833). Then, if usemask is True, you need to use ma.atleast_Xd instead of np.atleast_Xd. Et voilà. Comments:
* I would raise an exception if ndmin isn't correct *before* trying to read the file...
* You could define a `collapse_function` that would be `np.atleast_1d`, `np.atleast_2d`, `ma.atleast_1d`... depending on the values of `usemask` and `ndmin`...
If you have any questions about numpy.ma, don't hesitate to contact me directly.
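Putting those suggestions together, the tail of genfromtxt could look roughly like this (a sketch only; `_finalize` is a hypothetical stand-in for the end of the function, and the ndmin validation would of course sit at the top, before the file is parsed):

```python
import numpy as np
import numpy.ma as ma

def _finalize(output, outputmask, usemask=False, unpack=False, ndmin=0):
    if ndmin not in (0, 1, 2):   # should really be checked before parsing
        raise ValueError("Illegal value of ndmin keyword: %s" % ndmin)
    if usemask:
        output = output.view(ma.MaskedArray)
        output._mask = outputmask
    # the collapse_function idea: pick the masked or plain variant
    atleast_2d = ma.atleast_2d if usemask else np.atleast_2d
    if output.ndim > ndmin:
        output = output.squeeze()
    if ndmin == 2 and output.ndim < 2:
        output = atleast_2d(output)  # a single column would still need a .T
    if unpack:
        return output.T
    return output
```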
On 31 May 2011, at 18:25, Pierre GM wrote:
Oh, easy. You need to replace only the last three lines of genfromtxt with the ones from loadtxt (lines 808-833). Then, if usemask is True, you need to use ma.atleast_Xd instead of np.atleast_Xd. [...] If you have any questions about numpy.ma, don't hesitate to contact me directly.
Thanks for the directions! I was not sure about the usemask case because it presently does not invoke .squeeze() either... On a possibly related note, genfromtxt also treats the 'unpack'ing of structured arrays differently from loadtxt (which returns a list of arrays in that case) - do you know if this is on purpose, or rather missing functionality (I guess it might break recfromtxt()...)? Cheers, Derek
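For reference, the loadtxt behaviour Derek refers to, sketched with hypothetical data (unpack=True on a structured dtype hands back one array per field):

```python
from io import StringIO
import numpy as np

dt = np.dtype([("x", float), ("name", "U10")])
x, name = np.loadtxt(StringIO("1.0 a\n2.0 b"), dtype=dt, unpack=True)
# x    -> array([1., 2.])
# name -> array(['a', 'b'], dtype='<U10')
```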
On May 31, 2011, at 6:37 PM, Derek Homeier wrote:
Thanks for the directions! I was not sure about the usemask case because it presently does not invoke .squeeze() either...
The idea is that if `usemask` is True, you build a second array (the mask) that you attach to your main array at the very end (in the `output = output.view(MaskedArray); output._mask = mask` combo...). Afterwards, it's a regular MaskedArray that supports the .squeeze() method...
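A tiny illustration of that combo (hypothetical data):

```python
import numpy as np
import numpy.ma as ma

output = np.array([1.0, 2.0, 3.0])
outputmask = np.array([False, True, False])

output = output.view(ma.MaskedArray)  # attach the mask at the very end
output._mask = outputmask
print(output)            # [1.0 -- 3.0]
print(output.squeeze())  # a regular MaskedArray; .squeeze() just works
```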
On a possibly related note, genfromtxt also treats the 'unpack'ing of structured arrays differently from loadtxt [...] - do you know if this is on purpose, or rather missing functionality?
Keep in mind that I haven't touched genfromtxt in 8-10 months or so; I wouldn't be surprised if it were lagging a bit behind loadtxt in terms of development. Yes, there'll be some tweaking to do for recfromtxt (it's OK for now if `ndmin` and `unpack` are the defaults) and others, but nothing major.
On 31 May 2011, at 21:28, Pierre GM wrote:
[...] You could define a `collapse_function` that would be `np.atleast_1d`, `np.atleast_2d`, `ma.atleast_1d`... depending on the values of `usemask` and `ndmin`...
Thanks, that helped to clean up a little bit.
The idea is that if `usemask` is True, you build a second array (the mask) that you attach to your main array at the very end... Afterwards, it's a regular MaskedArray that supports the .squeeze() method...
OK, in both cases output.squeeze() is now used if ndim > ndmin and usemask is False - at least it does not break any tests, so it seems to work with MaskedArrays as well.
Yes, there'll be some tweaking to do for recfromtxt (it's OK for now if `ndmin` and `unpack` are the defaults) and others, but nothing major.
Well, at long last I got to implement the above and added the corresponding tests for genfromtxt - with the exception of the dimension-0 cases, since genfromtxt raises an error on empty files. There already is a comment in the code that it should perhaps return an empty array instead, so I am putting that idea up for discussion here again. I tried to devise a very basic test with masked arrays, just added to test_withmissing now. I also implemented the same unpacking behaviour for structured arrays, and made recfromtxt set unpack=False so that it keeps working (or should it issue a warning?). The patches are up for review as commit 8ac01636 in my iocnv-wildcard branch: https://github.com/dhomeier/numpy/compare/master...iocnv-wildcard Cheers, Derek
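Assuming the branch is merged, usage would presumably mirror loadtxt (hypothetical example; ndmin was not part of genfromtxt's released signature at the time):

```python
from io import StringIO
import numpy as np

# a single-row file would keep its 2-D shape instead of collapsing to 1-D
arr = np.genfromtxt(StringIO("1 2 3"), ndmin=2)
# expected: arr.shape == (1, 3), matching np.loadtxt's behaviour
```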
Derek Homeier wrote:
[...] Does anyone have an opinion on whether this is a case for reopening (yet again...) http://projects.scipy.org/numpy/ticket/1562, or for creating a new ticket?
Thanks Derek. That would be greatly appreciated! Based on the follow-up messages in this thread, it looks like (hopefully) there will not be too much additional work in implementing it. For now I'll just use the temporary fix, a .reshape(-1), on any recfromtxt calls that might read in a single row of data. Kind regards, Chris
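Chris's stop-gap, sketched with hypothetical data: forcing a 1-D record array even when the file holds only a single row:

```python
from io import StringIO
import numpy as np

rec = np.recfromtxt(StringIO("1 a"), dtype=[("n", int), ("s", "U5")])
rec = rec.reshape(-1)  # a single-row file otherwise yields a 0-d array
print(rec[0])          # now safe to index
```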
participants (10)
- Benjamin Root
- Bruce Southey
- cgraves
- Charles R Harris
- Derek Homeier
- Paul Anton Letnes
- Pauli Virtanen
- Pierre GM
- Ralf Gommers
- Stéfan van der Walt