On Thu, May 5, 2011 at 2:33 PM, Ralf Gommers
On Thu, May 5, 2011 at 9:18 PM, Benjamin Root
wrote: On Thu, May 5, 2011 at 1:08 PM, Paul Anton Letnes < paul.anton.letnes@gmail.com> wrote:
On 5. mai 2011, at 08.49, Benjamin Root wrote:
On Wed, May 4, 2011 at 11:08 PM, Paul Anton Letnes <
paul.anton.letnes@gmail.com> wrote:
On 4. mai 2011, at 20.33, Benjamin Root wrote:
On Wed, May 4, 2011 at 7:54 PM, Derek Homeier <
On 05.05.2011, at 2:40AM, Paul Anton Letnes wrote:
But: Isn't the numpy.atleast_2d and numpy.atleast_1d functions written for this? Shouldn't we reuse them? Perhaps it's overkill, and
Yes, good point, one could replace the X.shape = (X.size, ) with X = np.atleast_1d(X), but for the ndmin=2 case, we'd need to replace X.shape = (X.size, 1) with X = np.atleast_2d(X).T - not sure which solution is more efficient in terms of memory access
etc...
Cheers, Derek
I can confirm that the current behavior is not sufficient for all of
I think that using atleast_1d(X) might be a bit overkill, but it
would be very clear as to the code's intent. I don't think we have to worry about memory usage if we limit its use to only situations where ndmin is greater than the number of dimensions of the array. In those cases, the array is either an empty result, a scalar value (in which memory access is
derek@astro.physik.uni-goettingen.de> wrote: perhaps it will reintroduce the 'transposed' problem? the original corner cases that ndmin was supposed to address. Keep in mind that np.loadtxt takes a one-column data file and a one-row data file down to the same shape. I don't see how the current code is able to produce the correct array shape when ndmin=2. Do we have some sort of counter in loadtxt for counting the number of rows and columns read? Could we use those to help guide the ndmin=2 case? trivial), or 1-d (in which a transpose is cheap).
What if one does things the other way around - avoid calling squeeze
until _after_ doing the atleast_Nd() magic? That way the row/column information should be conserved, right? Also, we avoid transposing, memory use, ...
Oh, and someone could conceivably have a _looong_ 1D file, but would
want it read as a 2D array.
Paul
@Derek, good catch with noticing the error in the tests. We do still
need to handle the case I mentioned, however. I have attached an example script to demonstrate the issue. In this script, I would expect the second-to-last array to be a shape of (1, 5). I believe that the single-row, multi-column case would actually be the more common type of edge-case encountered by users than the others. Therefore, I believe that this ndmin fix is not adequate until this is addressed.
@Paul, we can't call squeeze after doing the atleast_Nd() magic. That
would just undo whatever we had just done. Also, wrt the transpose, a (1, 100000) array looks the same in memory as a (100000, 1) array, right? Agree. I thought more along the lines of (pseudocode-ish) if ndmin == 0: squeeze() if ndmin == 1: atleast_1D() elif ndmin == 2: atleast_2D() else: I don't rightly know what would go here, maybe raise ValueError?
That would avoid the squeeze call before the atleast_Nd magic. But the code was changed, so I think my comment doesn't make sense anymore. It's probably fine the way it is!
Paul
I have thought of that too, but the problem with that approach is that after reading the file, X will have 2 or 3 dimensions, regardless of how many singleton dims were in the file. A squeeze will always be needed. Also, the purpose of squeeze is opposite that of the atleast_*d() functions: squeeze reduces dimensions, while atleast_*d will add dimensions.
Therefore, I re-iterate... the patch by Derek gets the job done. I have tested it for a wide variety of inputs for both regular arrays and record arrays. Is there room for improvements? Yes, but I think that can wait for later. Derek's patch however fixes an important bug in the ndmin implementation and should be included for the release.
Two questions: can you point me to the patch/ticket, and is this a
regression?
Thanks, Ralf
I don't know if he did a pull-request or not, but here is the link he provided earlier in the thread. https://github.com/dhomeier/numpy/compare/master...ndmin-cols Technically, this is not a "regression" as the ndmin feature is new in this release. However, the problem that ndmin is supposed to address is not fixed by the current implementation for the rc. Essentially, a single-row, multi-column file with ndmin=2 comes out as a Nx1 array which is the same result for a multi-row, single-column file. My feeling is that if we let the current implementation stand as is, and developers use it in their code, then fixing it in a later release would introduce more problems (maybe the devels would transpose the result themselves or something). Better to fix it now in rc with the two lines of code (and the correction to the tests), then to introduce a buggy feature that will be hard to fix in future releases, IMHO. Ben Root