[Numpy-discussion] loadtxt ndmin option

Benjamin Root ben.root at ou.edu
Thu May 5 15:18:34 EDT 2011


On Thu, May 5, 2011 at 1:08 PM, Paul Anton Letnes <
paul.anton.letnes at gmail.com> wrote:

>
> On 5. mai 2011, at 08.49, Benjamin Root wrote:
>
> >
> >
> > On Wed, May 4, 2011 at 11:08 PM, Paul Anton Letnes <
> paul.anton.letnes at gmail.com> wrote:
> >
> > On 4. mai 2011, at 20.33, Benjamin Root wrote:
> >
> > > On Wed, May 4, 2011 at 7:54 PM, Derek Homeier <
> derek at astro.physik.uni-goettingen.de> wrote:
> > > On 05.05.2011, at 2:40AM, Paul Anton Letnes wrote:
> > >
> > > > But: Isn't the numpy.atleast_2d and numpy.atleast_1d functions
> written for this? Shouldn't we reuse them? Perhaps it's overkill, and
> perhaps it will reintroduce the 'transposed' problem?
> > >
> > > Yes, good point, one could replace the
> > > X.shape = (X.size, ) with X = np.atleast_1d(X),
> > > but for the ndmin=2 case, we'd need to replace
> > > X.shape = (X.size, 1) with X = np.atleast_2d(X).T -
> > > not sure which solution is more efficient in terms of memory access
> etc...
> > >
> > > Cheers,
> > >                                                Derek
> > >
> > >
> > > I can confirm that the current behavior is not sufficient for all of
> the original corner cases that ndmin was supposed to address.  Keep in mind
> that np.loadtxt takes a one-column data file and a one-row data file down to
> the same shape.  I don't see how the current code is able to produce the
> correct array shape when ndmin=2.  Do we have some sort of counter in
> loadtxt for counting the number of rows and columns read?  Could we use
> those to help guide the ndmin=2 case?
> > >
> > > I think that using atleast_1d(X) might be a bit overkill, but it would
> be very clear as to the code's intent.  I don't think we have to worry about
> memory usage if we limit its use to only situations where ndmin is greater
> than the number of dimensions of the array.  In those cases, the array is
> either an empty result, a scalar value (in which memory access is trivial),
> or 1-d (in which a transpose is cheap).
> >
> > What if one does things the other way around - avoid calling squeeze
> until _after_ doing the atleast_Nd() magic? That way the row/column
> information should be conserved, right? Also, we avoid transposing, memory
> use, ...
> >
> > Oh, and someone could conceivably have a _looong_ 1D file, but would want
> it read as a 2D array.
> >
> > Paul
> >
> >
> >
> > @Derek, good catch with noticing the error in the tests. We do still need
> to handle the case I mentioned, however.  I have attached an example script
> to demonstrate the issue.  In this script, I would expect the second-to-last
> array to be a shape of (1, 5).  I believe that the single-row, multi-column
> case would actually be the more common type of edge-case encountered by
> users than the others.  Therefore, I believe that this ndmin fix is not
> adequate until this is addressed.
> >
> > @Paul, we can't call squeeze after doing the atleast_Nd() magic.  That
> would just undo whatever we had just done.  Also, wrt the transpose, a (1,
> 100000) array looks the same in memory as a (100000, 1) array, right?
> Agree. I thought more along the lines of (pseudocode-ish)
> if ndmin == 0:
>        squeeze()
> if ndmin == 1:
>        atleast_1D()
> elif ndmin == 2:
>        atleast_2D()
> else:
>        I don't rightly know what would go here, maybe raise ValueError?
>
> That would avoid the squeeze call before the atleast_Nd magic. But the code
> was changed, so I think my comment doesn't make sense anymore. It's probably
> fine the way it is!
>
> Paul
>
>
I have thought of that too, but the problem with that approach is that after
reading the file, X will have 2 or 3 dimensions, regardless of how many
singleton dims were in the file.  A squeeze will always be needed.  Also,
the purpose of squeeze is opposite that of the atleast_*d() functions:
squeeze reduces dimensions, while atleast_*d will add dimensions.

Therefore, I re-iterate... the patch by Derek gets the job done.  I have
tested it for a wide variety of inputs for both regular arrays and record
arrays.  Is there room for improvements?  Yes, but I think that can wait for
later.  Derek's patch however fixes an important bug in the ndmin
implementation and should be included for the release.

Ben Root
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20110505/8cb0ca6a/attachment.html>


More information about the NumPy-Discussion mailing list