[Numpy-discussion] loadtxt ndmin option

Ralf Gommers ralf.gommers at googlemail.com
Thu May 5 16:31:25 EDT 2011


On Thu, May 5, 2011 at 9:46 PM, Benjamin Root <ben.root at ou.edu> wrote:

>
>
> On Thu, May 5, 2011 at 2:33 PM, Ralf Gommers <ralf.gommers at googlemail.com>wrote:
>
>>
>>
>> On Thu, May 5, 2011 at 9:18 PM, Benjamin Root <ben.root at ou.edu> wrote:
>>
>>>
>>>
>>> On Thu, May 5, 2011 at 1:08 PM, Paul Anton Letnes <
>>> paul.anton.letnes at gmail.com> wrote:
>>>
>>>>
>>>> On 5. mai 2011, at 08.49, Benjamin Root wrote:
>>>>
>>>> >
>>>> >
>>>> > On Wed, May 4, 2011 at 11:08 PM, Paul Anton Letnes <
>>>> paul.anton.letnes at gmail.com> wrote:
>>>> >
>>>> > On 4. mai 2011, at 20.33, Benjamin Root wrote:
>>>> >
>>>> > > On Wed, May 4, 2011 at 7:54 PM, Derek Homeier <
>>>> derek at astro.physik.uni-goettingen.de> wrote:
>>>> > > On 05.05.2011, at 2:40AM, Paul Anton Letnes wrote:
>>>> > >
>>>> > > > But: Isn't the numpy.atleast_2d and numpy.atleast_1d functions
>>>> written for this? Shouldn't we reuse them? Perhaps it's overkill, and
>>>> perhaps it will reintroduce the 'transposed' problem?
>>>> > >
>>>> > > Yes, good point, one could replace the
>>>> > > X.shape = (X.size, ) with X = np.atleast_1d(X),
>>>> > > but for the ndmin=2 case, we'd need to replace
>>>> > > X.shape = (X.size, 1) with X = np.atleast_2d(X).T -
>>>> > > not sure which solution is more efficient in terms of memory access
>>>> etc...
>>>> > >
>>>> > > Cheers,
>>>> > >                                                Derek
>>>> > >
>>>> > >
>>>> > > I can confirm that the current behavior is not sufficient for all of
>>>> the original corner cases that ndmin was supposed to address.  Keep in mind
>>>> that np.loadtxt takes a one-column data file and a one-row data file down to
>>>> the same shape.  I don't see how the current code is able to produce the
>>>> correct array shape when ndmin=2.  Do we have some sort of counter in
>>>> loadtxt for counting the number of rows and columns read?  Could we use
>>>> those to help guide the ndmin=2 case?
>>>> > >
>>>> > > I think that using atleast_1d(X) might be a bit overkill, but it
>>>> would be very clear as to the code's intent.  I don't think we have to worry
>>>> about memory usage if we limit its use to only situations where ndmin is
>>>> greater than the number of dimensions of the array.  In those cases, the
>>>> array is either an empty result, a scalar value (in which memory access is
>>>> trivial), or 1-d (in which a transpose is cheap).
>>>> >
>>>> > What if one does things the other way around - avoid calling squeeze
>>>> until _after_ doing the atleast_Nd() magic? That way the row/column
>>>> information should be conserved, right? Also, we avoid transposing, memory
>>>> use, ...
>>>> >
>>>> > Oh, and someone could conceivably have a _looong_ 1D file, but would
>>>> want it read as a 2D array.
>>>> >
>>>> > Paul
>>>> >
>>>> >
>>>> >
>>>> > @Derek, good catch with noticing the error in the tests. We do still
>>>> need to handle the case I mentioned, however.  I have attached an example
>>>> script to demonstrate the issue.  In this script, I would expect the
>>>> second-to-last array to be a shape of (1, 5).  I believe that the
>>>> single-row, multi-column case would actually be the more common type of
>>>> edge-case encountered by users than the others.  Therefore, I believe that
>>>> this ndmin fix is not adequate until this is addressed.
>>>> >
>>>> > @Paul, we can't call squeeze after doing the atleast_Nd() magic.  That
>>>> would just undo whatever we had just done.  Also, wrt the transpose, a (1,
>>>> 100000) array looks the same in memory as a (100000, 1) array, right?
>>>> Agree. I thought more along the lines of (pseudocode-ish)
>>>> if ndmin == 0:
>>>>        squeeze()
>>>> if ndmin == 1:
>>>>        atleast_1D()
>>>> elif ndmin == 2:
>>>>        atleast_2D()
>>>> else:
>>>>        I don't rightly know what would go here, maybe raise ValueError?
>>>>
>>>> That would avoid the squeeze call before the atleast_Nd magic. But the
>>>> code was changed, so I think my comment doesn't make sense anymore. It's
>>>> probably fine the way it is!
>>>>
>>>> Paul
>>>>
>>>>
>>> I have thought of that too, but the problem with that approach is that
>>> after reading the file, X will have 2 or 3 dimensions, regardless of how
>>> many singleton dims were in the file.  A squeeze will always be needed.
>>> Also, the purpose of squeeze is opposite that of the atleast_*d()
>>> functions:  squeeze reduces dimensions, while atleast_*d will add
>>> dimensions.
>>>
>>> Therefore, I re-iterate... the patch by Derek gets the job done.  I have
>>> tested it for a wide variety of inputs for both regular arrays and record
>>> arrays.  Is there room for improvements?  Yes, but I think that can wait for
>>> later.  Derek's patch however fixes an important bug in the ndmin
>>> implementation and should be included for the release.
>>>
>>> Two questions: can you point me to the patch/ticket, and is this a
>> regression?
>>
>> Thanks,
>> Ralf
>>
>>
>>
> I don't know if he did a pull-request or not, but here is the link he
> provided earlier in the thread.
>
> https://github.com/dhomeier/numpy/compare/master...ndmin-cols
>
> Technically, this is not a "regression" as the ndmin feature is new in this
> release.
>

Yes right, I forgot this was a recent change.


> However, the problem that ndmin is supposed to address is not fixed by the
> current implementation for the rc.  Essentially, a single-row, multi-column
> file with ndmin=2 comes out as a Nx1 array which is the same result for a
> multi-row, single-column file.  My feeling is that if we let the current
> implementation stand as is, and developers use it in their code, then fixing
> it in a later release would introduce more problems (maybe the devels would
> transpose the result themselves or something).  Better to fix it now in rc
> with the two lines of code (and the correction to the tests), then to
> introduce a buggy feature that will be hard to fix in future releases, IMHO.
>
> Looks okay, and I agree that it's better to fix it now. The timing is a bit
unfortunate though, just after RC2. I'll have closer look tomorrow and if it
can go in, probably tag RC3.

If in the meantime a few more people could test this, that would be helpful.

Ralf
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20110505/a5c7d0f6/attachment.html>


More information about the NumPy-Discussion mailing list