[Numpy-discussion] loadtxt and usecols

Sebastian Berg sebastian at sipsolutions.net
Tue Nov 10 10:57:26 EST 2015


On Tue, 2015-11-10 at 10:24 -0500, Benjamin Root wrote:
> Just pointing out np.loadtxt(..., ndmin=2) will always return a 2D
> array. Notice that without that option, the result is effectively
> squeezed. So if you don't specify that option, and you load up a CSV
> file with only one row, you will get a very differently shaped array
> than if you load up a CSV file with two rows.
> 

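For reference, the squeezing Ben describes can be reproduced with a small
sketch like the one below. The in-memory "files" and variable names are made
up for illustration; only np.loadtxt and its ndmin argument come from the
discussion.

```python
import io
import numpy as np

# Two tiny whitespace-separated "files", invented for this example.
one_row = "1 2 3\n"
two_rows = "1 2 3\n4 5 6\n"

# Default behaviour: a single-row file is squeezed down to 1-D.
squeezed = np.loadtxt(io.StringIO(one_row))
# ndmin=2 keeps the row dimension, so the shape matches the two-row case.
kept_2d = np.loadtxt(io.StringIO(one_row), ndmin=2)
both = np.loadtxt(io.StringIO(two_rows))

print(squeezed.shape)  # (3,)
print(kept_2d.shape)   # (1, 3)
print(both.shape)      # (2, 3)
```
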
Oh, well I personally think that default squeeze is an abomination :).

Anyway, I just wanted to point out that these are two different possible
logics, and we have to pick one.
I have a slight preference for the indexing/array-like interpretation,
but I am aware that from a usage point of view the sequence one is
likely better.
I could throw in another option: raise an explicit error instead of the
general one.
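
A rough sketch of the two logics, using plain indexing on an already loaded
array rather than loadtxt itself. The helper names select_like_indexing and
select_like_sequence are hypothetical and only illustrate the choice; they
are not what loadtxt does internally.

```python
import numpy as np

# Stands in for the fully loaded file: 3 rows, 4 columns.
arr = np.arange(12.0).reshape(3, 4)

def select_like_indexing(data, usecols):
    # Indexing interpretation: the shape of usecols shapes the result.
    return data[:, usecols]

def select_like_sequence(data, usecols):
    # Sequence interpretation: a bare int is shorthand for a
    # one-element sequence, so the result is always 2-D.
    cols = [usecols] if isinstance(usecols, int) else list(usecols)
    return data[:, cols]

print(select_like_indexing(arr, 1).shape)    # (3,)
print(select_like_indexing(arr, [1]).shape)  # (3, 1)
print(select_like_sequence(arr, 1).shape)    # (3, 1)
print(select_like_sequence(arr, [1]).shape)  # (3, 1)
```

Only the scalar case differs between the two, which is exactly the decision
being discussed.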

Anyway, I *really* do not have an opinion about what is better.

Array-like would only suggest that you also accept buffer interface
objects or array_interface stuff, which in this case is really
unnecessary, I think.
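
As an illustration of the "sequence of ints or int" reading combined with an
explicit error, something like the following would do. normalize_usecols is a
hypothetical helper written for this message, not an existing NumPy function,
and the error wording is an assumption.

```python
import operator
import numpy as np

def normalize_usecols(usecols):
    """Hypothetical helper: return usecols as a flat list of Python ints."""
    try:
        # A single integer-like value (int, np.int64, ...) becomes
        # a one-element list.
        return [operator.index(usecols)]
    except TypeError:
        pass
    try:
        # Otherwise every element must itself be integer-like.
        return [operator.index(c) for c in usecols]
    except TypeError:
        # Explicit error instead of whatever fails further down.
        raise TypeError(
            "usecols must be an int or a sequence of ints, got %r" % (usecols,)
        ) from None

print(normalize_usecols(2))                  # [2]
print(normalize_usecols((0, np.int64(3))))   # [0, 3]
```

With this reading, nested input such as [[2]] is rejected up front rather
than being reshaped silently.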

- Sebastian


> 
> Ben Root
> 
> 
> On Tue, Nov 10, 2015 at 10:07 AM, Irvin Probst
> <irvin.probst at ensta-bretagne.fr> wrote:
>         On 10/11/2015 14:17, Sebastian Berg wrote:
>                 Actually, it is the "sequence special case" type ;).
>                 (matlab does not
>                 have this, since matlab always returns 2-D I
>                 realized).
>                 
>                 As I said, if usecols is like indexing, the result
>                 should mimic:
>                 
>                 arr = np.loadtxt(f)
>                 arr = arr[:, usecols]
>                 
>                 in which case a 1-D array is returned if you put in a
>                 scalar into
>                 usecols (and you could even generalize usecols to
>                 higher dimensional
>                 array-likes).
>                 The way you implemented it -- which is fine, but I
>                 want to stress that
>                 there is a real decision being made here --, you
>                 always see it as a
>                 sequence but allow a scalar for convenience (i.e.
>                 always return a 2-D
>                 array). It is a `sequence of ints or int` type
>                 argument and not an
>                 array-like argument in my opinion.
>         
>         I think we have two separate problems here:
>         
>         The first one is whether loadtxt should always return a 2D
>         array or whether it should match the shape of the usecols
>         argument. From a CS guy point of view I do understand your
>         concern here. Now from a teacher point of view I know many
>         people expect to get a "matrix" (thank you Matlab...) and the
>         "purity" of matching the dimension of the usecols variable will
>         be seen by many people [1] as nerdy, useless heaviness no one
>         cares about (no offense). So whatever you, seasoned numpy devs
>         from this mailing list, decide, I think it should be explained
>         in the docstring with very clear wording.
>         
>         My own opinion on this first problem is that loadtxt() should
>         always return a 2D array, no less, no more. If I write
>         np.loadtxt(f)[42] it means I want to read the whole file and
>         then I explicitly ask for transforming the 2-D array
>         loadtxt() returned into a 1-D array. Otoh if I write
>         loadtxt(f, usecols=42) it means I don't want to read the other
>         columns and I want only this one, but it does not mean that I
>         want to change the returned array from 2-D to 1-D. I know this
>         new behavior might break a lot of existing code as
>         usecols=(42,) used to return a 1-D array, but
>         usecols=((((42,)))) also returns a 1-D array so the current
>         behavior is not consistent imho.
>         
>         The second problem is about the wording in the docstring: when
>         I see "sequence of int or int" I understand I will have to cast
>         into a 1-D python list whatever wicked N-dimensional object I
>         use to store my column indexes, or hope list(my_object) will
>         do it fine. On the other hand when I read "array-like" the
>         function is telling me I don't have to worry about my object,
>         as long as numpy knows how to cast it into an array it will be
>         fine.
>         
>         Anyway I think something like that:
>         
>         import numpy as np
>         a=[[[2,],[],[],],[],[],[]]
>         foo=np.loadtxt("CONCARNEAU_2010.txt", usecols=a)
>         
>         should just work and return me a 2-D (or 1-D if you like)
>         array with the data I asked for and I don't think "a" here is
>         an int or a sequence of int (but it's a good example of why
>         loadtxt() should not match the shape of the usecols argument).
>         
>         To make it short, let the reading function read the data in a
>         consistent and predictable way and then let the user
>         explicitly change the data's shape into anything he likes.
>         
>         Regards.
>         
>         [1] read: non-CS people trying to switch to numpy/scipy
>         
>         _______________________________________________
>         NumPy-Discussion mailing list
>         NumPy-Discussion at scipy.org
>         https://mail.scipy.org/mailman/listinfo/numpy-discussion
>         
