[SciPy-User] Suggestion for numpy.genfromtxt documentation

Fri Oct 9 11:21:06 EDT 2009

On 10/09/2009 09:21 AM, Skipper Seabold wrote:
> On Wed, Oct 7, 2009 at 3:20 PM, Bruce Southey<bsouthey at gmail.com>  wrote:
>    
>> On 10/07/2009 10:52 AM, Skipper Seabold wrote:
>>      
>>> On Wed, Oct 7, 2009 at 11:25 AM, Dharhas Pothina
>>> <Dharhas.Pothina at twdb.state.tx.us>    wrote:
>>>
>>>        
>>>> Hi,
>>>>
>>>> It took me a while and a lot of trial and error to work out why this didn't work as expected.
>>>>
>>>> data = np.genfromtxt(fname,usecols=(2,3,4),names='x,y,z')
>>>>
>>>> This command runs without any warnings or errors, but returns a plain numpy array with no field names. If you use:
>>>>
>>>> data = np.genfromtxt(fname,usecols=(2,3,4),dtype=None,names='x,y,z')
>>>>
>>>> then the command does what I expect and returns a structured numpy array with field names. So essentially, the 'names' argument does not work unless you also specify the 'dtype' argument.
>>>>
>>>>          
>> What did you actually expect?
>> It would be very informative if you could provide a simple example of
>> this for testing.
>>
>> There are many combinations of arguments so not all have been tested and
>> it is not always clear what the expected behavior should be.
>>
>>      
>>>> I think it would be less confusing to new users either to have this explicitly mentioned in the docstring for the genfromtxt 'names' argument, or to have the function default to 'dtype=None' if the 'names' argument is specified without the 'dtype' argument.
>>>>
>>>> - dharhas
>>>>
>>>>          
>>> I came across this behavior recently and agree with you.  There is a
>>> patch in the works for this.
>>>
>>> See this thread: http://thread.gmane.org/gmane.comp.python.numeric.general/33479
>>>
>>> And this ticket: http://projects.scipy.org/numpy/ticket/1252
>>>
>>> Cheers,
>>>
>>> Skipper
>>>
>>>        
>>   From the numpy help, there is this example:
>> data = np.genfromtxt(s, dtype=[('myint','i8'),('myfloat','f8'),
>> ('mystring','S5')], delimiter=",")
>>
>> It does not help that the dtype of structured arrays also includes the
>> actual name. So I do not think we can use the dtype argument without
>> using the combination of dtype and name. Perhaps it would help if dtype
>> were split into names and formats, so that dtype=('name', 'format').
>>
>> In some sense you are suggesting that we should have something like:
>>
>>      
> With the defaultfmt keyword added and the new changes here is the
> current state of things.
>    
Which version is that?
(Okay, I have updated from SVN but have not tried to build it because of
the recent issues with import.)

> from StringIO import StringIO
> import numpy as np
>
> s = StringIO("1,2,3.0")
>
>    
>> Ignore the use of None and True for dtype and names arguments:
>> i) If only dtype is specified then use the specified dtype and add
>> default names such as col1, col2,... if necessary
>>
>>      
> This gives a plain array, so no default names are used.
>
> data = np.genfromtxt(s, delimiter=",") # dtype=float
>
> In [54]: data
> Out[54]: array([ 1.,  2.,  3.])
>    
Rats, I forgot about plain arrays. But this is a bug because the default
argument is defaultfmt="f%i". If this option is kept then I think the
default argument of defaultfmt should be None.
> If default names are specified then it doesn't seem to pick them up as
> of right now.
>
> s.seek(0)
> data = np.genfromtxt(s, delimiter=",", defaultfmt="Var%i")
>
> In [79]: data
> Out[79]: array([ 1.,  2.,  3.])
>    
This is also a bug.
>
>    
>> ii) If only names is specified then construct the dtype as ('name',
>> 'default format')
>>      
> s.seek(0)
> data = np.genfromtxt(s, delimiter=",", names=['var1','var2','var3'])
> #dtype = float
>
> In [57]: data
> Out[57]:
> array((1.0, 2.0, 3.0),
>        dtype=[('var1', '<f8'), ('var2', '<f8'), ('var3', '<f8')])
>
>    
Excellent, that is what I expected.
>> iii) If only formats is specified then construct the dtype as ('default
>> name', 'format')
>>      
> This doesn't seem to work with the new easy dtype as noted above.
>
> But this does
>
> s.seek(0)
> data = np.genfromtxt(s, delimiter=",", dtype=(int,int,float),
> defaultfmt="var%i")
>
> In [72]: data
> Out[72]:
> array((1, 2, 3.0),
>        dtype=[('var0', '<i8'), ('var1', '<i8'), ('var2', '<f8')])
>
>    
I forgot that a plain array could be desired.
>> iv) If only names and formats are specified then construct the
>> dtype as ('name', 'format')
>>
>>      
> So I think this means,
>
> s.seek(0)
> data = np.genfromtxt(s, delimiter=",", dtype=(int,int,float),
> names=['var1','var2','var3'])
>
> In [86]: data
> Out[86]:
> array((1, 2, 3.0),
>        dtype=[('var1', '<i8'), ('var2', '<i8'), ('var3', '<f8')])
>
>    
Yes that is what I meant.
>    
>> v) If none of dtype, names and formats is specified then construct the
>> dtype as ('default name', 'default format')
>>
>>      
> Same case as above I think where
>
> s.seek(0)
> data = np.genfromtxt(s, delimiter=",", defaultfmt="var%i")
>
> doesn't work as "expected" to zip float (the default format) with the
> default name, specified by defaultfmt.
>    
Again, I forgot about plain arrays, which would be the case here.
>    
>> vi) If dtype and names or formats are specified then use dtype if it is
>> of the form ('name', 'format') or use one of the previous cases.
>>
>>      
> This seems to be the case for defaultfmt,
>
> s.seek(0)
> data = np.genfromtxt(s,
> dtype=[('var1',int),('var2',int),('var3',float)], delimiter=",",
> defaultfmt="VAR%i")
>
> In [99]: data
> Out[99]:
> array((1, 2, 3.0),
>        dtype=[('var1', '<i8'), ('var2', '<i8'), ('var3', '<f8')])
>
> But if names is specified, then it's never ignored
>
> s.seek(0)
> data = np.genfromtxt(s,
> dtype=[('var1',int),('var2',int),('var3',float)], delimiter=",",
> names=['VAR1','VAR2','VAR3'])
>
> In [102]: data
> Out[102]:
> array((1, 2, 3.0),
>        dtype=[('VAR1', '<i8'), ('VAR2', '<i8'), ('VAR3', '<f8')])
>
>    
Here the problem is which user input overrides the other. As long as it 
is clearly documented what happens then I do not care (I care when 
things are not stated).
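
The precedence shown in the quoted output can be checked directly. This is only a minimal sketch, assuming Python 3 (io.StringIO instead of the StringIO module used above) and a recent NumPy, whose behavior may differ from the SVN state discussed in this thread. The variable names kept and renamed are made up for illustration.

```python
from io import StringIO

import numpy as np

# Field names given inside dtype; assumed sample data from the thread.
fields = [("var1", int), ("var2", int), ("var3", float)]
s = StringIO("1,2,3.0")

# defaultfmt alone does not replace names that dtype already carries.
kept = np.genfromtxt(s, dtype=fields, delimiter=",", defaultfmt="VAR%i")
print(kept.dtype.names)

# An explicit names argument overrides the names given in dtype.
s.seek(0)
renamed = np.genfromtxt(s, dtype=fields, delimiter=",",
                        names=["VAR1", "VAR2", "VAR3"])
print(renamed.dtype.names)
```

So, at least in this sketch, names wins over dtype, and dtype wins over defaultfmt.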
>> When dtype is None this implies format is None so the format is obtained
>> from the data. If names is not True then the names are either from the
>> argument or default values.
>>
>>      
> Well, genfromtxt returns plain arrays too, so if names is neither True
> nor given explicitly, then we can't give default values.  I think defaultfmt
> should have a True argument as well, that way you can return a
> structured array with f0, f1, f2 as the names if that's what you want.
>    
Yes, I forgot about the plain array case or having no named fields.

But I think that this could still be handled by the names argument. If a
user does not specify any names (names=None) and no dtype (or all
columns have the same dtype) then we have to return a plain array. I
presume, because I have not tested it, that names='var%i' should work, so
we could have 'names=False' mean the same as names='var%i'. Also, I
think that a structured array results when different dtypes are
specified, so that should automatically have the same effect as
names='var%i'.
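
For comparison, here is a rough sketch of how generated field names come out when different formats are given without any names. It assumes Python 3 and a recent NumPy, not the 2009 SVN build under discussion. It does not test the names='var%i' idea, only the existing defaultfmt keyword, and the variable names auto and custom are hypothetical.

```python
from io import StringIO

import numpy as np

s = StringIO("1,2,3.0")

# A tuple of formats with no names: the fields get generated names.
auto = np.genfromtxt(s, delimiter=",", dtype=(int, int, float))
print(auto.dtype.names)

# defaultfmt changes the pattern used for the generated names.
s.seek(0)
custom = np.genfromtxt(s, delimiter=",", dtype=(int, int, float),
                       defaultfmt="var%i")
print(custom.dtype.names)
```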

>> If names argument is True then the names should be read from the data
>> and one of the previous cases apply.
>>
>>      
> It's a bit confusing to think of data type "formats" and have the
> defaultfmt, perhaps it should be defaultnm?
>    
I agree. With formats, I expect things like different character and 
numeric types. If we can add this to the names argument then we should 
not need it.

> So in sum, I think we should maybe have a True argument for
> defaultfmt, maybe change the name to defaultnm to avoid confusion, and
> have it so the easy dtype construction works with defaultfmt.  I will
> comment on the open tickets.
>
> Anything I missed?
>
> Skipper
>    
Excellent job, but you missed the case where the names are supplied in the header.
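
A minimal sketch of that header case, assuming Python 3 and a recent NumPy. The header row a,b,c and the two data rows are invented for illustration.

```python
from io import StringIO

import numpy as np

# First line holds the column names; the remaining lines hold the data.
s = StringIO("a,b,c\n1,2,3.0\n4,5,6.0")

# names=True asks genfromtxt to read the field names from the first line.
data = np.genfromtxt(s, delimiter=",", names=True)
print(data.dtype.names)
print(data["c"])
```

With no dtype given, every field here is read as a float.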

Bruce