[Numpy-discussion] More loadtxt() changes

Ryan May rmay31 at gmail.com
Tue Nov 25 14:37:42 EST 2008


Pierre GM wrote:
> On Nov 25, 2008, at 2:06 PM, Ryan May wrote:
>> 1) It looks like the function returns a structured array rather than a
>> rec array, so that fields are obtained by doing a dictionary access.
>> Since it's a dictionary access, is there any reason that the header
>> needs to be munged to replace characters and reserved names?  IIUC,
>> csv2rec changes names b/c it returns a rec array, which uses attribute
>> lookup and hence all names need to be valid Python identifiers. This
>> is not the case for a structured array.
> 
> Personally, I prefer flexible ndarrays to recarrays, hence the output.  
> However, I still think that names should be as clean as possible to  
> avoid bad surprises down the road.

Ok, I'm not really partial to this; I just thought it would simplify 
things.  Your point is valid.
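
For anyone following along, the distinction I was getting at is just 
this (a toy example; the field name with spaces is made up purely for 
illustration):

import numpy as np

# A structured array looks fields up by string key, so an unmunged
# header name could in principle be kept verbatim:
data = np.array([(1.0, 2.5), (3.0, 4.5)],
                dtype=[('time', float), ('wind speed (m/s)', float)])
print(data['wind speed (m/s)'])   # string lookup, no munging required

# A recarray exposes fields as attributes, so names must be valid
# Python identifiers, which is why csv2rec cleans them up:
rec = data.view(np.recarray)
print(rec.time)                   # rec.wind speed (m/s) isn't possible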

>> 2) Can we avoid the use of seek() in here?  I just posted a patch to
>> change the check to readline, which was the only file function used
>> previously.  This allowed the direct use of a file-like object  
>> returned
>> by urllib2.urlopen().
> 
> I coded that a couple of weeks ago, before you posted your patch, and
> I didn't have time to check it. Yes, we could try getting rid of
> seek(). However, we need to find a way to rewind to the beginning of
> the file if the dtypes are not given as input (since we parse the
> whole file to find the best converter in that case).

What about doing the parsing and type inference in a single loop while 
holding onto the already-split lines?  Then a second loop can apply the 
converters that were finally chosen to the stored lines.  In addition to 
making my use case work, this has the benefit of not doing the I/O twice.
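
Something along these lines is what I have in mind (just a sketch; 
read_and_convert and _guess_converter are names I'm inventing here, not 
anything from the posted patch):

def _guess_converter(column):
    # Hypothetical helper: try int, then float, then fall back to str.
    for typ in (int, float):
        try:
            for val in column:
                typ(val)
            return typ
        except ValueError:
            continue
    return str

def read_and_convert(fhd, delimiter=None):
    # Single pass over the file object: split each line as it's read and
    # keep the pieces, so we never need to seek() back.  This should also
    # work for the file-like objects returned by urllib2.urlopen().
    rows = []
    for line in fhd:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        rows.append(line.split(delimiter))

    # Choose one converter per column from the stored rows...
    converters = [_guess_converter(col) for col in zip(*rows)]

    # ...then the second loop runs over the in-memory rows, not the
    # file, so the I/O only happens once.
    return [[conv(val) for conv, val in zip(converters, row)]
            for row in rows]

The trade-off is holding the split lines in memory, but the converted 
rows need to be held anyway before the final array can be built.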

>> 3) In order to avoid breaking backwards compatibility, can we change
>> the default for dtype to be float, and instead use some kind of
>> special value ('auto' ?) to trigger the automatic dtype determination?
> 
> I'm not especially concerned w/ backwards compatibility, because we're  
> supporting masked values (something that np.loadtxt shouldn't have to  
> worry about). Initially, I needed a replacement for the fromfile
> function in the scikits.timeseries.trecords package. I figured it'd be
> easier and more portable to get a function for generic masked arrays
> that could be adapted afterwards to timeseries. In any case, I was
> thinking of the functions I sent you more as part of some numpy.ma.io
> module than as a replacement for np.loadtxt. I tried to get
> the syntax as close as possible to np.loadtxt and mlab.csv2rec, but  
> there'll always be some differences.
> 
> So, yes, we could try to use a default dtype=float and yes, we could  
> have an extra parameter 'auto'. But is it really that useful? I'm not
> sure (well, no, I'm sure it's not...)

I understand you're not concerned with backwards compatibility, but with 
the exception of missing-value handling, which is probably specific to 
masked arrays, I was hoping to just add functionality to loadtxt(). 
Numpy doesn't need a separate text reader for most of this, and breaking 
the API for any of it is likely a non-starter.  So while, yes, having 
float as the default dtype is probably not the most useful choice, 
keeping it doesn't break any existing code.
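
Concretely, for (3) I'm picturing nothing more than a sentinel value, 
roughly like this (a minimal sketch; 'auto' and the wrapper name are 
made up here):

import numpy as np

def loadtxt_like(fname, dtype=float, **kwargs):
    # 'auto' is a hypothetical sentinel: only callers who ask for it get
    # the new type-inference path; everyone else sees today's behaviour.
    if isinstance(dtype, str) and dtype == 'auto':
        raise NotImplementedError("automatic dtype determination goes "
                                  "here (e.g. the stored-rows idea above)")
    # Existing calls such as loadtxt_like('data.txt') keep returning a
    # plain float array, exactly as np.loadtxt does now.
    return np.loadtxt(fname, dtype=dtype, **kwargs)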

-- 
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma


