On Nov 25, 2008, at 2:06 PM, Ryan May wrote:
1) It looks like the function returns a structured array rather than a rec array, so that fields are obtained by doing a dictionary access. Since it's a dictionary access, is there any reason that the header needs to be munged to replace characters and reserved names? IIUC, csv2rec changes names b/c it returns a rec array, which uses attribute lookup and hence all names need to be valid python identifiers. This is not the case for a structured array.
Personally, I prefer flexible ndarrays to recarrays, hence the output. However, I still think that names should be as clean as possible to avoid bad surprises down the road.
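To make the distinction in point 1 concrete, here is a minimal sketch (not taken from the function under discussion; the field names are made up for illustration) of how a structured array and a recarray view of it behave when a field name is not a valid Python identifier:

```python
import numpy as np

# Hypothetical field names: one contains a space, one is a reserved word.
data = np.array([(1.0, 2.0), (3.0, 4.0)],
                dtype=[("wind speed", float), ("class", float)])

# Item access on a structured array accepts any field name.
speeds = data["wind speed"]

# A recarray view adds attribute access on top of item access.
rec = data.view(np.recarray)
same_speeds = rec["wind speed"]      # item access still works

# "rec.class" or "rec.wind speed" would be syntax errors, so such
# fields are only reachable through getattr() or item access.
classes = getattr(rec, "class")
```

Unmunged names work either way, but attribute access becomes awkward or impossible on a recarray view, which is one of the surprises that clean names avoid.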
2) Can we avoid the use of seek() in here? I just posted a patch to change the check to readline, which was the only file function used previously. This allowed the direct use of a file-like object returned by urllib2.urlopen().
I coded that a couple of weeks ago, before you posted your patch, and I didn't have time to check it. Yes, we could try getting rid of seek. However, we need to find a way to rewind to the beginning of the file if the dtypes are not given as input (since in that case we parse the whole file to find the best converter).
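One possible way around the rewind, sketched under assumptions: cache the lines during the first pass and run the conversion over the cache instead of the stream. The names `NoSeekStream`, `read_all_lines`, `guess_type` and `load` are hypothetical and are not part of the code I sent; the type guessing is deliberately simplistic.

```python
class NoSeekStream:
    """Hypothetical stand-in for a stream that only offers readline()."""
    def __init__(self, text):
        self._lines = iter(text.splitlines(True))

    def readline(self):
        # Return "" at end of input, like a real file object.
        return next(self._lines, "")


def read_all_lines(stream):
    # iter(callable, sentinel) keeps calling readline() until it returns "".
    return list(iter(stream.readline, ""))


def guess_type(values):
    """Return the first of int, float, str that converts every value."""
    for conv in (int, float):
        try:
            for v in values:
                conv(v)
            return conv
        except ValueError:
            continue
    return str


def load(stream, delimiter=","):
    lines = read_all_lines(stream)   # the only pass over the stream itself
    rows = [ln.strip().split(delimiter) for ln in lines if ln.strip()]
    converters = [guess_type(col) for col in zip(*rows)]
    # The "second pass" runs over the cached rows, so no seek() is needed.
    return [tuple(c(v) for c, v in zip(converters, row)) for row in rows]


result = load(NoSeekStream("1,2.5,a\n3,4,b\n"))
```

The trade-off is memory: the whole input is held as text while the converters are chosen, which may or may not be acceptable for large files.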
3) In order to avoid breaking backwards compatibility, can we change the default for dtype to be float32, and instead use some kind of special value ('auto'?) to use the automatic dtype determination?
I'm not especially concerned w/ backwards compatibility, because we're supporting masked values (something that np.loadtxt shouldn't have to worry about). Initially, I needed a replacement for the fromfile function in the scikits.timeseries.trecords package. I figured it'd be easier and more portable to write a function for generic masked arrays that could be adapted afterwards to timeseries. In any case, I was thinking of the functions I sent you as part of some numpy.ma.io module rather than as a replacement for np.loadtxt. I tried to get the syntax as close as possible to np.loadtxt and mlab.csv2rec, but there'll always be some differences. So, yes, we could try to use a default dtype=float and yes, we could have an extra parameter 'auto'. But is it really that useful? I'm not sure (well, no, I'm sure it's not...)
I'm currently cooking up some of these changes myself, but thought I would see what you thought first.