On Tue, Nov 25, 2008 at 12:16 PM, Pierre GM wrote:
À la mlab.csv2rec? It could work with a bit more tweaking, basically following the path of John Hunter et al. What happens when the column names are unknown (read from the header) or wrong?
Actually, I'd like John to comment on that, hence the CC. More generally, wouldn't it be useful to push the recarray-manipulating functions from matplotlib.mlab to numpy?
Yes, I've said on a number of occasions I'd like to see these functions in numpy, since a number of them make more sense as numpy methods than as standalone functions.
What happens when the column names are unknown (read from the header) or wrong?
I'm not quite sure what you are looking for here. Either the user has to know the correct column name or column number, or you should raise an error. I think supporting column names everywhere they make sense is critical, since this is how most people think about these CSV-like files with column headers.

One other thing that is essential for me is that date support is included. Virtually every CSV file I work with has date data in it, in a variety of formats, and I depend on csv2rec (via dateutil.parser.parse, which mpl ships) to handle it w/o any extra cognitive overhead, albeit at the expense of some performance overhead, but my files aren't too big. I'm not sure how numpy would handle the date parsing aspect, but I think this came up in the date datatype PEP discussion. For me, having to manually specify a date converter with the proper format string every time I load a CSV file is probably not viable.

Another feature that is critical to me is to be able to get a np.recarray back instead of a plain structured (record-dtype) ndarray. I use these all day long, and the convenience of r.date over r['date'] is too much for me to give up.

Feel free to ignore these suggestions if they are too burdensome or not appropriate for numpy -- I'm just letting you know some of the things I need to see before I personally would stop using mlab.csv2rec and use numpy.loadtxt instead.

One last thing: I consider the masked array support in csv2rec somewhat broken, because when using a masked array you cannot get at the data (e.g. datetime methods or string methods) directly using the same interface that regular recarrays use. Pierre, last time I brought this up you asked for some example code and indicated a willingness to work on it, but I fell behind and never posted it. The code illustrating the problem is at the end of this message.
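
To make the requests above concrete (column names taken from the header, an error for unknown names, a date converter, and attribute access on the result), here is a minimal sketch. It is not how csv2rec or numpy.loadtxt work internally. The names `parse_date`, `COLUMNS` and `load_records` are hypothetical, the date format is hard-coded (which is exactly the per-file burden described above), and dates are stored as datetime64 values rather than datetime objects, which is an assumption of this sketch.

```python
import csv
import io
from datetime import datetime

import numpy as np

CSV_TEXT = """\
date,name,age,weight
2008-10-12,Bill,22,125.
2008-10-13,Tom,23,135.
2008-10-14,Sally,23,145.
"""


def parse_date(text, fmt="%Y-%m-%d"):
    # Explicit format string: the manual step the message would like to avoid.
    return datetime.strptime(text, fmt).date()


# Hypothetical column spec: name -> (converter, numpy dtype code).
COLUMNS = {
    "date": (parse_date, "M8[D]"),
    "name": (str, "U16"),
    "age": (int, "i8"),
    "weight": (float, "f8"),
}


def load_records(text, columns=COLUMNS):
    """Hypothetical helper; not a numpy or matplotlib function."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    # Unknown or misspelled column names raise instead of being guessed.
    unknown = [name for name in columns if name not in header]
    if unknown:
        raise KeyError("unknown column name(s): " + ", ".join(unknown))
    rows = [
        tuple(convert(row[name]) for name, (convert, _) in columns.items())
        for row in reader
    ]
    dtype = [(name, code) for name, (_, code) in columns.items()]
    # Build a structured array, then view it as np.recarray so that
    # fields are reachable as attributes (r.date) as well as keys (r['date']).
    return np.array(rows, dtype=dtype).view(np.recarray)


r = load_records(CSV_TEXT)
print(r.age, r["age"])          # same column, two spellings
print(r.date[0].item().year)    # datetime64 -> datetime.date -> year
```

The `.view(np.recarray)` call only changes how the array is accessed, not the data it holds, so the structured-array spelling `r['date']` keeps working alongside `r.date`.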

I'm really not sure what the right solution is, but the current implementation -- sometimes returning a plain-vanilla rec array, sometimes returning a masked record array, with different interfaces -- is not good. Perhaps the best solution is to force the user to ask for masked support, and then always return a masked array whether any of the data is masked or not. csv2rec conditionally returns a masked array only if some of the data are masked, which makes it difficult to use.

JDH

Here is the problem I referred to above -- in f1 none of the rows are masked, so I can access the object attributes from the rows directly. In the 2nd example, row 3 has some missing data, so I get an mrecords recarray back, which does not allow me to directly access the valid data methods.

from StringIO import StringIO
import matplotlib.mlab as mlab

# No missing values: csv2rec returns a regular recarray.
f1 = StringIO("""\
date,name,age,weight
2008-10-12,'Bill',22,125.
2008-10-13,'Tom',23,135.
2008-10-14,'Sally',23,145.""")

r1 = mlab.csv2rec(f1)
row0 = r1[0]
print row0.date.year, row0.name.upper()

# Row 3 has missing name and age: csv2rec returns a masked record array.
f2 = StringIO("""\
date,name,age,weight
2008-10-12,'Bill',22,125.
2008-10-13,'Tom',23,135.
2008-10-14,'',,145.""")

r2 = mlab.csv2rec(f2)
row0 = r2[0]
# Same access as above, but on a row taken from the masked record array.
print row0.date.year, row0.name.upper()
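
The suggestion above (make masked output something the user asks for, then always return a masked array) could look roughly like the following. This is only a sketch under stated assumptions: `load_ages` and its `usemask` flag are hypothetical names, it handles a single integer column to stay short, and raising an error when values are missing but no mask was requested is one possible choice, not something the message prescribes.

```python
import csv
import io

import numpy as np

COMPLETE = "name,age\nBill,22\nTom,23\nSally,23\n"
WITH_GAP = "name,age\nBill,22\nTom,23\nSally,\n"


def load_ages(text, usemask=False):
    """Hypothetical loader whose return type depends only on `usemask`."""
    rows = list(csv.DictReader(io.StringIO(text)))
    missing = [row["age"] == "" for row in rows]
    # Missing cells get a placeholder of 0; it is only ever exposed masked.
    ages = np.array(
        [0 if gap else int(row["age"]) for row, gap in zip(rows, missing)]
    )
    if usemask:
        # Always a MaskedArray, whether or not anything is actually masked.
        return np.ma.MaskedArray(ages, mask=missing)
    if any(missing):
        # Assumption of this sketch: refuse rather than silently switch types.
        raise ValueError("missing values found; call again with usemask=True")
    return ages


print(type(load_ages(COMPLETE)))                 # plain ndarray
print(type(load_ages(COMPLETE, usemask=True)))   # masked, nothing hidden
print(load_ages(WITH_GAP, usemask=True))         # masked, last entry hidden
```

With this shape the calling code can tell from its own arguments which interface it will get back, instead of having it depend on the contents of the file.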