Re: [Numpy-discussion] More loadtxt() changes

25 Nov 2008

      On Tue, Nov 25, 2008 at 12:16 PM, Pierre GM  wrote:
...
A la mlab.csv2rec ? It could work with a bit more tweaking, basically
following John Hunter's et al. path. What happens when the column names are
unknown (read from the header) or wrong ?
Actually, I'd like John to comment on that, hence the CC. More generally,
wouldn't be useful to push the recarray manipulating functions from
matplotlib.mlab to numpy ?
Yes, I've said on a number of occasions I'd like to see these
functions in numpy, since a number of them make more sense as numpy
methods than as stand alone functions.
...
What happens when the column names are unknown (read from the header) or wrong ?
I'm not quite sure what you are looking for here.  Either the user
will have to know the correct column name or the column number or you
should raise an error.  I think supporting column names everywhere
they make sense is critical since this is how most people think about
these CSV-like files with column headers.

One other thing that is essential for me is that date support is
included.  Virtually every CSV file I work with has date data in it,
in a variety of formats, and I depend on csv2rec (via
dateutil.parser.parse which mpl ships) to be able to handle it w/o any
extra cognitive overhead, albeit at the expense of some performance
overhead, but my files aren't too big.  I'm not sure how numpy would
handle the date parsing aspect, but this came up in the date datatype
PEP discussion I think.  For me, having to manually specify a date
converter with the proper format string every time I load a CSV file
is probably not viable.

Another feature that is critical to me is to be able to get a
np.recarray back instead of a record array.  I use these all day long,
and the convenience of r.date over r['date'] is too much for me to
give up.

Feel free to ignore these suggestions if they are too burdensome or
not appropriate for numpy -- I'm just letting you know some of the
things I need to see before I personally would stop using mlab.csv2rec
 and use numpy.loadtxt instead.

One last thing, I consider the masked array support in csv2rec
somewhat broken because when using a masked array you cannot get at
the data (eg datetime methods or string methods) directly using the
same interface that regular recarrays use.  Pierre, last I brought
this up you asked for some example code and indicated a willingness to
work on it but I fell behind and never posted it.  The code
illustrating the problem is below.  I'm really not sure what the right
solution is, but the current implementation -- sometimes returning a
plain-vanilla rec array, sometimes returning a masked record array --
with different interfaces is not good.

Perhaps the best solution is to force the user to ask for masked
support, and then always return a masked array whether any of the data
is masked or not.  csv2rec conditionally returns a masked array only
if some of the data are masked, which makes it difficult to use.

JDH

Here is the problem I referred to above -- in f1 none of the rows are
masked and so I can access the object attributes from the rows
directly.  In the 2nd example, row 3 has some missing data so I get an
mrecords recarray back, which does not allow me to directly access the
valid data methods.

from  StringIO import StringIO
import matplotlib.mlab as mlab
f1 = StringIO("""\
date,name,age,weight
2008-10-12,'Bill',22,125.
2008-10-13,'Tom',23,135.
2008-10-14,'Sally',23,145."""
)

r1 = mlab.csv2rec(f1)
row0 = r1[0]
print row0.date.year, row0.name.upper()

f2 = StringIO("""\
date,name,age,weight
2008-10-12,'Bill',22,125.
2008-10-13,'Tom',23,135.
2008-10-14,'',,145."""
)

r2 = mlab.csv2rec(f2)
row0 = r2[0]
print row0.date.year, row0.name.upper()