[Numpy-discussion] More loadtxt() changes

Tue Nov 25 15:01:17 EST 2008

On Nov 25, 2008, at 2:26 PM, John Hunter wrote:
>
> Yes, I've said on a number of occasions I'd like to see these
> functions in numpy, since a number of them make more sense as numpy
> methods than as stand alone functions.

Great. Could we think about getting that in for 1.3x? Would you have
time, or should we wait until early January?

> One other thing that is essential for me is that date support is
> included.

As I mentioned in an earlier post, I needed a replacement for a
function in scikits.timeseries, where we do need dates, but I also
needed something not too specific for numpy.ma. So I thought about
extracting the conversion methods from the bulk of the function and
creating a new object, StringConverter, that takes care of the
conversion. If you need to add date support, the simplest approach is
to extend StringConverter with the date/datetime functions just after
you import _preview (or numpy.ma.io if we go that path):

 >>> import datetime
 >>> import dateutil.parser
 >>> dateparser = dateutil.parser.parse
 >>> # Update the StringConverter mapper, so that date-like columns are
 >>> # automatically converted
 >>> _preview.StringConverter.mapper.insert(-1, (dateparser,
 ...                                             datetime.date(2000, 1, 1)))

That way, if a date is found in one of the columns, it'll be converted
appropriately. Seems to work pretty well for scikits.timeseries; I'll
try to post that in the next couple of weeks (once I've ironed out some
of the numpy.ma bugs...)
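
To make the mapper idea concrete, here is a minimal sketch of an ordered list of (converter, default) pairs. It is not the code in _preview: MAPPER, parse_date and convert_column are names made up for illustration, and it assumes ISO dates parsed with datetime.strptime rather than dateutil, so that it stays self-contained.

```python
import datetime


def parse_date(text):
    # Assumption: dates are ISO formatted (YYYY-MM-DD); dateutil accepts more.
    return datetime.datetime.strptime(text, "%Y-%m-%d").date()


# Ordered from most to least specific: (converter, default for missing values).
MAPPER = [
    (int, -1),
    (float, float("nan")),
    (str, "???"),
]
# Register the date parser just before the catch-all string entry.
MAPPER.insert(-1, (parse_date, datetime.date(2000, 1, 1)))


def convert_column(strings):
    """Convert a column with the first converter that accepts every value."""
    for func, default in MAPPER:
        try:
            # Empty strings are treated as missing and replaced by the default.
            return [func(s) if s else default for s in strings]
        except ValueError:
            continue
    raise ValueError("no converter accepted the column")
```

Inserting at position -1 keeps the string converter last, so dates are tried only after int and float have failed and before everything falls back to plain strings.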

> Another feature that is critical to me is to be able to get a
> np.recarray back instead of a record array.  I use these all day long,
> and the convenience of r.date over r['date'] is too much for me to
> give up.

No problem: just take a view once you have your output. I thought about
adding yet another parameter that would take care of that directly, but
then we'd end up with far too many keywords...
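
For what it's worth, taking the view is a one-liner. The small array below is only a stand-in for whatever the reader returns, and the field names are invented for the example.

```python
import numpy as np

# A structured array, standing in for the output of the text reader.
data = np.array([(1, 2.5), (2, 3.5)],
                dtype=[("station", int), ("value", float)])

# A view makes no copy and enables attribute access to the fields.
rec = data.view(np.recarray)
```

After that, rec.value and rec["value"] refer to the same column.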
>
> One last thing, I consider the masked array support in csv2rec
> somewhat broken because when using a masked array you cannot get at
> the data (eg datetime methods or string methods) directly using the
> same interface that regular recarrays use.

Well, it's more that mrecords is broken. I committed a fix a
little while back, but it might not be very robust. I need to check
it with your example.

> Perhaps the best solution is to force the user to ask for masked
> support, and then always return a masked array whether any of the data
> is masked or not.  csv2rec conditionally returns a masked array only
> if some of the data are masked, which makes it difficult to use.

Forcing a flexible masked array would make a lot of sense if we pushed
that function into numpy.ma.io. I don't think we should overload
np.loadtxt too much anyway...

On Nov 25, 2008, at 2:37 PM, Ryan May wrote:
>
> What about doing the parsing and type inference in a loop and holding
> onto the already split lines?  Then loop through the lines with the
> converters that were finally chosen?  In addition to making my usecase
> work, this has the benefit of not doing the I/O twice.

You mean filling a list and looping over it again if we need to? Sounds
like a plan, but doesn't it create some extra temporaries we may not
want?
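
A rough sketch of how I read the suggestion, in plain Python: keep the split lines in a list while inferring the column types, then loop over that list for the conversion. The names (load_rows, _guess, _RANK) are made up for illustration, and only int, float and str are considered.

```python
import io


def _guess(text):
    """Return the narrowest type (int, float or str) that can parse text."""
    for kind in (int, float):
        try:
            kind(text)
            return kind
        except ValueError:
            pass
    return str


# Widening order used to reconcile the guesses made on different rows.
_RANK = {int: 0, float: 1, str: 2}


def load_rows(fh, delimiter=","):
    rows = []     # split lines kept in memory: this is the extra temporary
    kinds = None  # widest type seen so far for each column
    for line in fh:
        if not line.strip():
            continue  # skip blank lines
        fields = [f.strip() for f in line.split(delimiter)]
        rows.append(fields)
        guessed = [_guess(f) for f in fields]
        if kinds is None:
            kinds = guessed
        else:
            kinds = [max(a, b, key=_RANK.get) for a, b in zip(kinds, guessed)]
    # Second loop runs over the stored fields, not over the file.
    return [tuple(k(f) for k, f in zip(kinds, row)) for row in rows]
```

The file is read only once, but rows holds every field as a string until the conversion is done, which is the temporary I'm worried about for large files.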

> I understand you're not concerned with backwards compatibility, but
> with the exception of missing handling, which is probably specific to
> masked arrays, I was hoping to just add functionality to loadtxt().
> Numpy doesn't need a separate text reader for most of this and breaking
> API for any of this is likely a non-starter.  So while, yes, having
> float be the default dtype is probably not the most useful, leaving it
> also doesn't break existing code.

Depends on how we do it. We could have a modified np.loadtxt that
takes some of the ideas from the file I sent you (the StringConverter,
for example), then I could have a numpy.ma.io that would take care of
the missing data, and something in scikits.timeseries for the dates...

The new np.loadtxt could use the defaults of the current one, or we
could create yet another function (np.loadfromtxt) that would match
what I was suggesting, and np.loadtxt would be a special stripped-down
case with dtype=float by default.

Thoughts?