[Numpy-discussion] `missing` argument in genfromtxt only a string?

Mon Sep 14 21:59:51 EDT 2009

On Sep 13, 2009, at 3:51 PM, Skipper Seabold wrote:

> On Sun, Sep 13, 2009 at 1:29 PM, Skipper Seabold  
> <jsseabold at gmail.com> wrote:
>> Is there a reason that the missing argument in genfromtxt only  
>> takes a string?

Because the missing values are compared against the raw strings read
from the file. Note that you can specify several strings at once,
provided they're separated by commas, like missing="0,nan,n/a".
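
A minimal sketch of the comma-separated form. It assumes a later NumPy release than the one discussed in this thread, where the keyword is spelled `missing_values` rather than `missing` and an `encoding` keyword is available. The sample data is made up for illustration.

```python
from io import StringIO

import numpy as np

# Made-up sample: "n/a" and "NA" stand in for missing entries.
text = StringIO("1,n/a,3\n4,5,NA\n")

masked = np.genfromtxt(
    text,
    delimiter=",",
    dtype=float,
    missing_values="n/a,NA",  # one string holding two comma-separated markers
    usemask=True,             # return a masked array flagging those fields
    encoding="utf-8",
)

# The mask marks the fields whose text matched one of the markers.
print(masked.mask)

# Replace the masked entries with 0, which is what the original question asks for.
filled = masked.filled(0)
print(filled)
```

Going through a masked array keeps the "which fields were missing" information separate from the choice of replacement value.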

>> For instance, I have a dataset that in most columns has a zero for
>> some observations but in others it was just left blank, which is the
>> equivalent of zero.  I would like to set all of the missing to 0 (it
>> defaults to -1 now) when loading in the data.  I suppose I could do
>> this with a converter, but I have too many columns for this.

OK, I see. Gonna try to find some fix.

> All of the missing values in the second observation are now -1.  Also,
> I'm having trouble defining a converter for my dates.
>
> I have the function
>
> from datetime import datetime
>
> def str2date(date):
>    day,month,year = date.strip().split('/')
>    return datetime(*map(int, [year, month, day]))
>
> conv = {1 : lambda s: str2date(s)}
> s.seek(0)
> data = np.genfromtxt(s, dtype=None, delimiter=",", names=None,  
> converters=conv)

OK, I see the problem...
When no dtype is defined, we try to guess what a converter should
return by testing its inputs. First we check whether the input is a
boolean, then whether it's an integer, then a float, and so on. When
you explicitly define a converter, there's no need for all those
checks, so we lock the converter to a particular state, which sets the
conversion function and the value to return for a missing entry.
Except that I messed it up and it fails in that case (the conversion
function is set properly, but the dtype of the output is still
undefined). That's a bug, I'll try to fix that once I've tamed my snow
kitten.
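
The guess-then-lock behaviour described above can be sketched roughly as follows. This is not NumPy's actual implementation: the class name `GuessingConverter`, the candidate list, and the defaults for floats and strings are all hypothetical, chosen only for illustration (the -1 default for integers is the one mentioned earlier in the thread).

```python
def _to_bool(text):
    """Accept only the literal words true/false (illustrative rule)."""
    lowered = text.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise ValueError(text)


# (conversion function, value returned for a missing field), tried in order.
# Hypothetical table; only the -1 default for ints comes from the thread.
CANDIDATES = [(_to_bool, False), (int, -1), (float, float("nan")), (str, "???")]


class GuessingConverter:
    """Hypothetical converter that upgrades bool -> int -> float -> str."""

    def __init__(self, missing=("",)):
        self.missing = set(missing)
        self.level = 0
        self.locked = False
        self.func, self.default = CANDIDATES[0]

    def lock(self, func, default=None):
        # A user-supplied converter: fix the function and the missing value,
        # and stop guessing from now on.
        self.func, self.default, self.locked = func, default, True

    def __call__(self, text):
        if text.strip() in self.missing:
            return self.default
        while True:
            try:
                return self.func(text)
            except ValueError:
                # A locked converter, or one with no candidates left, fails.
                if self.locked or self.level + 1 >= len(CANDIDATES):
                    raise
                self.level += 1
                self.func, self.default = CANDIDATES[self.level]


guess = GuessingConverter()
values = [guess(field) for field in ["1", "2.5"]]  # upgrades to int, then float
print(values, guess.level)
```

Once `lock` has been called, a failed conversion raises instead of silently moving on to the next candidate type.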
Meanwhile, you can use tsfromtxt (in scikits.timeseries) or, even
simpler, define a dtype for the output (you know that your first
column is a str, your second an object, and the others ints or floats...).
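
A minimal sketch of the explicit-dtype workaround. The sample rows and the field names (`name`, `date`, `count`, `value`) are made up for illustration, and the `encoding` keyword assumes a later NumPy release than the one discussed here; it is passed so that the converter receives `str` rather than bytes.

```python
from datetime import datetime
from io import StringIO

import numpy as np


def str2date(date):
    """Parse a day/month/year string into a datetime."""
    day, month, year = date.strip().split("/")
    return datetime(int(year), int(month), int(day))


# Made-up sample rows: a name, a day/month/year date, an int, and a float.
s = StringIO("a,13/09/2009,1,2.5\nb,14/09/2009,3,4.5\n")

# Spell out the output dtype so nothing has to be guessed:
# str for the first column, object for the dates, numbers for the rest.
dtype = [("name", "U8"), ("date", object), ("count", int), ("value", float)]

data = np.genfromtxt(
    s,
    dtype=dtype,
    delimiter=",",
    converters={1: str2date},  # column 1 holds the dates
    encoding="utf-8",
)

print(data["date"])
```

With the dtype given up front, the date column is stored as Python objects and the remaining columns keep their numeric types.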