[Numpy-discussion] `missing` argument in genfromtxt only a string?

Skipper Seabold jsseabold at gmail.com
Mon Sep 14 22:31:39 EDT 2009


On Mon, Sep 14, 2009 at 9:59 PM, Pierre GM <pgmdevlist at gmail.com> wrote:
>
> On Sep 13, 2009, at 3:51 PM, Skipper Seabold wrote:
>
>> On Sun, Sep 13, 2009 at 1:29 PM, Skipper Seabold
>> <jsseabold at gmail.com> wrote:
>>> Is there a reason that the missing argument in genfromtxt only
>>> takes a string?
>
> Because we check strings. Note that you can specify several characters
> at once, provided they're separated by a comma, like missing="0,nan,n/a"
>
>>> For instance, I have a dataset that in most columns has a zero for
>>> some observations but in others it was just left blank, which is the
>>> equivalent of zero.  I would like to set all of the missing to 0 (it
>>> defaults to -1 now) when loading in the data.  I suppose I could do
>>> this with a converter, but I have too many columns for this.
>
> OK, I see. Gonna try to find some fix.
>
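
For reference, a minimal sketch of what I'm after. It assumes a NumPy release whose genfromtxt accepts a `filling_values` argument (not available in the version discussed above); the sample data is made up.

```python
import io
import numpy as np

# Hypothetical sample: empty fields stand in for zeros.
sample = "1,,3\n4,5,\n"

# Assumption: this NumPy version's genfromtxt supports `filling_values`,
# which sets the value substituted for fields that are missing.
data = np.genfromtxt(io.StringIO(sample), delimiter=",", filling_values=0)
print(data)
```

With a single scalar, the same fill value is applied to every column, so no per-column converter is needed.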

I actually figured out a workaround with converters, since my missing
values are " ", "  ", "   ", i.e., an irregular number of spaces, and
the values aren't stripped of whitespace.  I just define {# : lambda s:
float(s.strip() or 0)}, where # is the column number, and have a loop
build all of the converters.  I then have to go through and drop the
ones for columns that are supposed to be strings or dates, which is
still pretty tedious: I have a number of datasets like this, but they
all contain different data in different orders, and there's no
(computer) logical order to it that I've discovered yet.
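
Roughly, the converter loop looks like the sketch below. The helper names (`blank_to_zero`, `build_converters`), the sample data, and the `encoding="utf-8"` argument are illustrative assumptions, not what my real script uses.

```python
import io
import numpy as np

def blank_to_zero(field):
    """Convert a numeric field to float; blank or whitespace-only becomes 0."""
    return float(field.strip() or 0)

def build_converters(n_columns, skip=()):
    """Map every column index to blank_to_zero, except the ones in `skip`."""
    skipped = set(skip)
    return {i: blank_to_zero for i in range(n_columns) if i not in skipped}

# Hypothetical data: column 0 is a label, the rest are numeric with blanks.
sample = "a,1, ,3\nb,   ,5,6\n"

# Column 0 holds strings, so it gets no numeric converter.
conv = build_converters(4, skip=[0])
data = np.genfromtxt(io.StringIO(sample), delimiter=",", dtype=None,
                     converters=conv, encoding="utf-8")
print(data)
```

The tedious part is the `skip` list, which still has to be worked out by hand for each dataset.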

>> All of the missing values in the second observation are now -1.  Also,
>> I'm having trouble defining a converter for my dates.
>>
>> I have the function
>>
>> from datetime import datetime
>>
>> def str2date(date):
>>    day,month,year = date.strip().split('/')
>>    return datetime(*map(int, [year, month, day]))
>>
>> conv = {1 : lambda s: str2date(s)}
>> s.seek(0)
>> data = np.genfromtxt(s, dtype=None, delimiter=",", names=None,
>> converters=conv)
>
> OK, I see the problem...
> When no dtype is defined, we try to guess what a converter should
> return by testing its inputs. At first we check whether the input is a
> boolean, then whether it's an integer, then a float, and so on. When
> you define explicitly a converter, there's no need for all those
> checks, so we lock the converter to a particular state, which sets the
> conversion function and the value to return in case of missing.
> Except that I messed it up and it fails in that case (the conversion
> function is set properly, but the dtype of the output is still
> undefined). That's a bug, I'll try to fix that once I've tamed my snow
> kitten.
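
To check my understanding of the guessing order described above (boolean, then integer, then float, and so on), here is a toy sketch in plain Python. It is only an illustration of the idea, not NumPy's actual implementation; `str_to_bool`, `CANDIDATES`, and `guess_converter` are made-up names.

```python
def str_to_bool(text):
    """Accept only 'true' or 'false' (any case); anything else is an error."""
    lowered = text.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise ValueError("not a boolean: %r" % text)

# Candidates in order from most to least restrictive; str accepts anything.
CANDIDATES = [("bool", str_to_bool), ("int", int), ("float", float), ("str", str)]

def guess_converter(values):
    """Return (name, func) for the first candidate that converts every value."""
    for name, func in CANDIDATES:
        try:
            for value in values:
                func(value)
        except ValueError:
            continue  # this candidate failed, try the next, looser one
        return name, func

print(guess_converter(["1", "2.5"])[0])
```

A user-supplied converter skips this search entirely, which is why its output type has to be determined some other way.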

No worries.  I really like genfromtxt (having recently gotten pretty
familiar with it) and would like to help out with extending it towards
these kinds of cases if there's interest and it's feasible.

I tried another workaround for the dates, with my converters defined as conv:

conv.update({date : lambda s : datetime(*map(int,
s.strip().split('/')[-1:]+s.strip().split('/')[:2]))})

where `date` is the index of the column that contains a date.  The
problem is that my dates are "mm/dd/yyyy" and datetime needs its
arguments in (year, month, day) order.  Simply using reversed worked
for a test case where the dates were "dd/mm/yyyy", but inside
genfromtxt the lambda above gave an error about not finding the day in
the third position, even though the same lambda worked for a test case
outside of genfromtxt.
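
For the record, the "mm/dd/yyyy" conversion on its own can be written more plainly than my lambda. This is a sketch using only the standard library; the function names are made up for illustration.

```python
from datetime import datetime

def parse_mdy(field):
    """Parse an 'mm/dd/yyyy' string, ignoring surrounding whitespace."""
    return datetime.strptime(field.strip(), "%m/%d/%Y")

def parse_mdy_manual(field):
    """Same result without strptime: split, then reorder to (year, month, day)."""
    month, day, year = (int(part) for part in field.strip().split("/"))
    return datetime(year, month, day)

print(parse_mdy(" 09/14/2009 "))
```

Reversing the split fields only yields (year, month, day) when the input is "dd/mm/yyyy"; for "mm/dd/yyyy" it yields (year, day, month), which datetime rejects whenever the day is greater than 12.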

> Meanwhile, you can use tsfromtxt (in scikits.timeseries), or even
> simpler, define a dtype for the output (you know that your first
> column is a str, your second an object, and the others ints or floats...)
>

I started to look at the timeseries for this, but I installed it
incorrectly and it gave an error about being compiled with the wrong
endianness.  I've since fixed that and will take another look when I
get a chance.

I also tried the new datetime dtype, but I wasn't sure how to do this
without defining the whole dtype.  I have 500 columns that aren't
homogeneous across several datasets, and each one is pretty huge, so
defining it by hand is tedious, and it takes some time to read the real
data (rather than a test case) just to see that it didn't work
correctly.
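
One way to avoid spelling out the whole dtype might be to generate it, defaulting every column to float and overriding only the exceptions. The sketch below is untested on my real data; `build_dtype`, the field names, the string widths, and `encoding="utf-8"` are assumptions for illustration, and the date column is simply kept as a string here.

```python
import io
import numpy as np

def build_dtype(n_columns, overrides, default=float):
    """Build a structured dtype: `default` everywhere except overridden columns."""
    return np.dtype([("f%d" % i, overrides.get(i, default))
                     for i in range(n_columns)])

# Hypothetical data: a label, a date kept as text, then numeric columns.
sample = "a,09/14/2009,1, \nb,09/15/2009, ,4\n"

# Only the non-numeric columns need to be listed.
dt = build_dtype(4, {0: "U8", 1: "U10"})
data = np.genfromtxt(io.StringIO(sample), delimiter=",", dtype=dt,
                     encoding="utf-8")
print(data.dtype.names)
```

Blank numeric fields come back as nan in this form, so they would still need to be replaced with 0 afterwards.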

Cheers,

Skipper


