[Numpy-discussion] fixing up datetime

Tue Jun 7 17:13:17 EDT 2011

Christopher Barker <Chris.Barker <at> noaa.gov> writes:

> 
> Dave Hirschfeld wrote:
> > That would be one way of dealing with irregularly spaced data. I would argue
> > that the example is somewhat back-to-front though. If something happens
> > twice a month it's not occurring at a monthly frequency, but at a higher
> > frequency.
> 
> And that frequency is 2/month.
> 
> > In this case the lowest frequency which can capture this data is
> > daily frequency so it should be stored at daily frequency
> 
> I don't think it should, because it isn't 1/15 days, or, indeed, a
> frequency that can be specified as days. Sure you can specify the 5th
> and 20th of each month in a given time span in terms of days since an 
> epoch, but you've lost some information there. When you want to do math 
> -- like add a certain number of 1/2 months -- when is the 100th payment due?
> It seems keeping it in M/2 is the most natural way to deal with that 
> -- then you don't need special code to do that addition, only when 
> converting to a string (or other format) datetime.

With a monthly frequency you can't represent "2/month" except by using a 2D
array, as in my second example. That does, however, discard the information
about when in the month the payments occurred or are due.

For that matter, I'm not sure where the information that the events occur on
the 5th and the 20th is recorded in the dtype [M]//2. Maybe my mental model is
too fixed in the scikits.timeseries mode and I need to try out the new code...
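
To make the "when is the 100th payment due?" question concrete, here is a
minimal sketch in plain Python (standard library only; it is neither the numpy
datetime dtype nor scikits.timeseries). The helper name nth_payment_date and
its days argument are hypothetical. The sketch counts in half-month steps, and
it shows that the days of the month have to be stored somewhere next to the
integer count:

```python
from datetime import date

def nth_payment_date(n, first_year, first_month, days=(5, 20)):
    """Return the date of the n-th payment (1-based) on a semi-monthly schedule.

    Hypothetical helper: `days` holds the days of the month on which
    payments fall, which a bare half-month count does not record.
    """
    per_month = len(days)
    # Split the zero-based payment count into whole months and a slot in the month.
    months, slot = divmod(n - 1, per_month)
    # Work in months since year 0 so that the year rolls over correctly.
    total_months = first_year * 12 + (first_month - 1) + months
    year, month0 = divmod(total_months, 12)
    return date(year, month0 + 1, days[slot])

print(nth_payment_date(5, 2011, 1))    # 2011-03-05
print(nth_payment_date(100, 2011, 1))  # 2015-02-20
```

The addition itself is plain integer arithmetic; the calendar only enters when
the count is converted to a date.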

You're right that 2 per month isn't a daily frequency. However, as you noted,
the data can be stored in a daily-frequency array with missing values for each
date where a payment doesn't occur.

The timeseries package uses a masked array to represent the missing data.
However, you aren't required to have data for every day: only when you call the
.fill_missing_dates() function is the array expanded to include a datapoint for
every period.
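
The difference between the sparse and the filled form can be sketched with the
standard library alone. This is only an illustration of the idea, not the
scikits.timeseries implementation: fill_missing is a hypothetical stand-in, None
plays the role of a masked value, and the three values are rounded from the
session output.

```python
from datetime import date, timedelta

# Sparse daily series: only the days that actually have a payment are stored.
payments = {
    date(2011, 1, 5): 103.77,
    date(2011, 1, 20): 101.30,
    date(2011, 2, 5): 91.10,
}

def fill_missing(series):
    """Expand a sparse daily series so every day in its span has an entry.

    Days without data map to None, standing in for a masked value.
    """
    start, end = min(series), max(series)
    n_days = (end - start).days + 1
    return {
        start + timedelta(days=i): series.get(start + timedelta(days=i))
        for i in range(n_days)
    }

dense = fill_missing(payments)
print(len(payments), len(dense))  # 3 32
```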

As shown below, the timeseries is at a daily frequency, but only datapoints for
the 5th and 20th of the month are included:

In [21]: payments
Out[21]:
timeseries([ 103.76588849  101.29566771   91.10363573  101.90578443  102.12588909
   89.86413807   94.89200485   93.69989375  103.37375202  104.7628273
   97.45956699   93.39594431   94.79258639  102.90656477   87.42346985
   91.43556069   95.21947628   93.0671271   107.07400065   92.0835356
   94.11035154   86.66521318  109.36556861  101.69789341],
   dates = [05-Jan-2011 20-Jan-2011 05-Feb-2011 20-Feb-2011 05-Mar-2011 20-Mar-2011
 05-Apr-2011 20-Apr-2011 05-May-2011 20-May-2011 05-Jun-2011 20-Jun-2011
 05-Jul-2011 20-Jul-2011 05-Aug-2011 20-Aug-2011 05-Sep-2011 20-Sep-2011
 05-Oct-2011 20-Oct-2011 05-Nov-2011 20-Nov-2011 05-Dec-2011 20-Dec-2011],
   freq  = D)

If I want to see when the 5th payment is due I can simply do:

In [26]: payments[4:5]
Out[26]:
timeseries([ 102.12588909],
   dates = [05-Mar-2011],
   freq  = D)

Advancing the payments by a fixed number of days is possible:

In [28]: payments.dates[:] += 3

In [29]: payments.dates
Out[29]:
DateArray([08-Jan-2011, 23-Jan-2011, 08-Feb-2011, 23-Feb-2011, 08-Mar-2011,
       23-Mar-2011, 08-Apr-2011, 23-Apr-2011, 08-May-2011, 23-May-2011,
       08-Jun-2011, 23-Jun-2011, 08-Jul-2011, 23-Jul-2011, 08-Aug-2011,
       23-Aug-2011, 08-Sep-2011, 23-Sep-2011, 08-Oct-2011, 23-Oct-2011,
       08-Nov-2011, 23-Nov-2011, 08-Dec-2011, 23-Dec-2011],
          freq='D')
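
For comparison, the same fixed-offset shift sketched with the standard library's
datetime module rather than a DateArray (only the first four dates are used, to
keep it short):

```python
from datetime import date, timedelta

dates = [date(2011, 1, 5), date(2011, 1, 20), date(2011, 2, 5), date(2011, 2, 20)]

# A fixed number of days is a uniform offset, so every date moves identically.
shifted = [d + timedelta(days=3) for d in dates]

print([d.isoformat() for d in shifted])
# ['2011-01-08', '2011-01-23', '2011-02-08', '2011-02-23']
```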

Starting the schedule at the third payment date instead is more difficult and
would require the date array to be recreated with the new starting date. One way
of doing that would be:

In [42]: dates = ts.date_array(payments.dates[2], length=(31*payments.size)//2)

In [43]: dates = dates[(dates.day == 8) | (dates.day == 23)][0:payments.size]

In [44]: dates
Out[44]:
DateArray([08-Feb-2011, 23-Feb-2011, 08-Mar-2011, 23-Mar-2011, 08-Apr-2011,
       23-Apr-2011, 08-May-2011, 23-May-2011, 08-Jun-2011, 23-Jun-2011,
       08-Jul-2011, 23-Jul-2011, 08-Aug-2011, 23-Aug-2011, 08-Sep-2011,
       23-Sep-2011, 08-Oct-2011, 23-Oct-2011, 08-Nov-2011, 23-Nov-2011,
       08-Dec-2011, 23-Dec-2011, 08-Jan-2012, 23-Jan-2012],
          freq='D')

In [45]: dates.shape
Out[45]: (24,)
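
One possible alternative to building a long daily array and filtering it is to
step through the months directly. This is a standard-library sketch, not
scikits.timeseries code; semi_monthly_dates is a hypothetical helper and it
assumes the chosen days exist in every month.

```python
from datetime import date

def semi_monthly_dates(start, count, days=(8, 23)):
    """Return `count` dates falling on `days` of each month, from `start` on.

    Hypothetical helper; assumes `days` is sorted and that every entry
    exists in every month (i.e. is at most 28).
    """
    out = []
    year, month = start.year, start.month
    while len(out) < count:
        for day in days:
            candidate = date(year, month, day)
            # Skip slots before the start date and stop once enough are collected.
            if candidate >= start and len(out) < count:
                out.append(candidate)
        # Advance to the next month, rolling over the year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return out

dates = semi_monthly_dates(date(2011, 2, 8), 24)
print(dates[0], dates[-1], len(dates))  # 2011-02-08 2012-01-23 24
```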

A bit messy - I'll have to look at the numpy implementation to see how it
improves the situation...

Regards,
Dave