[Numpy-discussion] fixing up datetime
Dave Hirschfeld
dave.hirschfeld at gmail.com
Tue Jun 7 17:13:17 EDT 2011
Christopher Barker <Chris.Barker <at> noaa.gov> writes:
>
> Dave Hirschfeld wrote:
> > That would be one way of dealing with irregularly spaced data. I would argue
> > that the example is somewhat back-to-front though. If something happens
> > twice a month it's not occurring at a monthly frequency, but at a higher
> > frequency.
>
> And that frequency is 2/month.
>
> > In this case the lowest frequency which can capture this data is
> > daily frequency so it should be stored at daily frequency
>
> I don't think it should, because it isn't 1/15 days, or, indeed, a
> frequency that can be specified as days. Sure you can specify the 5th
> and 20th of each month in a given time span in terms of days since an
> epoch, but you've lost some information there. When you want to do math
> -- like add a certain number of 1/2 months -- when is the 100th payment due?
> It seems keeping it in M/2 is the most natural way to deal with that
> -- then you don't need special code to do that addition, only when
> converting to a string (or other format) datetime.
With a monthly frequency you can't represent "2/month" except by using a 2D
array, as in my second example. That approach, however, discards the information
about when in the month the payments occurred or are due.
For that matter, I'm not sure where the information that the events occur on the
5th and the 20th is recorded in the dtype [M]//2. Maybe my mental model is too
fixed in the scikits.timeseries mode and I need to try out the new code...
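To make the question concrete, here is a minimal sketch using only the standard
library, not the numpy dtype under discussion. It assumes a half-month counter
starting at January 2011; the helper name half_month_to_date and its parameters
are hypothetical. The point it illustrates is that the counter alone only says
which half of which month, so the 5th and the 20th have to be supplied
separately:

```python
from datetime import date

def half_month_to_date(index, start_year=2011, start_month=1, days=(5, 20)):
    """Map a 0-based half-month counter to a calendar date.

    Hypothetical helper: `days` holds the day of the month used for the
    first and second half respectively. That information is not part of
    the counter itself.
    """
    months, half = divmod(index, 2)
    year_offset, month0 = divmod((start_month - 1) + months, 12)
    return date(start_year + year_offset, month0 + 1, days[half])

fifth_payment = half_month_to_date(4)        # 5th payment (0-based index 4)
hundredth_payment = half_month_to_date(99)   # 100th payment
```

With this representation, adding a number of half months is plain integer
addition on the index; the calendar logic is only needed when converting back
to a date.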
You're right that 2 per month isn't a daily frequency; however, as you noted,
the data can be stored in a daily-frequency array with missing values for each
date where a payment doesn't occur.
The timeseries package uses a masked array to represent the missing data.
However, you aren't required to have data for every day; the array is expanded
to include a datapoint for every period only when you call the
.fill_missing_dates() function.
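The idea of storing only the observed days and expanding on demand can be
sketched in plain Python. This is not the scikits.timeseries implementation:
fill_missing_days is a hypothetical helper, None stands in for a masked value,
and the two values are rounded from the first two payments shown in the output:

```python
from datetime import date, timedelta

def fill_missing_days(sparse):
    """Expand a {date: value} mapping to one (date, value) pair per day.

    Days with no observation get None, standing in for a masked entry.
    """
    first, last = min(sparse), max(sparse)
    n_days = (last - first).days + 1
    all_days = [first + timedelta(days=i) for i in range(n_days)]
    return [(day, sparse.get(day)) for day in all_days]

# Only the days that carry data are stored.
sparse = {date(2011, 1, 5): 103.77, date(2011, 1, 20): 101.30}

# Expanding covers every day from the first to the last observation.
dense = fill_missing_days(sparse)
```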
As shown below, the timeseries is at a daily frequency, but only datapoints for
the 5th and 20th of the month are included:
In [21]: payments
Out[21]:
timeseries([ 103.76588849 101.29566771 91.10363573 101.90578443 102.125889
89.86413807 94.89200485 93.69989375 103.37375202 104.7628273
97.45956699 93.39594431 94.79258639 102.90656477 87.42346985
91.43556069 95.21947628 93.0671271 107.07400065 92.0835356
94.11035154 86.66521318 109.36556861 101.69789341],
dates = [05-Jan-2011 20-Jan-2011 05-Feb-2011 20-Feb-2011 05-Mar-2011 20-Mar-2011
05-Apr-2011 20-Apr-2011 05-May-2011 20-May-2011 05-Jun-2011 20-Jun-2011
05-Jul-2011 20-Jul-2011 05-Aug-2011 20-Aug-2011 05-Sep-2011 20-Sep-2011
05-Oct-2011 20-Oct-2011 05-Nov-2011 20-Nov-2011 05-Dec-2011 20-Dec-2011],
freq = D)
If I want to see when the 5th payment is due I can simply do:
In [26]: payments[4:5]
Out[26]:
timeseries([ 102.12588909],
dates = [05-Mar-2011],
freq = D)
Advancing the payments by a fixed number of days is possible:
In [28]: payments.dates[:] += 3
In [29]: payments.dates
Out[29]:
DateArray([08-Jan-2011, 23-Jan-2011, 08-Feb-2011, 23-Feb-2011, 08-Mar-2011,
23-Mar-2011, 08-Apr-2011, 23-Apr-2011, 08-May-2011, 23-May-2011,
08-Jun-2011, 23-Jun-2011, 08-Jul-2011, 23-Jul-2011, 08-Aug-2011,
23-Aug-2011, 08-Sep-2011, 23-Sep-2011, 08-Oct-2011, 23-Oct-2011,
08-Nov-2011, 23-Nov-2011, 08-Dec-2011, 23-Dec-2011],
freq='D')
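For comparison, the same fixed-day shift can be sketched with the standard
library alone. This is an illustration rather than the scikits.timeseries or
numpy API, and the names payment_dates and shifted are only for this example:

```python
from datetime import date, timedelta

# Payment dates on the 5th and 20th of each month of 2011, as in the example.
payment_dates = [date(2011, month, day)
                 for month in range(1, 13)
                 for day in (5, 20)]

# A fixed offset in days is simple arithmetic on every element.
shifted = [d + timedelta(days=3) for d in payment_dates]
```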
Starting 3 payments in the future is more difficult and would require the date
array to be recreated with the new starting date. One way of doing that would
be:
In [42]: dates = ts.date_array(payments.dates[2], length=(31*payments.size)//2)
In [43]: dates = dates[(dates.day == 8) | (dates.day == 23)][0:payments.size]
In [44]: dates
Out[44]:
DateArray([08-Feb-2011, 23-Feb-2011, 08-Mar-2011, 23-Mar-2011, 08-Apr-2011,
23-Apr-2011, 08-May-2011, 23-May-2011, 08-Jun-2011, 23-Jun-2011,
08-Jul-2011, 23-Jul-2011, 08-Aug-2011, 23-Aug-2011, 08-Sep-2011,
23-Sep-2011, 08-Oct-2011, 23-Oct-2011, 08-Nov-2011, 23-Nov-2011,
08-Dec-2011, 23-Dec-2011, 08-Jan-2012, 23-Jan-2012],
freq='D')
In [45]: dates.shape
Out[45]: (24,)
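The same approach of walking a daily range and keeping only the 8th and 23rd
can be sketched without scikits.timeseries. The helper schedule_from is
hypothetical, and the sketch assumes the schedule days are valid in every month
(true for 8 and 23):

```python
from datetime import date, timedelta

def schedule_from(start, count, days=(8, 23)):
    """Collect `count` dates on or after `start` whose day of month is in `days`."""
    out = []
    current = start
    while len(out) < count:
        if current.day in days:
            out.append(current)
        current += timedelta(days=1)
    return out

# Same starting date and length as the DateArray above.
dates = schedule_from(date(2011, 2, 8), 24)
```

Stepping day by day avoids having to guess the length of the intermediate
daily range up front, at the cost of a Python-level loop.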
A bit messy - I'll have to look at the numpy implementation to see how it
improves the situation...
Regards,
Dave