[SciPy-User] Status of TimeSeries SciKit

Wes McKinney wesmckinn at gmail.com
Wed Jul 27 14:09:07 EDT 2011


On Wed, Jul 27, 2011 at 1:54 PM, Matt Knox <mattknox.ca at gmail.com> wrote:
>
> Wes McKinney <wesmckinn <at> gmail.com> writes:
>
>> > Frequency conversion flexibility:
>> >    - allow you to specify where to place the value - the start or end of the
>> >      period - when converting from lower frequency to higher frequency (eg.
>> >      monthly to daily)
>>
>> I'll make sure to make this available as an option. When going
>> low-to-high you have two interpolation options: forward fill (aka
>> "pad") and back fill, which I think is what you're saying?
>>
>
> I guess I had a bit of a misunderstanding when I wrote this comment because I
> was framing things in the context of how I think about the scikits.timeseries
> module. Monthly frequency dates (or TimeSeries) in the scikit don't have any
> day information at all. So when converting to daily you need to tell it
> where to place the value (eg. Jan 1, or Jan 31). Note that this is a SEPARATE
> decision from wanting to back fill or forward fill.
>
> However, since pandas uses regular datetime objects, the day of the month is
> already embedded in it. A potential drawback of this approach is that to
> support "start of period" stuff you need to add a separate frequency,
> effectively doubling the number of frequencies. And if you account for
> "business day end of month" and "regular day end of month", then you have to
> quadruple the number of frequencies. You'd have "EOM", "SOM", "BEOM", "BSOM".
> Similarly for all the quarterly frequencies, annual frequencies, and so on.
> Whether this is a major problem in practice or not, I don't know.

I see what you mean. I'm going to wait until the dust settles on the
NumPy side and then figure out what to do. Using datetime objects is
good and bad-- it makes life a lot easier in many ways, but some
things are less clean as a result. I should start documenting all the
use cases on a wiki somewhere.
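
To make the two separate decisions concrete, here is a rough sketch
(period_range / to_timestamp / reindex are illustrative spellings here,
not a claim about what the released API looks like):

import pandas as pd

# monthly values with no intrinsic day-of-month information
monthly = pd.Series([1.0, 2.0, 3.0],
                    index=pd.period_range('2000-01', periods=3, freq='M'))

# decision 1: where each value lands when converting to actual dates
at_start = monthly.to_timestamp(how='start')  # 2000-01-01, 2000-02-01, ...
at_end = monthly.to_timestamp(how='end')      # 2000-01-31, 2000-02-29, ...

# decision 2 (independent of the first): how to fill the days in between
days = pd.date_range('2000-01-01', '2000-03-01', freq='D')
padded = at_start.reindex(days, method='pad')        # forward fill
backfilled = at_start.reindex(days, method='bfill')  # back fill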

>> >    - support for a larger number of frequencies
>>
>> Which ones are you thinking of? Currently I have:
>>
>> - hourly, minutely, secondly (and things like 5-minutely can be done,
>> e.g. Minute(5))
>> - daily / business daily
>> - weekly (anchored on a particular weekday)
>> - monthly / business month-end
>> - (business) quarterly, anchored on jan/feb/march
>> - annual / business annual (start and end)
>
> I think it is missing quarterly frequencies anchored at the other 9 months of
> the year. If, for example, you work at a weird Canadian Bank like me, then your
> fiscal year end is October.

For quarterly you only need to anchor on Jan/Feb/March, right?

In [76]: list(DateRange('1/1/2000', '1/1/2002',
   ....:                 offset=datetools.BQuarterEnd(startingMonth=1)))
Out[76]:
[datetime.datetime(2000, 1, 31, 0, 0),
 datetime.datetime(2000, 4, 28, 0, 0),
 datetime.datetime(2000, 7, 31, 0, 0),
 datetime.datetime(2000, 10, 31, 0, 0),
 datetime.datetime(2001, 1, 31, 0, 0),
 datetime.datetime(2001, 4, 30, 0, 0),
 datetime.datetime(2001, 7, 31, 0, 0),
 datetime.datetime(2001, 10, 31, 0, 0)]

(I know, I'm trying to get rid of the camel casing floating around...)
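
The finer frequencies in the list above follow the same offset pattern; a
5-minutely range, for example, would look something like this (a sketch only;
I'm assuming Minute combines with DateRange the same way BQuarterEnd does):

# every 5 minutes between 9:30 and 10:00 on Jan 1 (illustrative)
rng = DateRange('1/1/2000 09:30', '1/1/2000 10:00',
                offset=datetools.Minute(5))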

> Other than that, it has all the frequencies I care about. Semi-annual would be
> a nice touch, but not that important to me, and the timeseries module doesn't
> have it either. People have also asked for higher frequencies in the timeseries
> module before (e.g. millisecond), but that is not something I personally care
> about.

numpy.datetime64 will help here. I've a mind to start playing with TAQ
(US equity tick data) in the near future, in which case my requirements
will change.

>> > Indexing:
>> >    - slicing with dates (looks like the "truncate" method does this, but
>> >      it would be nice to be able to just use slicing directly)
>>
>> You can use fancy indexing to do this now, e.g.:
>>
>> ts.ix[d1:d2]
>>
>> I could push this down into __getitem__ and __setitem__ too without much work
>
> I see. I'd be +1 on pushing it down into __getitem__ and __setitem__

I agree, little harm done. The main annoying detail here is working
with integer labels: __getitem__ needs to be integer-based when the
labels are integers, while .ix[...] will always be label-based.
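
Spelled out, the integer-label wrinkle looks like this (a sketch; loc/iloc
here are just illustrative names for "by label" vs "by position" access,
not necessarily what's in the library):

import pandas as pd

s = pd.Series([10.0, 20.0, 30.0], index=[5, 1, 3])

# With integer labels, plain s[5] is ambiguous: does it mean label 5
# (-> 10.0) or position 5 (out of range)?  Spelling out the intent
# removes the ambiguity:
s.loc[5]   # by label    -> 10.0
s.iloc[0]  # by position -> 10.0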

>> > - full missing value support (TimeSeries class is a subclass of MaskedArray)
>>
>> I challenge you to find a (realistic) use case where the missing value
>> support in pandas is inadequate. I'm being completely serious =) But
>> I've been very vocal about my dislike of MaskedArrays in the missing
>> data discussions. They're hard for (normal) people to use, degrade
>> performance, use extra memory, etc. They add a layer of complication
>> for working with time series that strikes me as completely
>> unnecessary.
>
> From my understanding, pandas just uses NaNs for missing values, which means
> strings, ints, or anything besides floats are not supported. That is my major
> issue with it. I agree that masked arrays are overly complicated and not
> ideal. Hopefully the improved missing value support in numpy will provide the
> best of both worlds.

It's admittedly a kludge, but I use NaN as the universal missing-data
marker for lack of a better alternative (basically I'm trying to
emulate R as much as I can). So you can literally have:

In [93]: df2
Out[93]:
    A     B       C        D         E
0   foo   one    -0.7883   0.7743    False
1   NaN   one    -0.5866   0.06009   False
2   foo   two     0.9312   1.2       True
3   NaN   three  -0.6417   0.3444    False
4   foo   two    -0.8841  -0.08126   False
5   bar   two     1.194   -0.7933    True
6   foo   one    -1.624   -0.1403    NaN
7   foo   three   0.5046   0.5833    True

To cope with this, there are the functions isnull and notnull, which work
on every dtype and can recognize NaNs in non-floating-point arrays:

In [96]: df2[notnull(df2['A'])]
Out[96]:
    A     B       C        D         E
0   foo   one    -0.7883   0.7743    False
2   foo   two     0.9312   1.2       True
4   foo   two    -0.8841  -0.08126   False
5   bar   two     1.194   -0.7933    True
6   foo   one    -1.624   -0.1403    NaN
7   foo   three   0.5046   0.5833    True

In [98]: df2['A'].fillna('missing')
Out[98]:
0    foo
1    missing
2    foo
3    missing
4    foo
5    bar
6    foo
7    foo

Trying to index with a "boolean" array containing NAs gives a slightly
helpful error message:

In [101]: df2[df2['E']]
ValueError: cannot index with vector containing NA / NaN values

but this works:

In [102]: df2[df2['E'].fillna(False)]
Out[102]:
    A     B       C        D        E
2   foo   two     0.9312   1.2      True
5   bar   two     1.194   -0.7933   True
7   foo   three   0.5046   0.5833   True

Really crossing my fingers for favorable NA support in NumPy.

> - Matt


