[SciPy-User] Status of TimeSeries SciKit

Wed Jul 27 15:16:14 EDT 2011

On Jul 27, 2011, at 8:09 PM, Wes McKinney wrote:

> On Wed, Jul 27, 2011 at 1:54 PM, Matt Knox <mattknox.ca at gmail.com> wrote:
>> 
>> Wes McKinney <wesmckinn <at> gmail.com> writes:
>> 
>>>> Frequency conversion flexibility:
>>>>    - allow you to specify where to place the value - the start or end of the
>>>>      period - when converting from lower frequency to higher frequency (eg.
>>>>      monthly to daily)
>>> 
>>> I'll make sure to make this available as an option. When going
>>> low-to-high you have two interpolation options: forward fill (aka
>>> "pad") and back fill, which I think is what you're saying?
>>> 
>> 
>> I guess I had a bit of a misunderstanding when I wrote this comment because I
>> was framing things in the context of how I think about the scikits.timeseries
>> module. Monthly frequency dates (or TimeSeries) in the scikit don't have any
>> day information at all. So when converting to daily you need to tell it
>> where to place the value (eg. Jan 1, or Jan 31). Note that this is a SEPARATE
>> decision from wanting to back fill or forward fill.
>> 
>> However, since pandas uses regular datetime objects, the day of the month is
>> already embedded in it. A potential drawback of this approach is that to
>> support "start of period" stuff you need to add a separate frequency,
>> effectively doubling the number of frequencies. And if you account for
>> "business day end of month" and "regular day end of month", then you have to
>> quadruple the number of frequencies. You'd have "EOM", "SOM", "BEOM", "BSOM".
>> Similarly for all the quarterly frequencies, annual frequencies, and so on.
>> Whether this is a major problem in practice or not, I don't know.
> 
> I see what you mean. I'm going to wait until the dust on the NumPy
> stuff settles and then figure out what to do. Using datetime objects
> is good and bad-- it makes life a lot easier in many ways but some
> things are less clean as a result. Should start documenting all the
> use cases on a wiki somewhere.

That's why we used integers to represent dates. We have rules to convert from integers to date times and back.
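
To illustrate the idea (a minimal sketch only, not the scikit's actual encoding): assume a monthly date is stored as a count of months, so it carries no day at all. Converting it to a calendar date then requires an explicit choice of start or end of period. The function names and the `position` argument below are hypothetical.

```python
import calendar
from datetime import date


def month_to_int(year, month):
    """Encode (year, month) as one integer: months counted from year 0."""
    return year * 12 + (month - 1)


def int_to_month(value):
    """Decode the integer back into a (year, month) pair."""
    year, month0 = divmod(value, 12)
    return year, month0 + 1


def month_to_day(value, position="end"):
    """Convert a monthly integer to a calendar date.

    The monthly value has no day information, so the caller decides
    whether it lands on the first or the last day of the month.
    """
    year, month = int_to_month(value)
    if position == "start":
        return date(year, month, 1)
    if position == "end":
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, last_day)
    raise ValueError("position must be 'start' or 'end'")


jan_2011 = month_to_int(2011, 1)
print(month_to_day(jan_2011, "start"))  # 2011-01-01
print(month_to_day(jan_2011, "end"))    # 2011-01-31
print(int_to_month(jan_2011 + 1))       # (2011, 2)
```

With this kind of representation, one monthly frequency is enough: start versus end of period is an argument to the conversion rather than a separate frequency.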

>> 
>> I think it is missing quarterly frequencies anchored at the other 9 months of
>> the year. If, for example, you work at a weird Canadian Bank like me, then your
>> fiscal year end is October.
> 
> For quarterly you need only anchor on Jan/Feb/March right?

No. You need to be able to define your own quarters. For example, it's fairly common in climatology to define a winter as DJF, so your year actually starts on March 1st.
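
A rough sketch of what user-defined quarters could look like, assuming the only parameter is the month in which the year begins. The function name `custom_quarter` and the convention of labelling a year by the calendar year in which it starts are assumptions made for this example.

```python
from datetime import date


def custom_quarter(d, first_month):
    """Return (year_start, quarter) for date d in a year beginning in first_month.

    year_start is the calendar year in which the custom year begins;
    that labelling is a convention chosen for this sketch.
    """
    if not 1 <= first_month <= 12:
        raise ValueError("first_month must be in 1..12")
    # Months elapsed since the custom year began (0..11).
    offset = (d.month - first_month) % 12
    year_start = d.year if d.month >= first_month else d.year - 1
    return year_start, offset // 3 + 1


# Year starting March 1st: MAM, JJA, SON, then DJF as the fourth quarter.
print(custom_quarter(date(2010, 12, 1), first_month=3))   # (2010, 4)
print(custom_quarter(date(2011, 2, 28), first_month=3))   # (2010, 4)
print(custom_quarter(date(2011, 3, 1), first_month=3))    # (2011, 1)
```

The same function covers a fiscal year ending in October by passing `first_month=11`, so any of the twelve possible anchors is available.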

> 
>>>> Indexing:
>>>>    - slicing with dates (looks like "truncate" method does this, but would
>>>>      be nice to be able to just use slicing directly)
>>> 
>>> you can use fancy indexing to do this now, e.g:
>>> 
>>> ts.ix[d1:d2]
>>> 
>>> I could push this down into __getitem__ and __setitem__ too without much work
>> 
>> I see. I'd be +1 on pushing it down into __getitem__ and __setitem__
> 
> I agree, little harm done. The main annoying detail here is working
> with integer labels. __getitem__ needs to be integer-based when you
> have integers, while using .ix[...] will do label-based always.

Overloading __g/setitem__ isn't always ideal in Python. That was one aspect I tried to push to C but it still needs a lot of work.
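
For reference, here is a toy sketch of the dispatch being discussed, in pure Python. It is neither the pandas nor the scikit implementation. The class `TinySeries` is hypothetical; it assumes the dates are sorted and treats date slices as inclusive at both ends.

```python
from bisect import bisect_left, bisect_right
from datetime import date


class TinySeries:
    """Toy date-indexed series; assumes `dates` is sorted ascending."""

    def __init__(self, dates, values):
        self.dates = list(dates)
        self.values = list(values)

    def __getitem__(self, key):
        if isinstance(key, slice):
            if isinstance(key.start, date) or isinstance(key.stop, date):
                # Label-based slice: both endpoints are included.
                lo = 0 if key.start is None else bisect_left(self.dates, key.start)
                hi = (len(self.dates) if key.stop is None
                      else bisect_right(self.dates, key.stop))
                return TinySeries(self.dates[lo:hi], self.values[lo:hi])
            # Plain integer slice: positional, end excluded as usual.
            return TinySeries(self.dates[key], self.values[key])
        if isinstance(key, date):
            # Single label lookup.
            i = bisect_left(self.dates, key)
            if i == len(self.dates) or self.dates[i] != key:
                raise KeyError(key)
            return self.values[i]
        # Anything else is treated as an integer position.
        return self.values[key]


ts = TinySeries([date(2011, 7, d) for d in range(1, 6)], [10, 20, 30, 40, 50])
print(ts[date(2011, 7, 2):date(2011, 7, 4)].values)  # [20, 30, 40]
print(ts[1:3].values)                                # [20, 30]
```

The type check only works here because the labels are dates. If the labels were themselves integers, `__getitem__` could not tell a position from a label, which is the annoying detail mentioned above.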

> 
>>>> - full missing value support (TimeSeries class is a subclass of MaskedArray)
>>> 
>>> I challenge you to find a (realistic) use case where the missing value
>>> support in pandas is inadequate. I'm being completely serious =) But
>>> I've been very vocal about my dislike of MaskedArrays in the missing
>>> data discussions. They're hard for (normal) people to use, degrade
>>> performance, use extra memory, etc. They add a layer of complication
>>> for working with time series that strikes me as completely
>>> unnecessary.

</sigh>
Let's wait a bit and see how missing/ignored values end up being supported, shall we?