[SciPy-User] Status of TimeSeries SciKit
Pierre GM
pgmdevlist at gmail.com
Wed Jul 27 15:16:14 EDT 2011
On Jul 27, 2011, at 8:09 PM, Wes McKinney wrote:
> On Wed, Jul 27, 2011 at 1:54 PM, Matt Knox <mattknox.ca at gmail.com> wrote:
>>
>> Wes McKinney <wesmckinn <at> gmail.com> writes:
>>
>>>> Frequency conversion flexibility:
>>>> - allow you to specify where to place the value - the start or end of the
>>>> period - when converting from lower frequency to higher frequency (eg.
>>>> monthly to daily)
>>>
>>> I'll make sure to make this available as an option. When going
>>> low-to-high you have two interpolation options: forward fill (aka
>>> "pad") and back fill, which I think is what you're saying?
>>>
>>
>> I guess I had a bit of a misunderstanding when I wrote this comment because I
>> was framing things in the context of how I think about the scikits.timeseries
>> module. Monthly frequency dates (or TimeSeries) in the scikit don't have any
>> day information at all. So when converting to daily you need to tell it
>> where to place the value (eg. Jan 1, or Jan 31). Note that this is a SEPARATE
>> decision from wanting to back fill or forward fill.
>>
>> However, since pandas uses regular datetime objects, the day of the month is
>> already embedded in it. A potential drawback of this approach is that to
>> support "start of period" stuff you need to add a separate frequency,
>> effectively doubling the number of frequencies. And if you account for
>> "business day end of month" and "regular day end of month", then you have to
>> quadruple the number of frequencies. You'd have "EOM", "SOM", "BEOM", "BSOM".
>> Similarly for all the quarterly frequencies, annual frequencies, and so on.
>> Whether this is a major problem in practice or not, I don't know.
>
> I see what you mean. I'm going to wait until the dust on the NumPy
> stuff settles and then figure out what to do. Using datetime objects
> is good and bad-- it makes life a lot easier in many ways but some
> things are less clean as a result. Should start documenting all the
> use cases on a wiki somewhere.
That's why we used integers to represent dates. We have rules to convert from integers to datetimes and back.
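
To illustrate the integer idea, here is a minimal sketch in plain Python. It is not the scikits.timeseries implementation: the function names, the "months since year 0" encoding, and the `position` argument are all assumptions made for the example. The point is only that a monthly value carries no day, so the day is chosen at conversion time.

```python
import calendar
import datetime


def month_to_ordinal(year, month):
    """Encode (year, month) as one integer: months elapsed since year 0 (hypothetical encoding)."""
    return year * 12 + (month - 1)


def ordinal_to_month(ordinal):
    """Decode the integer back into a (year, month) pair."""
    year, month0 = divmod(ordinal, 12)
    return year, month0 + 1


def ordinal_to_date(ordinal, position="end"):
    """Place a monthly period on a concrete day: the first or the last day of the month."""
    year, month = ordinal_to_month(ordinal)
    if position == "start":
        day = 1
    elif position == "end":
        day = calendar.monthrange(year, month)[1]  # number of days in that month
    else:
        raise ValueError("position must be 'start' or 'end'")
    return datetime.date(year, month, day)
```

With this encoding, adding 1 to the integer always moves to the next month, and start-of-period versus end-of-period is an argument to the conversion rather than a separate frequency.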
>>
>> I think it is missing quarterly frequencies anchored at the other 9 months of
>> the year. If, for example, you work at a weird Canadian Bank like me, then your
>> fiscal year end is October.
>
> For quarterly you need only anchor on Jan/Feb/March right?
No. You need to be able to define your own quarters. For example, it's fairly common in climatology to define winter as DJF, so your year actually starts on March 1st.
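
A minimal sketch of user-defined quarters, in plain Python. The function name, the `start_month` parameter, and the convention of labelling a custom year by the calendar year in which it begins are assumptions for illustration, not an existing API.

```python
def custom_quarter(year, month, start_month=3):
    """Return (label_year, quarter) for a custom year beginning in start_month.

    Assumption: the custom year is labelled by the calendar year in which it starts.
    """
    if not (1 <= month <= 12 and 1 <= start_month <= 12):
        raise ValueError("month and start_month must be in 1..12")
    offset = (month - start_month) % 12  # months since the custom year began
    quarter = offset // 3 + 1
    # Months before start_month belong to the custom year that began
    # in the previous calendar year.
    label_year = year if month >= start_month else year - 1
    return label_year, quarter
```

With `start_month=3` the quarters are MAM, JJA, SON and DJF, so December 2011 through February 2012 all fall in the fourth quarter of the year that started on March 1st, 2011. A fiscal year ending in October corresponds to `start_month=11`.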
>
>>>> Indexing:
>>>> - slicing with dates (looks like "truncate" method does this, but would
>>>> be nice to be able to just use slicing directly)
>>>
>>> you can use fancy indexing to do this now, e.g:
>>>
>>> ts.ix[d1:d2]
>>>
>>> I could push this down into __getitem__ and __setitem__ too without much work
>>
>> I see. I'd be +1 on pushing it down into __getitem__ and __setitem__
>
> I agree, little harm done. The main annoying detail here is working
> with integer labels. __getitem__ needs to be integer-based when you
> have integers, while using .ix[...] will do label-based always.
Overloading __g/setitem__ isn't always ideal in Python. That was one aspect I tried to push to C but it still needs a lot of work.
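
For reference, the kind of dispatch being discussed can be sketched in pure Python as below. The class `DateIndexedSeries` is hypothetical and is neither the pandas nor the scikits.timeseries implementation. It assumes sorted dates, treats date slices as inclusive on both ends, and keeps plain integers positional, which is exactly where the ambiguity with integer labels comes from.

```python
import bisect
import datetime


class DateIndexedSeries:
    """Toy series: sorted dates plus values (hypothetical, for illustration only)."""

    def __init__(self, dates, values):
        self.dates = list(dates)
        self.values = list(values)
        if self.dates != sorted(self.dates):
            raise ValueError("dates must be sorted")
        if len(self.dates) != len(self.values):
            raise ValueError("dates and values must have the same length")

    def __getitem__(self, key):
        if isinstance(key, slice):
            bounds = (key.start, key.stop)
            if any(isinstance(b, datetime.date) for b in bounds):
                if key.step is not None:
                    raise ValueError("step is not supported for date slices")
                # Label-based slice: both end points are included.
                lo = 0 if key.start is None else bisect.bisect_left(self.dates, key.start)
                hi = len(self.dates) if key.stop is None else bisect.bisect_right(self.dates, key.stop)
                return DateIndexedSeries(self.dates[lo:hi], self.values[lo:hi])
            # Integer (or empty) slice: ordinary positional semantics.
            return DateIndexedSeries(self.dates[key], self.values[key])
        if isinstance(key, datetime.date):
            i = bisect.bisect_left(self.dates, key)
            if i == len(self.dates) or self.dates[i] != key:
                raise KeyError(key)
            return self.values[i]
        return self.values[key]  # plain integer: positional lookup
```

Every lookup pays for the type checks in `__getitem__`, which is one reason doing this dispatch in Python is not always ideal.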
>
>>>> - full missing value support (TimeSeries class is a subclass of MaskedArray)
>>>
>>> I challenge you to find a (realistic) use case where the missing value
>>> support in pandas is inadequate. I'm being completely serious =) But
>>> I've been very vocal about my dislike of MaskedArrays in the missing
>>> data discussions. They're hard for (normal) people to use, degrade
>>> performance, use extra memory, etc. They add a layer of complication
>>> for working with time series that strikes me as completely
>>> unnecessary.
</sigh>
Let's wait a bit and see how missing/ignored values end up being supported, shall we?