[SciPy-User] Status of TimeSeries SciKit
Wes McKinney
wesmckinn at gmail.com
Wed Jul 27 14:09:07 EDT 2011
On Wed, Jul 27, 2011 at 1:54 PM, Matt Knox <mattknox.ca at gmail.com> wrote:
>
> Wes McKinney <wesmckinn <at> gmail.com> writes:
>
>> > Frequency conversion flexibility:
>> > - allow you to specify where to place the value - the start or end of the
>> > period - when converting from lower frequency to higher frequency (eg.
>> > monthly to daily)
>>
>> I'll make sure to make this available as an option. When going
>> low-to-high you have two interpolation options: forward fill (aka
>> "pad") and back fill, which I think is what you're saying?
>>
>
> I guess I had a bit of a misunderstanding when I wrote this comment because I
> was framing things in the context of how I think about the scikits.timeseries
> module. Monthly frequency dates (or TimeSeries) in the scikit don't have any
> day information at all. So when converting to daily you need to tell it
> where to place the value (eg. Jan 1, or Jan 31). Note that this is a SEPARATE
> decision from wanting to back fill or forward fill.
>
> However, since pandas uses regular datetime objects, the day of the month is
> already embedded in it. A potential drawback of this approach is that to
> support "start of period" stuff you need to add a separate frequency,
> effectively doubling the number of frequencies. And if you account for
> "business day end of month" and "regular day end of month", then you have to
> quadruple the number of frequencies. You'd have "EOM", "SOM", "BEOM", "BSOM".
> Similarly for all the quarterly frequencies, annual frequencies, and so on.
> Whether this is a major problem in practice or not, I don't know.
I see what you mean. I'm going to wait until the dust on the NumPy
stuff settles and then figure out what to do. Using datetime objects
is good and bad: it makes life a lot easier in many ways, but some
things are less clean as a result. I should start documenting all the
use cases on a wiki somewhere.
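For the low-to-high conversion itself, here's roughly what I have in
mind (a sketch only -- I don't remember offhand whether the fill
keyword is spelled method or fillMethod in the released version, so
treat the details as illustrative):

import numpy as np
from pandas import Series, DateRange, datetools

# three monthly observations, stamped at month end since pandas
# indexes carry full datetimes
monthly = Series(np.arange(3.),
                 index=DateRange('1/31/2000', periods=3,
                                 offset=datetools.MonthEnd()))

# upsample to business daily: 'pad' carries each value forward,
# 'backfill' would pull the next observation backward instead
daily = monthly.asfreq(datetools.BDay(), method='pad')

The start-of-period vs. end-of-period placement you describe is the
piece this doesn't address yet.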
>> > - support of a larger number of frequencies
>>
>> Which ones are you thinking of? Currently I have:
>>
>> - hourly, minutely, secondly (and things like 5-minutely can be done,
>> e.g. Minute(5))
>> - daily / business daily
>> - weekly (anchored on a particular weekday)
>> - monthly / business month-end
>> - (business) quarterly, anchored on Jan/Feb/March
>> - annual / business annual (start and end)
>
> I think it is missing quarterly frequencies anchored at the other 9 months of
> the year. If, for example, you work at a weird Canadian Bank like me, then your
> fiscal year end is October.
For quarterly you need only anchor on Jan/Feb/March, right? Since
quarters repeat every three months, startingMonth=1 already gives you
quarter-ends in Jan/Apr/Jul/Oct, which covers an October fiscal year
end:
In [76]: list(DateRange('1/1/2000', '1/1/2002',
                        offset=datetools.BQuarterEnd(startingMonth=1)))
Out[76]:
[datetime.datetime(2000, 1, 31, 0, 0),
datetime.datetime(2000, 4, 28, 0, 0),
datetime.datetime(2000, 7, 31, 0, 0),
datetime.datetime(2000, 10, 31, 0, 0),
datetime.datetime(2001, 1, 31, 0, 0),
datetime.datetime(2001, 4, 30, 0, 0),
datetime.datetime(2001, 7, 31, 0, 0),
datetime.datetime(2001, 10, 31, 0, 0)]
(I know, I'm trying to get rid of the camel casing floating around...)
> Other than that, it has all the frequencies I care about. Semi-annual would be
> a nice touch, but not that important to me and timeseries module doesn't have
> it either. People have also asked for higher frequencies in the timeseries
> module before (eg. millisecond), but that is not something I personally care
> about.
numpy.datetime64 will help here. I've a mind to start playing with TAQ
(US equity tick data) in the near future in which case my requirements
will change.
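For reference, this is the sort of thing datetime64 should enable
(sketch only -- the datetime64 API is still being reworked, so the
exact spelling here is approximate):

import numpy as np

# a 64-bit timestamp with millisecond resolution; the unit code
# ('ms' here) selects the resolution of the underlying integer
t = np.datetime64('2011-07-27T14:09:07.123', 'ms')

# vectorized: an array of tick timestamps at 8 bytes per element
ticks = np.array(['2011-07-27T09:30:00.001',
                  '2011-07-27T09:30:00.005'], dtype='datetime64[ms]')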
>> > Indexing:
>> > - slicing with dates (looks like "truncate" method does this, but would
>> > be nice to be able to just use slicing directly)
>>
>> you can use fancy indexing to do this now, e.g:
>>
>> ts.ix[d1:d2]
>>
>> I could push this down into __getitem__ and __setitem__ too without much work
>
> I see. I'd be +1 on pushing it down into __getitem__ and __setitem__
I agree, little harm done. The main annoying detail here is working
with integer labels: __getitem__ needs to stay integer-based when the
index contains integers, while .ix[...] is always label-based.
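To make the distinction concrete (a sketch using the API above; the
dates are arbitrary):

import numpy as np
from datetime import datetime
from pandas import Series, DateRange

ts = Series(np.arange(10.), index=DateRange('1/3/2000', periods=10))

# label-based slicing via .ix -- both endpoints are included
chunk = ts.ix[datetime(2000, 1, 4):datetime(2000, 1, 10)]

# plain __getitem__ with integers stays positional, which is why
# pushing date slicing into it needs care when the index itself
# holds integers
head = ts[:5]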
>> > - full missing value support (TimeSeries class is a subclass of MaskedArray)
>>
>> I challenge you to find a (realistic) use case where the missing value
>> support in pandas is inadequate. I'm being completely serious =) But
>> I've been very vocal about my dislike of MaskedArrays in the missing
>> data discussions. They're hard for (normal) people to use, degrade
>> performance, use extra memory, etc. They add a layer of complication
>> for working with time series that strikes me as completely
>> unnecessary.
>
> From my understanding, pandas just uses NaNs for missing values. So that means
> strings, ints, or anything besides floats are not supported. That is my major
> issue with it. I agree that masked arrays are overly complicated and not
> ideal. Hopefully the improved missing value support in numpy will provide the
> best of both worlds.
It's admittedly a kludge, but I use NaN as the universal missing-data
marker for lack of a better alternative (basically I'm trying to
emulate R as much as I can). So you can literally have:
In [93]: df2
Out[93]:
   A    B      C        D         E
0  foo  one    -0.7883  0.7743    False
1  NaN  one    -0.5866  0.06009   False
2  foo  two    0.9312   1.2       True
3  NaN  three  -0.6417  0.3444    False
4  foo  two    -0.8841  -0.08126  False
5  bar  two    1.194    -0.7933   True
6  foo  one    -1.624   -0.1403   NaN
7  foo  three  0.5046   0.5833    True
To cope with this, there are functions isnull and notnull which work
on every dtype and can recognize NaNs in non-floating-point arrays:
In [96]: df2[notnull(df2['A'])]
Out[96]:
   A    B      C        D         E
0  foo  one    -0.7883  0.7743    False
2  foo  two    0.9312   1.2       True
4  foo  two    -0.8841  -0.08126  False
5  bar  two    1.194    -0.7933   True
6  foo  one    -1.624   -0.1403   NaN
7  foo  three  0.5046   0.5833    True
In [98]: df2['A'].fillna('missing')
Out[98]:
0    foo
1    missing
2    foo
3    missing
4    foo
5    bar
6    foo
7    foo
Trying to index with a "boolean" array containing NAs gives a
somewhat helpful error message:
In [101]: df2[df2['E']]
ValueError: cannot index with vector containing NA / NaN values
but
In [102]: df2[df2['E'].fillna(False)]
Out[102]:
   A    B      C       D        E
2  foo  two    0.9312  1.2      True
5  bar  two    1.194   -0.7933  True
7  foo  three  0.5046  0.5833   True
Really crossing my fingers for favorable NA support in NumPy.
> - Matt