NumPy date/time types and the resolution concept
Hi,

Before giving more thought to the new proposal of the date/time types for NumPy based on the concept of 'resolution', I'd like to gather more feedback on your opinions about this.

After pondering the opinions about the first proposal, the idea we are incubating is to complement the ``datetime64`` with a 'resolution' metainfo. The ``datetime64`` will still be based on an int64 type, but the meaning of the 'ticks' would depend on a 'resolution' property. This is best seen with an example:

    In [21]: numpy.arange(3, dtype=numpy.dtype('datetime64', 'sec'))
    Out [21]: [1970-01-01T00:00:00, 1970-01-01T00:00:01, 1970-01-01T00:00:02]

    In [22]: numpy.arange(3, dtype=numpy.dtype('datetime64', 'hour'))
    Out [22]: [1970-01-01T00, 1970-01-01T01, 1970-01-01T02]

i.e. the 'resolution' gives the actual meaning to the 'int64' counter.

The advantage of this abstraction is that the user can easily choose the scale of resolution that better fits his need. I'm thinking of providing the following resolutions:

    ["femtosec", "picosec", "nanosec", "microsec", "millisec", "sec", "min", "hour", "month", "year"]

Also, together with the absolute ``datetime64`` one can have a relative counterpart, say, ``timedelta64`` that also supports the notion of 'resolution'. Between both one would cover the needs for most uses, while providing the user with a lot of flexibility, IMO. We very much prefer this new approach to the one stated in our first proposal.

Now comes the tricky part: how to integrate the notion of 'resolution' with the 'dtype' data type factory of NumPy? Well, we have thought of a couple of possibilities.
1) Using the NumPy 'dtype' factory:

    nanoabs = numpy.dtype('datetime64', resolution="nanosec")
    nanorel = numpy.dtype('timedelta64', resolution="nanosec")

2) Extending the string notation by using the '[]' square brackets:

    nanoabs = numpy.dtype('datetime64[nanosec]')   # long notation
    nanoabs = numpy.dtype('T[nanosec]')            # short notation
    nanorel = numpy.dtype('timedelta64[nanosec]')  # long notation
    nanorel = numpy.dtype('t[nanosec]')            # short notation

With these building blocks, one may obtain more complex dtype structures easily.

Now, the question is: would that proposal conflict with the spirit of the current 'dtype' factory? And another important one: would that complicate the implementation too much?

If the answer to both previous questions is 'no', then we will study this more and provide another proposal based on this.

BTW, I suppose that the best candidate to answer these would be Travis O., but if anybody feels brave enough ;-) please go ahead and give your advice.

Cheers,
-- Francesc Alted
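As a rough illustration of the 'resolution' idea in the message above, the following sketch interprets plain integer ticks as timestamps using only the Python standard library. It is not the proposed NumPy implementation: the names ``SECONDS_PER_TICK``, ``EPOCH`` and ``ticks_to_datetimes`` are hypothetical, and the 1970-01-01 origin is an assumption taken from the example output.

```python
from datetime import datetime, timedelta

# Hypothetical table: seconds per tick for three of the fixed-length
# resolutions named in the proposal.
SECONDS_PER_TICK = {"sec": 1, "min": 60, "hour": 3600}

EPOCH = datetime(1970, 1, 1)  # assumed origin, matching the example output


def ticks_to_datetimes(ticks, resolution):
    """Interpret integer ticks as timestamps at the given resolution."""
    step = timedelta(seconds=SECONDS_PER_TICK[resolution])
    return [EPOCH + n * step for n in ticks]


# The same three integers mean different instants under each resolution.
print([d.isoformat() for d in ticks_to_datetimes(range(3), "sec")])
print([d.strftime("%Y-%m-%dT%H") for d in ticks_to_datetimes(range(3), "hour")])
```

The stored integers are identical in both calls; only the metadata that gives them meaning changes.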
On Monday 14 July 2008 09:07:47 Francesc Alted wrote:
The advantage of this abstraction is that the user can easily choose the scale of resolution that better fits his need. I'm thinking in providing the next resolutions:
["femtosec", "picosec", "nanosec", "microsec", "millisec", "sec", "min", "hour", "month", "year"]
In TimeSeries, we don't have anything less than a second, but we have 'daily', 'business daily', 'weekly' and 'quarterly' resolutions. A very useful point that Matt Knox had coded is the possibility to specify starting points for switching from one resolution to another. For example, you can have a series with an 'ANN_MAR' frequency, which corresponds to 1 point a year, the year starting in April. When switching back to a monthly resolution, the points from January to March of the first year will be masked. Another useful point would be to allow the user to define his/her own resolution (every 15min, every 12h...). Right now it's a bit clunky in TimeSeries: we have to use the lowest resolution of the series (min, hour) and leave a lot of blanks (TimeSeries don't have to be regularly spaced, but it helps...)
Now, it comes the tricky part: how to integrate the notion of 'resolution' with the 'dtype' data type factory of NumPy?
In TimeSeries, the frequency is stored as an integer. For example, a daily frequency is stored as 6000, an annual frequency as 1000, a 'ANN_MAR' frequency as 1003...
On Monday 14 July 2008, Pierre GM wrote:
On Monday 14 July 2008 09:07:47 Francesc Alted wrote:
The advantage of this abstraction is that the user can easily choose the scale of resolution that better fits his need. I'm thinking in providing the next resolutions:
["femtosec", "picosec", "nanosec", "microsec", "millisec", "sec", "min", "hour", "month", "year"]
In TimeSeries, we don't have anything less than a second, but we have 'daily', 'business daily', 'weekly' and 'quarterly' resolutions.
Yes, I forgot the "day" resolution. I suppose that "weekly" and "quarterly" could be added too. However, if we adopt a new way to specify the resolution (see later), these can be stated as '7d' and '3m' respectively. Mmh, not sure about "business daily"; this may be useful in time series, but I don't find a reasonable meaning for it as a 'time resolution' (which is a different concept from 'time frequency'). So I'd leave it out.
A very useful point that Matt Knox had coded is the possibility to specify starting points for switching from one resolution to another. For example, you can have a series with a 'ANN_MAR' frequency, that corresponds to 1 point a year, the year starting in April. When switching back to a monthly resolution, the points from January to March of the first year will be masked.
Ok. Anne was also suggesting that the origin of time should be configurable, but then, you are talking about *masking* values. Mmm, I don't think we should try to incorporate masking capabilities in the NumPy date/time types. At any rate, I've not thought about the possibility of having an origin defined by the user, but if we could add the 'resolution' metainfo, I don't see why we couldn't do the same with the 'origin' metainfo too.
Another useful point would be allow the user to define his/her own resolution (every 15min, every 12h...). Right now it's a bit clunky in TimeSeries, we have to use the lowest resolution of the series (min, hour) and leave a lot of blanks (TimeSeries don't have to be regularly spaced, but it helps...)
Ok. I see the use case for this, but for implementation purposes, we should come up with a more complete way to specify the resolution than I realized before. Hmm, what about the following:

    [N]timeunit

where ``timeunit`` can take the values in:

    ['y', 'm', 'd', 'h', 'm', 's', 'ms', 'us', 'ns', 'fs']

so, for example, '14d' means a resolution of 14 days, and '10ms' means a resolution of one hundredth of a second. Sounds good to me. What do other people think?
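A minimal sketch of how such an ``[N]timeunit`` string might be split into a multiplier and a unit code. The function name ``parse_resolution`` is hypothetical, and defaulting a missing ``N`` to 1 is an assumption. Since the list above contains 'm' twice (month and minute), the sketch only separates the two parts and makes no attempt to decide what 'm' means.

```python
import re

# Unit codes as listed in the message. 'm' appears twice there (month and
# minute), so this sketch only splits the string and does not interpret 'm'.
UNITS = ("y", "m", "d", "h", "s", "ms", "us", "ns", "fs")

_SPEC = re.compile(r"(\d*)([a-z]+)")


def parse_resolution(spec):
    """Split a '[N]timeunit' string into (count, unit); N defaults to 1."""
    match = _SPEC.fullmatch(spec)
    if match is None or match.group(2) not in UNITS:
        raise ValueError(f"not a valid resolution: {spec!r}")
    count = int(match.group(1)) if match.group(1) else 1
    if count < 1:
        raise ValueError("the multiplier must be at least 1")
    return count, match.group(2)


print(parse_resolution("14d"))
print(parse_resolution("10ms"))
```

Any real notation would need distinct codes for month and minute before the unit could be mapped to a duration.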
Now, it comes the tricky part: how to integrate the notion of 'resolution' with the 'dtype' data type factory of NumPy?
In TimeSeries, the frequency is stored as an integer. For example, a daily frequency is stored as 6000, an annual frequency as 1000, a 'ANN_MAR' frequency as 1003...
Well, I initially planned to keep the resolution as an enumerated value (an int8 would be enough), but if the new way to specify resolutions goes ahead, I'm afraid that we may need a full int64 to save this. But apart from that, this should not be a problem (in general, the metainfo is a very tiny part of the space taken by a dataset). Cheers, -- Francesc Alted
On Monday 14 July 2008 12:50:21 Francesc Alted wrote:
A very useful point that Matt Knox had coded is the possibility to specify starting points for switching from one resolution to another. For example, you can have a series with a 'ANN_MAR' frequency, that corresponds to 1 point a year, the year starting in April. When switching back to a monthly resolution, the points from January to March of the first year will be masked.
Ok. Anne was also suggesting that the origin of time would be configurable, but then, you are talking about *masking* values. Mmm, I don't think we should try to incorporate masking capabilities in the NumPy date/time types.
Francesc,

In scikits.timeseries, we have 2 different objects:

* DateArray, which is basically a ndarray of integers with a given 'frequency' attribute.
* TimeSeries, which is basically the combination of a MaskedArray (the data part) and a DateArray (which keeps track of the date corresponding to each data point).

TimeSeries objects have the resolution/origin of the companion DateArray, and when they're converted from one resolution to another, some masking may occur.

My understanding is that you intend to define an object similar to DateArray. You want to define a new dtype (datetime64 or other); we used yet another class instead, Date. A dtype would be easier to manipulate, but as neither Matt nor I were particularly experienced with that at the time, we followed the simpler approach of a class...
[N]timeunit
where ``timeunit`` can take the values in:
['y', 'm', 'd', 'h', 'm', 's', 'ms', 'us', 'ns', 'fs']
so, for example, '14d' means a resolution of 14 days, or '10ms' means a resolution of 1 hundredth of second. Sounds good to me. What other people think?
Sounds pretty cool and intuitive to use. However, writing the conversion rules from one to another will be a lot of fun. Take weekly, for example: that's a period of 7 days, but when does it start ? On a monday ? Then, 12/31/2007 was the start of the first week of 2008... OK, we can leave that problem for the moment...
On Monday 14 July 2008, Pierre GM wrote:
On Monday 14 July 2008 12:50:21 Francesc Alted wrote:
A very useful point that Matt Knox had coded is the possibility to specify starting points for switching from one resolution to another. For example, you can have a series with a 'ANN_MAR' frequency, that corresponds to 1 point a year, the year starting in April. When switching back to a monthly resolution, the points from January to March of the first year will be masked.
Ok. Anne was also suggesting that the origin of time would be configurable, but then, you are talking about *masking* values. Mmm, I don't think we should try to incorporate masking capabilities in the NumPy date/time types.
Francesc, In scikits.timeseries, we have 2 different objects: * DateArray, which is basically a ndarray of integers with a given 'frequency' attribute. * TimeSeries, which is basically the combination of a MaskedArray (the data part) and a DateArray (which keeps track of the date corresponding to each data point). TimeSeries object have the resolution/origin of the companion DateArray, and when they're converted from one resolution to another, some masking may occur.
My understanding is that you intend to define an object similar to DateArray. You want to define a new dtype (datetime64 or other), we used yet another class instead, Date. A dtype would be easier to manipulate, but as neither Matt nor I were particularly experienced with that at the time, we followed the simpler approach of a class...
Well, what we are after is precisely this: a new dtype. After integrating it in NumPy, I suppose that your DateArray would be similar to a NumPy array with a ``datetime64`` dtype (bar the conceptual differences between your 'frequency' behind DateArray and the 'resolution' behind the datetime64 dtype).
[N]timeunit
where ``timeunit`` can take the values in:
['y', 'm', 'd', 'h', 'm', 's', 'ms', 'us', 'ns', 'fs']
so, for example, '14d' means a resolution of 14 days, or '10ms' means a resolution of 1 hundredth of second. Sounds good to me. What other people think?
Sounds pretty cool and intuitive to use. However, writing the conversion rules from one to another will be a lot of fun. Take weekly, for example: that's a period of 7 days, but when does it start ? On a monday ? Then, 12/31/2007 was the start of the first week of 2008... OK, we can leave that problem for the moment...
It would start when the origin tells that it should start. It is important to note that our proposal will not force a '7d' (seven days) 'tick' to start on monday, or a '1m' (one month) to start the 1st day of a calendar month, but rather where the user decides to set its origin. Cheers, -- Francesc Alted
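The point about the origin deciding where a '7d' tick starts can be sketched with standard-library dates. The function ``week_tick`` and the two sample origins are hypothetical names chosen for illustration; the proposal itself does not define them.

```python
from datetime import date


def week_tick(day, origin):
    """Index of the '7d' tick containing `day`, counted from `origin`."""
    return (day - origin).days // 7


monday_origin = date(2007, 12, 31)   # a Monday
thursday_origin = date(1970, 1, 1)   # a Thursday

# With a Monday origin, Monday through Sunday fall in the same tick.
print(week_tick(date(2008, 1, 6), monday_origin))
print(week_tick(date(2008, 1, 7), monday_origin))

# With a Thursday origin, the tick index changes between Wednesday and Thursday.
print(week_tick(date(2008, 1, 2), thursday_origin))
print(week_tick(date(2008, 1, 3), thursday_origin))
```

Floor division keeps the behaviour consistent for days before the origin, which land in negative ticks.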
On Monday 14 July 2008 14:17:18 Francesc Alted wrote:
Well, what we are after is precisely this: a new dtype type. After integrating it in NumPy, I suppose that your DateArray would be similar than a NumPy array with a dtype ``datetime64`` (bar the conceptual differences between your 'frequency' behind DateArray and the 'resolution' behind the datetime64 dtype).
Well, you're losing me on this one: could you explain the difference between the two concepts ? It might only be a problem of vocabulary...
It would start when the origin tells that it should start. It is important to note that our proposal will not force a '7d' (seven days) 'tick' to start on monday, or a '1m' (one month) to start the 1st day of a calendar month, but rather where the user decides to set its origin.
OK, so we need 2 flags, one for the resolution, one for the origin. Because there won't be that many resolutions possible, an int8 should be sufficient. What do you have in mind for the origin? When using a resolution coarser than 1d (7d, 1m, 3m, 1a), an origin in days is OK. What about less than a day?
On Monday 14 July 2008, Pierre GM wrote:
On Monday 14 July 2008 14:17:18 Francesc Alted wrote:
Well, what we are after is precisely this: a new dtype type. After integrating it in NumPy, I suppose that your DateArray would be similar than a NumPy array with a dtype ``datetime64`` (bar the conceptual differences between your 'frequency' behind DateArray and the 'resolution' behind the datetime64 dtype).
Well, you're losing me on this one: could you explain the difference between the two concepts ? It might only be a problem of vocabulary...
Maybe it is only that. But when I see the term 'frequency', I tend to think that you expect to have one entry (observation) in your array for each time 'tick' since the start of time. OTOH, the term 'resolution' doesn't have this implication, and only states the precision of the timestamp. I don't know whether my impression is true or not, but after reading about your TimeSeries package, I still think that this expectation of one observation per 'tick' was what drove you to choose the 'frequency' name.
It would start when the origin tells that it should start. It is important to note that our proposal will not force a '7d' (seven days) 'tick' to start on monday, or a '1m' (one month) to start the 1st day of a calendar month, but rather where the user decides to set its origin.
OK, so we need 2 flags, one for the resolution, one for the origin. Because there won't be that many resolution possible, an int8 should be sufficient. What do you have in mind for the origin ? When using a resolution coarser than 1d (7d, 1m, 3m, 1a), an origin in day is OK. What about less than a day ?
Well, after reading the mails from Chris and Anne, I think the best is that the origin would be kept as an int64 with a resolution of microseconds (for compatibility with the ``datetime`` module, as I've said before). Cheers, -- Francesc Alted
On Tuesday 15 July 2008 07:30:09 Francesc Alted wrote:
Maybe is only that. But by using the term 'frequency' I tend to think that you are expecting to have one entry (observation) in your array for each time 'tick' since time start. OTOH, the term 'resolution' doesn't have this implication, and only states the precision of the timestamp.
OK, now I get it.
I don't know whether my impression is true or not, but after reading about your TimeSeries package, I'm still thinking that this expectation of one observation per 'tick' was what driven you to choose the 'frequency' name.
Well, we do require "one point per tick" for some operations, such as conversion from one frequency to another, but only for TimeSeries. A DateArray doesn't have to be regularly spaced.
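To make the "one point per tick" condition concrete, here is a small sketch that checks it on a list of integer ticks. The helper ``has_one_point_per_tick`` is a hypothetical name and is not part of scikits.timeseries.

```python
def has_one_point_per_tick(ticks):
    """True if the ticks, once sorted, advance by exactly one each time."""
    ordered = sorted(ticks)
    return all(later - earlier == 1 for earlier, later in zip(ordered, ordered[1:]))


regular = [10, 11, 12, 13]   # one observation for every tick
irregular = [10, 11, 15]     # valid timestamps, but with gaps

print(has_one_point_per_tick(regular))
print(has_one_point_per_tick(irregular))
```

Both lists are perfectly good timestamps at a given resolution; only the first satisfies the stricter regular-spacing requirement.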
On Tuesday 15 July 2008, Pierre GM wrote:
On Tuesday 15 July 2008 07:30:09 Francesc Alted wrote:
Maybe is only that. But by using the term 'frequency' I tend to think that you are expecting to have one entry (observation) in your array for each time 'tick' since time start. OTOH, the term 'resolution' doesn't have this implication, and only states the precision of the timestamp.
OK, now I get it.
I don't know whether my impression is true or not, but after reading about your TimeSeries package, I'm still thinking that this expectation of one observation per 'tick' was what driven you to choose the 'frequency' name.
Well, we do require a "one point per tick" for some operations, such as conversion from one frequency to another, but only for TimeSeries. A Date Array doesn't have to be regularly spaced.
Ok, I see. So, it is just the 'frequency' keyword that was misleading me. Thanks for the clarification. Cheers, -- Francesc Alted
2008/7/15 Francesc Alted <faltet@pytables.org>:
Maybe is only that. But by using the term 'frequency' I tend to think that you are expecting to have one entry (observation) in your array for each time 'tick' since time start. OTOH, the term 'resolution' doesn't have this implication, and only states the precision of the timestamp.
Well, after reading the mails from Chris and Anne, I think the best is that the origin would be kept as an int64 with a resolution of microseconds (for compatibility with the ``datetime`` module, as I've said before).
A couple of details worth pointing out: we don't need a zillion resolutions. One that's as good as the world time standards, and one that spans an adequate length of time should cover it. After all, the only reason for not using the highest available resolution is if you want to cover a larger range of times. So there is no real need for microseconds and milliseconds and seconds and days and weeks and... There is also no need for the origin to be kept with a resolution as high as microseconds; seconds would do just fine, since if necessary it can be interpreted as "exactly 7000 seconds after the epoch" even if you are using femtoseconds elsewhere. Anne
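The remark about a coarse origin combined with fine ticks can be sketched with exact integer arithmetic. The function ``absolute_femtoseconds`` is a hypothetical name; Python integers are used here because they do not overflow, which is an illustration convenience and not a statement about how an int64-based dtype would store the value.

```python
FS_PER_SECOND = 10**15


def absolute_femtoseconds(ticks_fs, origin_seconds):
    """Femtoseconds since the epoch for a tick counted from an origin in seconds."""
    # The origin is exact even though it is stored in whole seconds:
    # it is simply scaled up to the tick resolution before adding.
    return origin_seconds * FS_PER_SECOND + ticks_fs


# One femtosecond after "exactly 7000 seconds after the epoch".
print(absolute_femtoseconds(1, 7000))
```

No precision is lost by keeping the origin in seconds, because the conversion to femtoseconds is a multiplication by an exact integer factor.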
On Tuesday 15 July 2008, Anne Archibald wrote:
2008/7/15 Francesc Alted <faltet@pytables.org>:
Maybe is only that. But by using the term 'frequency' I tend to think that you are expecting to have one entry (observation) in your array for each time 'tick' since time start. OTOH, the term 'resolution' doesn't have this implication, and only states the precision of the timestamp.
Well, after reading the mails from Chris and Anne, I think the best is that the origin would be kept as an int64 with a resolution of microseconds (for compatibility with the ``datetime`` module, as I've said before).
A couple of details worth pointing out: we don't need a zillion resolutions. One that's as good as the world time standards, and one that spans an adequate length of time should cover it. After all, the only reason for not using the highest available resolution is if you want to cover a larger range of times. So there is no real need for microseconds and milliseconds and seconds and days and weeks and...
Maybe you are right, but by providing many resolutions we are trying to cope with the needs of people who use them a lot. In particular, we hope that the authors of the timeseries scikit can find in this new dtype a fair replacement for their Date class (our proposal will not be as featureful, but...).
There is also no need for the origin to be kept with a resolution as high as microseconds; seconds would do just fine, since if necessary it can be interpreted as "exactly 7000 seconds after the epoch" even if you are using femtoseconds elsewhere.
Good point. However, we finally managed to not include the ``origin`` metadata in our new proposal. Have a look at the second proposal that I'll be posting soon for details. Cheers, -- Francesc Alted
Maybe you are right, but by providing many resolutions we are trying to cope with the needs of people that are using them a lot. In particular, we are willing that the authors of the timeseries scikit can find on these new dtype a fair replacement of their Date class (our proposal will be not so featured, but...).
I think a basic date/time dtype for numpy would be a nice addition for general usage. Now as for the timeseries module using this dtype for most of the date-fu that goes on... that would be a bit more challenging. Unless all of the frequencies/resolutions currently supported in the timeseries scikit are supported with the new dtype, it is unlikely we would be able to replace our implementation. In particular, business day frequency (Monday - Friday) is of central importance for working with financial time series (which was my motivation for the original prototype of the module). But using plain integers for the DateArray class actually seems to work pretty well and I'm not sure a whole lot would be gained by using a date dtype. That being said, if someone creates a fork of the timeseries module using a new date dtype at its core and it works amazingly well, then I'd probably get on board. I just think that may be difficult to do with a general purpose date dtype suitable for inclusion in the numpy core. - Matt
On Thursday 17 July 2008, Matt Knox wrote:
Maybe you are right, but by providing many resolutions we are trying to cope with the needs of people that are using them a lot. In particular, we are willing that the authors of the timeseries scikit can find on these new dtype a fair replacement of their Date class (our proposal will be not so featured, but...).
I think a basic date/time dtype for numpy would be a nice addition for general usage.
Now as for the timeseries module using this dtype for most of the date-fu that goes on... that would be a bit more challenging. Unless all of the frequencies/resolutions currently supported in the timeseries scikit are supported with the new dtype, it is unlikely we would be able to replace our implementation. In particular, business day frequency (Monday - Friday) is of central importance for working with financial time series (which was my motivation for the original prototype of the module). But using plain integers for the DateArray class actually seems to work pretty well and I'm not sure a whole lot would be gained by using a date dtype.
Yeah, the business week. We've pondered including this, but we are not sure about the differences between such a thing and a calendar week in terms of a time unit. I can certainly see its merits in the TimeSeries module, but I'm afraid that it would be nonsense in the context of a general date/time dtype.

Now that I think about it, maybe we should revise our initial intention of adding a quarter too, because ISO 8601 does not offer a way to print it nicely. We could also opt for extending the ISO 8601 representation in order to allow the following sort of string representation:

    In [35]: array([70, 72, 19], 'datetime64[Q]')
    Out[35]: array([1988Q2, 1988Q4, 1975Q3], dtype="datetime64[Q]")

but I don't know if this would unnecessarily complicate things (apart from representing a departure from standards :-/).
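A sketch of how a quarter tick could be turned into a ``YYYYQn`` label. The function ``quarter_label`` is hypothetical, and it assumes that tick 0 denotes the first quarter of 1970; the labels depend entirely on that choice of origin, so its output need not match the illustrative values in the message above.

```python
def quarter_label(tick, origin_year=1970):
    """Label for a quarter tick, assuming tick 0 is the first quarter of origin_year."""
    # divmod floors toward negative infinity, so ticks before the origin
    # still map to a quarter number between 1 and 4.
    years, quarter = divmod(tick, 4)
    return f"{origin_year + years}Q{quarter + 1}"


print(quarter_label(0))
print(quarter_label(5))
print(quarter_label(-1))
```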
That being said, if someone creates a fork of the timeseries module using a new date dtype at its core and it works amazingly well, then I'd probably get on board. I just think that may be difficult to do with a general purpose date dtype suitable for inclusion in the numpy core.
Yeah, I understand your reasons. In fact, it is a pity that your requirements diverge in some key points from our proposal for the general dtypes. I have had a look at how you have integrated recarrays in your TimeSeries module, and I'm sure that by choosing a date/time dtype you would be able to reduce the complexity (and especially improve the efficiency) of your code quite a bit. Cheers, -- Francesc Alted
On Mon, 14 Jul 2008, Francesc Alted apparently wrote:
Before giving more thought to the new proposal of the date/time types for NumPy based in the concept of 'resolution', I'd like to gather more feedback on your opinions about this.
It might be a good idea to run the proposal(s) past Marc-Andre Lemburg mal (at) egenix (dot) com Cheers, Alan Isaac
On Monday 14 July 2008, Alan G Isaac wrote:
On Mon, 14 Jul 2008, Francesc Alted apparently wrote:
Before giving more thought to the new proposal of the date/time types for NumPy based in the concept of 'resolution', I'd like to gather more feedback on your opinions about this.
It might be a good idea to run the proposal(s) past Marc-Andre Lemburg mal (at) egenix (dot) com
Sure. And maybe also to Fred Drake, the original author of the ``datetime`` module. However, I'd prefer to send them something in a more advanced state of refinement than it is now. Thanks for the suggestion, -- Francesc Alted
2008/7/14 Francesc Alted <faltet@pytables.org>:
After pondering about the opinions about the first proposal, the idea we are incubating is to complement the ``datetime64`` with a 'resolution' metainfo. The ``datetime64`` will still be based on a int64 type, but the meaning of the 'ticks' would depend on a 'resolution' property.
This is an interesting idea. To be useful, though, you would also need a flexible "offset" defining the zero of time. After all, the reason not to just always use (say) femtosecond accuracy is that 2**64 femtoseconds is only about five hours. So if you're going to use femtosecond steps, you really want to choose your start point carefully. (It's also worth noting that there is little need for more time accuracy than atomic clocks can provide, since anyone looking for more than that is going to be doing some tricky metrology anyway.) One might take guidance from the FITS format, which represents (arrays of) quantities as (usually) fixed-point numbers, but has a global "scale" and "offset" parameter for each array. This allows one to accurately represent many common arrays with relatively few bits. The FITS libraries transparently convert these quantities. Of course, this isn't so convenient if you don't have basic machine datatypes with enough precision to handle all the quantities of interest. Anne
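The trade-off between resolution and range can be checked with a few lines of arithmetic. The table ``TICKS_PER_SECOND`` and the function ``span_seconds`` are hypothetical names, and the Julian year is used only to give a sense of scale.

```python
# Ticks per second for the sub-second resolutions discussed in the thread.
TICKS_PER_SECOND = {
    "sec": 1,
    "millisec": 10**3,
    "microsec": 10**6,
    "nanosec": 10**9,
    "picosec": 10**12,
    "femtosec": 10**15,
}

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year, used only for scale


def span_seconds(resolution, bits=64):
    """Total time covered by a `bits`-wide counter at the given resolution."""
    return 2**bits / TICKS_PER_SECOND[resolution]


for name in TICKS_PER_SECOND:
    seconds = span_seconds(name)
    print(f"{name:>9}: {seconds / 3600:.3g} hours, {seconds / SECONDS_PER_YEAR:.3g} years")
```

At femtosecond resolution the whole 64-bit range covers a little over five hours, which is why the choice of origin matters so much for the finest resolutions.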
On Monday 14 July 2008, Anne Archibald wrote:
2008/7/14 Francesc Alted <faltet@pytables.org>:
After pondering about the opinions about the first proposal, the idea we are incubating is to complement the ``datetime64`` with a 'resolution' metainfo. The ``datetime64`` will still be based on a int64 type, but the meaning of the 'ticks' would depend on a 'resolution' property.
This is an interesting idea. To be useful, though, you would also need a flexible "offset" defining the zero of time. After all, the reason not to just always use (say) femtosecond accuracy is that 2**64 femtoseconds is only about five hours. So if you're going to use femtosecond steps, you really want to choose your start point carefully. (It's also worth noting that there is little need for more time accuracy than atomic clocks can provide, since anyone looking for more than that is going to be doing some tricky metrology anyway.)
That's a good point indeed. Well, to start with, I suppose that picosecond resolution is more than enough for today's precision standards (even when using atomic clocks). However, given that atomic clocks are always improving their precision [1], having a femtosecond resolution is not going to bother people, I think.

[1] http://en.wikipedia.org/wiki/Image:Clock_accurcy.jpg

But the time origin is certainly an issue, yes. See later.
One might take guidance from the FITS format, which represents (arrays of) quantities as (usually) fixed-point numbers, but has a global "scale" and "offset" parameter for each array. This allows one to accurately represent many common arrays with relatively few bits. The FITS libraries transparently convert these quantities. Of course, this isn't so convenient if you don't have basic machine datatypes with enough precision to handle all the quantities of interest.
That's pretty interesting in that the "scale" is certainly something similar to the "resolution" concept that we want to introduce. And definitely, "offset" would be similar to "origin". So yes, we will try to introduce both concepts. However, one thing that we would try to avoid is using fixed-point arithmetic (we plan to use integer arithmetic only). The rationale is that fixed-point arithmetic is computationally more complex (it has to be implemented in software, while integer arithmetic is implemented in hardware) and that would slow things down too much. Thanks! -- Francesc Alted
participants (5)
- Alan G Isaac
- Anne Archibald
- Francesc Alted
- Matt Knox
- Pierre GM