[Numpy-discussion] RFC: A proposal for implementing some date/time types in NumPy

Francesc Alted falted at pytables.org
Fri Jul 11 13:52:32 EDT 2008


On Friday 11 July 2008, Christopher Barker wrote:
> Francesc Alted wrote:
> > We are planning to implement some date/time types for NumPy,
>
> +1
>
> A couple questions/comments:
> > ``datetime64``
> >   - Expressed in microseconds since POSIX epoch (January 1, 1970).
> >
> >   - Resolution: nanoseconds.
>
> how is that possible? Is that a typo?

Exactly, it is a typo.  This should read *microseconds*.  I sent a 
corrected version earlier.

>
> >     This will be compatible with the Python ``datetime`` module
>
> very important!
>
> >   Observations::
> >
> >     This will not be fully compatible with the Python
> > ``datetime`` module, in terms of either precision or time-span.
> > However, getters and setters will be provided for it (losing
> > precision or overflowing as needed).
>
> How do you propose handling overflow? Would it raise an exception?

Yes.  We propose to use exactly the same exception handling as NumPy 
(so it will be configurable by the user).
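
To make the getter/setter behaviour a bit more concrete, here is a 
minimal pure-Python sketch, assuming the value is an integer count of 
microseconds since the POSIX epoch.  The names ``to_us`` and 
``from_us`` are hypothetical and not part of the proposal.  An int64 of 
microseconds spans roughly +-292,000 years, while ``datetime`` only 
covers years 1 to 9999, so in this sketch the overflow shows up in the 
getter:

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)
ONE_US = datetime.timedelta(microseconds=1)


def to_us(dt):
    """Setter direction: naive datetime -> integer microseconds since epoch."""
    return (dt - EPOCH) // ONE_US


def from_us(us):
    """Getter direction: integer microseconds since epoch -> datetime.

    Raises OverflowError when the result falls outside the range that
    the datetime module can represent.
    """
    return EPOCH + datetime.timedelta(microseconds=us)


now = datetime.datetime(2008, 7, 11, 19, 16, 10, 996509)
assert from_us(to_us(now)) == now      # exact round trip

try:
    from_us(2**62)                     # valid int64, far beyond year 9999
except OverflowError as exc:
    print("overflow:", exc)
```

Whether the real getter would raise, warn or silently clip is exactly 
what the configurable error handling would decide.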

>
> Another option would be to have a version that stored the datetime in
> two values: say two int64s or something (kind of like complex numbers
> are handled). This would allow a long time span and nanosecond (or
> finer) precision. I guess it would require a bunch of math code to be
> written, however.

I suppose so, yes.  Besides, this certainly violates the requirement of 
having a fast implementation (unless we want to spend a lot of time 
optimizing such a 'complex' date/time type).  There is also the problem 
of requiring more space.  See below.

>
> > * ``timefloat64``
> >   - Resolution: 1 microsecond (for +-32 years from epoch) or 14
> > digits (for distant years from epoch).  So the precision is
> > *variable*.
>
> I'm not sure this is that useful, exactly for that reason. What's the
> motivation for it? I can see using a float for timedelta -- as, in
> general, you'll need less precision the longer your time span, but
> having precision depend on how far you happen to be from the epoch
> seems risky (though for anything I do, it wouldn't matter in the
> least).

Well, as I said before, we wanted this mainly for 
geological/astronomical uses, but as this type has the property of 
having microsecond resolution during the years [1902 - 2038], it would 
definitely be useful for many other cases too.

I can say that Postgres, for one, implements a datetime type based on 
a float64 by default (although you can choose an int64 at compilation 
time) with exactly the same properties as ``timefloat64``.  So, if 
Postgres is doing this, it should definitely be useful in many use 
cases.
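
The variable precision is easy to check numerically.  The sketch below 
assumes, purely for illustration, that the value is stored as float64 
seconds from the epoch; ``resolution_at`` is a hypothetical helper, and 
``math.ulp`` needs Python 3.9 or later:

```python
import math

SECONDS_PER_YEAR = 365.25 * 86400   # Julian year, good enough for an estimate


def resolution_at(years_from_epoch):
    """Spacing between adjacent float64 values at this distance, in seconds."""
    return math.ulp(years_from_epoch * SECONDS_PER_YEAR)


for years in (1, 68, 1000, 1e6, 1e9):
    print(f"{years:>12g} years from epoch: {resolution_at(years):.3g} s")
```

Under that assumption the spacing stays below one microsecond near the 
epoch and grows to whole seconds at geological distances, which is the 
trade-off this type is meant to offer.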

>
> > Example of use
> >
> >   In [11]: t[0] = datetime.datetime.now()  # setter in action
> >
> >   In [12]: t[0]
> >   Out[12]: 733234384724   # representation as an int64 (scalar)
>
> hmm - could it return a numpy.datetime object instead, rather than a
> straight int64? I'd like to see a representation that is clearly
> datetime.

Could be. But we should not forget that we are implementing the type for 
an array package, and the output can become cumbersome very soon.   
What I wanted to avoid here was having this:

[datetime(2008, 7, 11, 19, 16, 10, 996509), datetime(2008, 7, 11, 19, 
16, 10, 996535), datetime(2008, 7, 11, 19, 16, 10, 996547), 
datetime(2008, 7, 11, 19, 16, 10, 996559), datetime(2008, 7, 11, 19, 
16, 10, 996568), dtype="datetime64"]

I prefer to see this:

[733234000000, 733234000000, 733234000000, 733234000000, 733234000000, 
dtype="datetime64"]

Hmm, although for a scalar representation, I agree that this is a bit 
too terse.  Maybe adding a 'T' (meaning 'T'ime type) at the end would 
be better?

In [12]: t[0]
Out[12]: 733234384724T

and hence:

[733234000000T, 733234000000T, 733234000000T, 733234000000T, 
733234000000T, dtype="datetime64"]

But it would be interesting to see what other people think.

>
> > About the ``mx.DateTime`` module
> > --------------------------------
> >
> > In this document, the emphasis has been put in comparing the
> > compatibility of future NumPy date/time types against the
> > ``datetime`` module that comes with Python.  Should we consider the
> > compatibility with mx.DateTime as well?
>
> No. The whole point of python's standard datetime is to have a common
> system with which to deal with date-time values -- it's too bad it
> didn't come sooner, so that mx.DateTime could have been built on it,
> but at this point, I think supporting the standard lib one is most
> important.

I see.

> I couldn't find documentation (not quickly, anyway) of how the
> datetime object stores its data internally, but it might be nice to
> support that protocol directly -- maybe that would make for too much
> math code to write, though.

The internal format for the datetime module is documented in the 
sources, and at first sight, supporting the protocol shouldn't be too 
difficult.
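
For the curious, the packed layout can be inspected from Python itself.  
This is only a sketch and relies on a CPython implementation detail 
(the pickle state of a naive ``datetime`` being 10 packed bytes), so it 
should not be taken as a stable protocol:

```python
import datetime

dt = datetime.datetime(2008, 7, 11, 19, 16, 10, 996509)

# CPython implementation detail: for a naive datetime, __reduce__()
# returns (class, (state,)) where state is 10 packed bytes.
state = dt.__reduce__()[1][0]

year = int.from_bytes(state[0:2], "big")          # 2 bytes, big-endian
month, day, hour, minute, second = state[2:7]     # 1 byte each
microsecond = int.from_bytes(state[7:10], "big")  # 3 bytes, big-endian

decoded = datetime.datetime(year, month, day, hour, minute, second,
                            microsecond)
assert decoded == dt
```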

> What about timedelta types?

Well, we have deliberately left timedelta out because we think that any 
of the three proposed types can act as a timedelta (this is another 
reason for keeping the proposed representation, i.e. not showing 
year/month/day/etc. info).  In fact, they represent an absolute time 
only by the convention of placing the origin of time at the UNIX epoch.  
If you don't impose this convention on your array, all of the time 
types can represent timedeltas.

However, I suppose that there is a problem with the getters and setters 
here, that is, how external ``datetime`` timedeltas interact with the 
new NumPy date/time types.  Thinking a bit, the setter should be 
relatively easy to implement:

In [37]: numpy.datetime64(datetime.timedelta(12))
Out[37]: 12T

For the getter, one can think of adding a new method (only available for 
the date/time types):

In [38]: t = numpy.datetime64(datetime.timedelta(12))

In [39]: t.totimedelta()
Out[39]: datetime.timedelta(12)

IMO, that would solve the issue without having to implement specific 
timedelta types.
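
The conversion underneath such a setter and getter could be as simple 
as the sketch below.  It assumes the stored value counts microseconds; 
``timedelta_to_us`` and ``us_to_timedelta`` are hypothetical helpers, 
not proposed API:

```python
import datetime

ONE_US = datetime.timedelta(microseconds=1)


def timedelta_to_us(td):
    """Setter direction: datetime.timedelta -> integer microseconds."""
    return td // ONE_US


def us_to_timedelta(us):
    """Getter direction: integer microseconds -> datetime.timedelta."""
    return datetime.timedelta(microseconds=us)


twelve_days = datetime.timedelta(12)
ticks = timedelta_to_us(twelve_days)
assert us_to_timedelta(ticks) == twelve_days   # exact round trip
```

Both directions are pure integer arithmetic, so no precision is lost as 
long as the value fits in the range of ``datetime.timedelta``.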

> My final thought is that while I see that different applications need
> different properties, having multiple representations seems like it
> will introduce a lot of maintenance, documentation and support
> issues. Maybe a single, more complicated representation would be a
> better bet (like using two ints, rather than one, to get both range
> and precision)

Yeah, but besides the fact that the implementation would be quite a bit 
slower, such a struct of two 'int64' values would take twice the space 
of the proposed time types, and this can be a killer for a package that 
is meant for dealing with large arrays of data.  [Incidentally, I was 
even pondering introducing some 32-bit date/time type precisely to save 
space, but as the usability of such a type would be really restricted, 
in the end I've opted not to include it].
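
Just to put numbers on the space argument, here is a back-of-envelope 
sketch; the array length is an arbitrary figure chosen for 
illustration:

```python
import struct

n = 10_000_000                          # hypothetical array length

one_int64 = struct.calcsize("=q")       # a single int64 per element
two_int64 = struct.calcsize("=qq")      # a pair of int64 per element

print("one int64 per element:", n * one_int64 / 1e6, "MB")
print("two int64 per element:", n * two_int64 / 1e6, "MB")
```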

> Thanks for working on this -- I think it will be a great addition to
> numpy!

Thanks for the excellent feedback too!

-- 
Francesc Alted


