[Numpy-discussion] The date/time dtype and the casting issue

Tue Jul 29 14:47:34 EDT 2008

On Tuesday 29 July 2008, Tom Denniston wrote:
> Francesc,
>
> The datetime proposal is very impressive in its depth and thought.
> For me as well as many other people this would be a massive
> improvement to numpy and allow numpy to get a foothold in areas like
> econometrics where R/S is now dominant.
>
> I had one question regarding casting of strings:
>
> I think it would be ideal if things like the following worked:
> >>> series = numpy.array(['1970-02-01','1970-09-01'],
> ...                      dtype='datetime64[D]')
> >>> series == '1970-02-01'
>
> [True, False]
>
> I view this as similar to:
> >>> series = numpy.array([1,2,3], dtype=float)
> >>> series == 2
>
> [False,True,False]

Good point.  Well, I agree that setting elements from strings, i.e.:

>>> t = numpy.ones(3, 'T8[D]')
>>> t[0] = '2001-01-01'

should be supported.  With this, and applying the broadcasting rules, 
the following:

>>> t == '2001-01-01'
[True, False, False]

should work without problems.  We will try to add this explicitly to 
the new proposal.

> 1. However, numpy does recognize that an int is comparable with a
> float and does the float cast.  I think you want the same behavior
> between strings that parse into dates and date arrays.  Some might
> object that the relationship between string and date is more tenuous
> than float and int, which is true, but having used my own homespun
> date array numpy extension for over a year, I've found that the first
> thing I did was wrap it into an object that handles these
> string->date translations elegantly and that made it infinitely more
> usable from an ipython session.

Well, you should not worry about this.  Hopefully, in the

>>> t == '2001-01-01'

comparison, the scalar part of the expression can be cast into a date 
array, and then the proper comparison will be performed.  If this 
cannot be done for some reason that escapes me, one will always be able 
to do:

>>> t == numpy.datetime64('2001-01-01', 'Y')
[True, False, False]

which is a bit more verbose, but much clearer too.
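
As a hedged illustration of the explicit form, the sketch below assumes 
that the dtype ends up spelled ``datetime64[D]`` and that ISO 8601 
strings are accepted both when building the array and when setting an 
element.  Those spellings are assumptions for the example, not settled 
parts of the proposal.

```python
import numpy as np

# Build a day-resolution date array from ISO 8601 strings.
series = np.array(['1970-02-01', '1970-09-01'], dtype='datetime64[D]')

# Explicit scalar: verbose, but the type and unit are unambiguous.
mask = series == np.datetime64('1970-02-01', 'D')
print(mask.tolist())

# Setting an element from a string, then comparing again.
series[1] = '1970-02-01'
mask_after = series == np.datetime64('1970-02-01', 'D')
print(mask_after.tolist())
```

The scalar is broadcast against the array, so the result is a boolean 
array with one entry per element.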

> 2. Even more important to me, however, is the issue of date parsing.
> The mx library does many things badly but it does do a great job of
> parsing dates of many formats.  When you parse '1/1/95' or
> '1995-01-01' it knows that you mean 19950101 which is really nice.  I
> believe the scipy timeseries code for parsing dates is based on it. 
> I would highly suggest starting with that level of functionality. 
> The one major issue with it is an uninterpretable date doesn't throw
> an error but becomes whatever date is right now.  That is obviously
> unfavorable.

Hmmm.  We would not like to clutter the NumPy core with too much date 
string parsing code.  As stated in the proposal, we only plan to 
support parsing of the ISO 8601 format.  That should be enough for most 
purposes.  However, I'm sure that parsing for other formats will be 
available in the ``Date`` class of the TimeSeries package.
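
To show how other formats could be handled outside the core, here is a 
minimal sketch that normalizes a couple of common spellings to ISO 8601 
before they reach NumPy.  The helper name ``to_iso_date`` and the format 
list are hypothetical and only for illustration; this is not the 
TimeSeries ``Date`` parser.  Note that it raises on an uninterpretable 
string instead of silently returning the current date.

```python
from datetime import datetime

# Hypothetical format list; a real parser would support many more.
_FORMATS = ('%Y-%m-%d', '%m/%d/%y')


def to_iso_date(text):
    """Return text as an ISO 8601 date string, or raise ValueError."""
    for fmt in _FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    raise ValueError('unrecognized date: %r' % (text,))


print(to_iso_date('1/1/95'))
print(to_iso_date('1995-01-01'))
```

The strings returned this way are plain ISO 8601 dates, so the core 
would only ever need to understand that one format.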

> 3. Finally, my current implementation uses floats and uses nan to
> represent an invalid date.  When you assign an element of a date
> array to None it uses nan as the value.  When you assign a real date
> it puts in the equivalent floating point value.  I have found this to
> be hugely beneficial and just wanted to float the idea of reserving a
> value to indicate the floating point equivalent of nan.  People might
> prefer masked arrays as a solution, but I just wanted to float the
> idea.

Hmm, that's another very valid point.  In fact, Ivan and I had already 
foreseen the existence of a NaT (Not a Time), represented as the most 
negative integer (-2**63).  However, as the underlying type of the 
proposed time type is an int64, arithmetic operations on the time types 
will be done through integer arithmetic, and unfortunately, the 
majority of platforms out there perform this kind of arithmetic as 
two's-complement arithmetic.  That means that there is no provision for 
handling NaTs in hardware:

In [58]: numpy.int64(-2**63)
Out[58]: -9223372036854775808  # this is a NaT

In [59]: numpy.int64(-2**63)+1
Out[59]: -9223372036854775807  # no longer a NaT

In [60]: numpy.int64(-2**63)-1
Out[60]: 9223372036854775807   # idem, and besides, positive!

So, well, due to this limitation, I'm afraid that we will have to live 
without proper handling of NaT values.  Perhaps this is the biggest 
limitation of choosing int64 as the base type of the date/time dtype 
(float64 is better in that regard, but it also has its disadvantages, 
like the variable precision which is intrinsic to it).
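
To make the cost concrete, here is a minimal sketch of what a software 
check would involve if NaT had to survive an addition on raw int64 
values.  The helper name ``add_ticks`` is hypothetical and not part of 
the proposal; it only illustrates the extra masking that hardware does 
for free with a floating point nan.

```python
import numpy as np

NAT = np.iinfo(np.int64).min  # -2**63, the sentinel described above


def add_ticks(times, delta):
    """Add delta ticks to int64 times, keeping NaT entries as NaT."""
    times = np.asarray(times, dtype=np.int64)
    is_nat = times == NAT
    # Replace NaT by 0 before adding so the sentinel never wraps around.
    result = np.where(is_nat, 0, times) + np.int64(delta)
    # Put the sentinel back wherever the input was NaT.
    result[is_nat] = NAT
    return result


out = add_ticks([NAT, 10], 1)
print(out.tolist())
```

Every arithmetic operation would need this kind of comparison and 
masking pass, which is the overhead being weighed here.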

> Forgive me if any of this has already been covered.  There has been a
> lot of volume on this subject and I've tried to read it all
> diligently but may have missed a point or two.

Not at all.  You've touched important issues.  Thanks!

-- 
Francesc Alted