[Numpy-discussion] timezones and datetime64

Nathaniel Smith njs at pobox.com
Wed Apr 3 09:49:44 EDT 2013


On Wed, Apr 3, 2013 at 2:26 PM, Dave Hirschfeld
<dave.hirschfeld at gmail.com> wrote:
> Andreas Hilboll <lists <at> hilboll.de> writes:
>> > I think your point about using current timezone in interpreting user
>> > input being dangerous is probably correct --- perhaps UTC all the way
>> > would be a safer (and simpler) choice?
>>
>> +1
>>
>
> +10 from me!
>
> I've recently come across a bug caused by numpy interpreting date
> strings as being in the local timezone.
>
> The data comes from a database query where no timezone information is
> supplied (the dates are stored as strings). The assumption is that the
> user doesn't need to know the timezone - i.e. the dates are timezone-naive.
>
> Working out the correct timezones would be fairly laborious, but whatever the
> correct timezones are, they're certainly not the timezone the current user
> happens to find themselves in!
>
> e.g.
>
> In [32]: import numpy as np
>     ...: import pandas as pd
>     ...: rs = [
>     ...: (u'2000-01-17 00:00:00.000000', u'2000-02-01', u'2000-02-29', 0.1203),
>     ...: (u'2000-01-26 00:00:00.000000', u'2000-02-01', u'2000-02-29', 0.1369),
>     ...: (u'2000-01-18 00:00:00.000000', u'2000-03-01', u'2000-03-31', 0.1122),
>     ...: (u'2000-02-25 00:00:00.000000', u'2000-03-01', u'2000-03-31', 0.1425)
>     ...: ]
>     ...: dtype = [('issue_date', 'datetime64[ns]'),
>     ...:          ('start_date', 'datetime64[D]'),
>     ...:          ('end_date', 'datetime64[D]'),
>     ...:          ('value', float)]
>
> In [33]: # What I see in London, UK
>     ...: recordset = np.array(rs, dtype=dtype)
>     ...: df = pd.DataFrame(recordset)
>     ...: df = df.set_index('issue_date')
>     ...: df
>     ...:
> Out[33]:
>                     start_date            end_date   value
> issue_date
> 2000-01-17 2000-02-01 00:00:00 2000-02-29 00:00:00  0.1203
> 2000-01-26 2000-02-01 00:00:00 2000-02-29 00:00:00  0.1369
> 2000-01-18 2000-03-01 00:00:00 2000-03-31 00:00:00  0.1122
> 2000-02-25 2000-03-01 00:00:00 2000-03-31 00:00:00  0.1425
>
> In [34]: # What my colleague sees in Auckland, NZ
>     ...: recordset = np.array(rs, dtype=dtype)
>     ...: df = pd.DataFrame(recordset)
>     ...: df = df.set_index('issue_date')
>     ...: df
>     ...:
> Out[34]:
>                              start_date            end_date   value
> issue_date
> 2000-01-16 11:00:00 2000-02-01 00:00:00 2000-02-29 00:00:00  0.1203
> 2000-01-25 11:00:00 2000-02-01 00:00:00 2000-02-29 00:00:00  0.1369
> 2000-01-17 11:00:00 2000-03-01 00:00:00 2000-03-31 00:00:00  0.1122
> 2000-02-24 11:00:00 2000-03-01 00:00:00 2000-03-31 00:00:00  0.1425
>
>
> Oh dear!
>
> This isn't acceptable for my use case (in a multinational company), and I
> found no reasonable way around it other than bypassing the numpy conversion
> entirely: set the dtype to object, parse the strings manually, and build the
> array from the resulting list of datetime objects.
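
For concreteness, a minimal sketch of the object-dtype workaround Dave
describes (illustrative only, not his actual code; it reuses the rs
list from In [32] above):

    from datetime import datetime
    import numpy as np
    import pandas as pd

    def parse_naive(row):
        # Parse the strings ourselves so numpy never gets the chance
        # to apply the local timezone.
        issue, start, end, value = row
        return (datetime.strptime(issue, '%Y-%m-%d %H:%M:%S.%f'),
                datetime.strptime(start, '%Y-%m-%d'),
                datetime.strptime(end, '%Y-%m-%d'),
                value)

    dtype = [('issue_date', object), ('start_date', object),
             ('end_date', object), ('value', float)]
    recordset = np.array([parse_naive(r) for r in rs], dtype=dtype)
    df = pd.DataFrame(recordset).set_index('issue_date')  # same on any machine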

Wow, that's truly broken. I'm sorry.

I'm skeptical that just switching to UTC everywhere is actually the
right solution. It smells like one of those solutions that's simple,
neat, and wrong. (I don't know anything about calendar-time series
handling, so I have no ability to actually judge this stuff, but
wouldn't one problem be that you lose the original local day and time
once you move everything to UTC, e.g. if you want to reason about
business days/hours?) Maybe datetime dtypes should be parametrized by
both granularity and timezone? Or we could just declare that
datetime64 is always timezone-naive and adjust the code to match?
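
For anyone bitten by this in the meantime, one hedged stopgap (a
sketch assuming numpy 1.7's ISO 8601 parsing, which accepts an
explicit offset) is to pin the strings to UTC yourself, so the parse
no longer depends on the machine's timezone:

    import numpy as np

    # No offset: numpy 1.7 applies the *local* timezone, so London and
    # Auckland machines end up with different underlying values.
    ambiguous = np.array(['2000-01-17T00:00'], dtype='datetime64[ns]')

    # A trailing 'Z' pins the string to UTC; every machine gets the
    # same value.
    pinned = np.array(['2000-01-17T00:00Z'], dtype='datetime64[ns]')

That only sidesteps the parsing question, of course; it doesn't settle
what the dtype's semantics ought to be.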

I'll CC the pandas list in case they have some insight. Unfortunately,
AFAIK no-one who's regularly working on numpy at this point works with
datetimes, so we have limited ability to judge solutions... please
help!

-n


