[Python-Dev] Adding a tzidx cache to datetime

M.-A. Lemburg mal at egenix.com
Fri May 10 03:18:43 EDT 2019

On 10.05.2019 02:58, Paul Ganssle wrote:
> This is only "only" for dateutil in the sense that no one other than
> dateutil implements tzinfo with the interface provided. If dateutil were
> /not/ already implemented with a list of offsets and their indexes, I
> would still propose this, and just re-write dateutil to take advantage
> of it. From a cursory glance at pendulum, it seems that they could take
> advantage of it as well (though they use their own datetime subclass, so
> they have always had the ability to add this).
>> What do you think of adding a private "_cache" attribute which would
>> be an arbitrary Python object? (None by default)
> We cannot use a private attribute (other than to do the actual storage,
> since the thing that gets stored is not directly accessible anyway and
> is instead mediated by a layer that manages the cache) because this is a
> feature explicitly being added for use by tzinfo, /not/ by datetime. If
> it's private then it's not safe for implementations of tzinfo to
> actually use it, which defeats the purpose.
> Regarding the use of an arbitrary Python object: What I'm proposing is
> that we offer a bit of the "free" storage space in the alignment bits to
> tzinfo objects to use as a cache. In /most/ cases this will be very
> useful to someone implementing a tzinfo, because there are really only
> so many ways to accomplish this task, and most time zones are
> expressible as a very short list of offset/name/dst combinations, plus
> some rule for which applies when, which is why a small integer cache is
> sufficient and more or less universal (i.e. not specific to dateutil's
> implementation).
> I will also note that in my design, it is still possible for `tzinfo` to
> return something other than [0, 254], it's just that that information
> will not be cached, so it won't get the benefit of any optimization, but
> the same interface / implementation can be used.
> In my test with gcc, adding an additional PyObject* to the end of the
> PyDateTime_DateTime struct increased the size of the `datetime.datetime`
> object from 64 to 72 bytes, whereas adding an `unsigned char` after the
> `fold` leaves it unchanged. Given that the expansion to arbitrary Python
> objects is speculative and doesn't have any particular use case, I would
> prefer to leave the feature as is, and reconsider the possibility of
> storing arbitrary Python objects on the datetime if there's some
> compelling reason to do so (it would be a backwards-compatible change at
> that point anyway).

Given that many datetime objects in practice don't use timezones
(e.g. in large data stores you typically use UTC and naive datetime
objects), I think that making the object itself larger to accommodate
for a cache, which will only be used a smaller percentage of the use
cases, isn't warranted. Going from 64 bytes to 72 bytes also sounds
like this could have negative effects on cache lines.

If you need a per object cache, you can either use weakref
objects or maintain a separate dictionary in dateutil or other
timezone helpers which indexes objects by id(obj).

That said, if you only add a byte field which doesn't make the object
larger in practice (you merely use space that alignments would
use anyway), this shouldn't be a problem. The use of that field
should be documented, though, so that other implementations can
use/provide it as well.

Marc-Andre Lemburg

Professional Python Services directly from the Experts (#1, May 10 2019)
>>> Python Projects, Coaching and Consulting ...  http://www.egenix.com/
>>> Python Database Interfaces ...           http://products.egenix.com/
>>> Plone/Zope Database Interfaces ...           http://zope.egenix.com/

::: We implement business ideas - efficiently in both time and costs :::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611

> On 5/9/19 8:14 PM, Victor Stinner wrote:
>> Hi Paul,
>> The change is basically an optimization. I'm uncomfortable to design
>> it "only" for dateutil. What if tomorrow someone has to store an
>> arbitrary Python object, rather than just an integer (in range [0;
>> 254]), into a datetime for a different optimization?
>> Moreover, I dislike adding a *public* method for an *internal* cache.
>> Right now, it is not possible to create a weak reference to a
>> datetime. If we make it possible, it would be possible to have an
>> external cache implemented with weakref.WeakSet to clear old entries
>> when a datetime object is detroyed.
>> What do you think of adding a private "_cache" attribute which would
>> be an arbitrary Python object? (None by default)
>> Victor
>> Le mar. 7 mai 2019 à 21:46, Paul Ganssle <paul at ganssle.io> a écrit :
>>> Greetings all,
>>> I have one last feature request that I'd like added to datetime for Python 3.8, and this one I think could use some more discussion, the addition of a "time zone index cache" to the datetime object. The rationale is laid out in detail in bpo-35723. The general problem is that currently, every invocation of utcoffset, tzname and dst needs to do full, independent calculations of the time zone offsets, even for time zones where the mapping is guaranteed to be stable because datetimes are immutable. I have a proof of concept implementation: PR #11529.
>>> I'm envisioning that the `datetime` class will add a private `_tzidx` single-byte member (it seems that this does not increase the size of the datetime object, because it's just using an unused alignment byte). `datetime` will also add a `tzidx()` method, which will return `_tzidx` if it's been set and otherwise it will call `self.tzinfo.tzidx()`.  If `self.tzinfo.tzidx()` returns a number between 0 and 254 (inclusive), it sets `_tzidx` to this value. tzidx() then returns whatever self.tzinfo.tzidx() returned.
>>> The value of this is that as far as I can tell, nearly all non-trivial tzinfo implementations construct a list of possible offsets, and implement utcoffset(), tzname() and dst() by calculating an index into that list and returning it. There are almost always less than 255 distinct offsets. By adding this cache on the datetime, we're using a small amount of currently-unused memory to prevent unnecessary calculations about a given datetime. The feature is entirely opt-in, and has no downsides if it goes unused, and it makes it possible to write tzinfo implementations that are both lazy and as fast as the "eager calculation" mode that pytz uses (and that causes many problems for pytz's users).
>>> I have explored the idea of using an lru cache of some sort on the tzinfo object itself, but there are two problems with this:
>>> 1. Calculating the hash of a datetime calls .utcoffset(), which means that it is necessary to, at minimum, do a `replace` on the datetime (and constructing a new datetime is a pretty considerable speed hit)
>>> 2. It will be a much bigger memory cost, since my current proposal uses approximately zero additional memory (not sure if the alignment stuff is platform-dependent or something, but it doesn't use additional memory on my linux computer).
>>> I realize this proposal is somewhat difficult to wrap your head around, so if anyone would like to chat with me about it in person, I'll be at PyCon sprints until Thursday morning.
>>> Best,
>>> Paul
>>> _______________________________________________
>>> Python-Dev mailing list
>>> Python-Dev at python.org
>>> https://mail.python.org/mailman/listinfo/python-dev
>>> Unsubscribe: https://mail.python.org/mailman/options/python-dev/vstinner%40redhat.com
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe: https://mail.python.org/mailman/options/python-dev/mal%40egenix.com

More information about the Python-Dev mailing list