[Distutils] Entry points: specifying and caching

Nick Coghlan ncoghlan at gmail.com
Fri Oct 27 00:02:52 EDT 2017

On 27 October 2017 at 01:45, Thomas Kluyver <thomas at kluyver.me.uk> wrote:

> On Thu, Oct 26, 2017, at 03:57 PM, Daniel Holth wrote:
> > Would something as simple as a file per sys.path with the 'last modified
> > by installer' date be helpful? You could check those to determine whether
> > your cache was out of date.
>
> I wonder if we could use the directory mtime for this? It's only really
> useful if we can be confident that all installer tools will update it.

There are lots of options for this, and one thing worth keeping in mind is
compatibility with the "monolithic system package" model, where the entire
preconfigured virtual environment gets archived, and then dropped into
place on the target system. In such cases, filesystem level mtimes may
change *without* the entry point cache actually being out of date.

In that context, it's worth keeping in mind what the actual goals of the
cache will be:

1. The entry point cache should ideally reflect the state of installed
components in a given execution environment at the time of access. If this
is not true, installing a component may require explicit cache
invalidation/rebuilding to get things back to a consistent state (similar
to the way a call to importlib.invalidate_caches() is needed to reliably
see filesystem changes)
2. Checking for available entry points in a given group should be
consistently cheap (ideally O(1)), rather than scaling with the number of
packages installed or the number of sys.path entries

Given those goals, there are a number of different points in time where the
cache can be generated, each with different trade-offs between how reliably
fresh the cache is, and how frequently you have to rebuild the cache.

Option 1: in-memory cache

* Pro: consistent with the way importlib caches work
* Pro: automatically adjusts to sys.path changes
* Pro: will likely be needed regardless to handle per-path-entry caches
with other methods
* Con: every process incurs at least 1 linear DB read
* Con: zero pay-off if you only query one entry point group
* Con: requires explicit invalidation to pick up filesystem changes (but
can hook into importlib.invalidate_caches())
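
As a rough sketch of what option 1 might look like (the class name, the
register() helper and the scan_path_entry callable are all hypothetical,
and the scan itself, i.e. the linear read of the installed metadata, is
left to the caller):

```python
import importlib
import sys
from collections import defaultdict


class EntryPointMemoryCache:
    """Per-process cache, keyed by sys.path entry (hypothetical sketch)."""

    def __init__(self, scan_path_entry):
        # scan_path_entry is an assumed callable: it takes one path entry
        # and yields (group, name, target) tuples. This is the linear read.
        self._scan = scan_path_entry
        self._by_path = {}

    def _load(self, path_entry):
        # Scan each path entry at most once per process.
        if path_entry not in self._by_path:
            index = defaultdict(dict)
            for group, name, target in self._scan(path_entry):
                index[group][name] = target
            self._by_path[path_entry] = index
        return self._by_path[path_entry]

    def get_group(self, group, path=None):
        # Reading sys.path on every call means path changes are picked up
        # automatically. Earlier path entries shadow later ones.
        merged = {}
        for entry in reversed(sys.path if path is None else path):
            merged.update(self._load(entry).get(group, {}))
        return merged

    # The two methods below let the cache sit on sys.meta_path so that
    # importlib.invalidate_caches() also clears it.
    def find_spec(self, fullname, path=None, target=None):
        return None  # never handles imports itself

    def invalidate_caches(self):
        self._by_path.clear()


def register(cache):
    # importlib.invalidate_caches() calls invalidate_caches() on every
    # finder in sys.meta_path that defines it.
    sys.meta_path.append(cache)
```

Once a path entry has been scanned, a group query is a dict lookup per
path entry, but the first query in every process still pays for the scan,
which is the first "Con" above.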

Option 2: temporary (or persistent) per-user-session cache

* Pro: only the first query per path entry per user session incurs a linear
DB read
* Pro: given persistent cache dirs (e.g. XDG_CACHE_HOME, ~/.cache) even
that overhead can be avoided
* Pro: sys.path directory mtimes are sufficient for cache invalidation
(subject to filesystem timestamp granularity)
* Pro: zero elevated privileges needed (cache would be stored in a per-user
directory tree)
* Con: interprocess locking likely needed to avoid the "thundering herd"
cache update problem [1]
* Con: if a non-persistent storage location is used, zero benefit over an
in-memory cache for throwaway environments (e.g. container startup)
* Con: cost of the cache freshness check will still scale linearly with the
number of sys.path entries
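
To make option 2 a bit more concrete, here is a minimal sketch of a
persistent per-user cache that uses the path entry's directory mtime as
the freshness check. The cache directory name, the JSON layout and the
scan_path_entry callable are assumptions made up for illustration, not an
existing API:

```python
import hashlib
import json
import os
import tempfile


def cache_dir():
    # Per-user location, so no elevated privileges are needed.
    base = os.environ.get("XDG_CACHE_HOME") or os.path.join(
        os.path.expanduser("~"), ".cache")
    return os.path.join(base, "entry-points-sketch")  # hypothetical name


def cache_file_for(path_entry, root):
    # One cache file per sys.path entry, named after a hash of its path.
    digest = hashlib.sha256(
        os.path.abspath(path_entry).encode("utf-8")).hexdigest()
    return os.path.join(root, digest + ".json")


def load_entry_points(path_entry, scan_path_entry, root=None):
    """Return {group: {name: target}} for one path entry.

    scan_path_entry is an assumed callable doing the linear metadata read
    and returning a JSON-serialisable dict.
    """
    root = root or cache_dir()
    mtime_ns = os.stat(path_entry).st_mtime_ns
    cache_file = cache_file_for(path_entry, root)
    try:
        with open(cache_file, encoding="utf-8") as f:
            cached = json.load(f)
        if cached["mtime_ns"] == mtime_ns:
            return cached["groups"]  # cache is fresh, no scan needed
    except (OSError, ValueError, KeyError, TypeError):
        pass  # missing or unreadable cache: fall through and rebuild

    groups = scan_path_entry(path_entry)
    os.makedirs(root, exist_ok=True)
    # Write to a temporary file and rename it into place, so readers never
    # see a partially written cache file.
    fd, tmp_name = tempfile.mkstemp(dir=root, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump({"mtime_ns": mtime_ns, "groups": groups}, f)
    os.replace(tmp_name, cache_file)
    return groups
```

Note that this still does one stat() per sys.path entry on every lookup
(the last "Con" above), and the atomic rename only protects readers from
partial files. It does nothing to stop several processes from rebuilding
the same cache at once, so the locking question remains open.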

Option 3: persistent per-path-entry cache

* Pro: if the cache is assumed to be fresh, no runtime query incurs a linear
DB read (cache creation becomes an install time cost)
* Con: if you don't assume cache freshness, you need option 1 or 2 anyway,
and the install time cache just speeds up that first linear read
* Con: filesystem access control requires either explicit cache refresh or
implicit metadata caching support in installers
* Con: sys.path directory mtimes are no longer sufficient for cache
invalidation (due to potential for directory relocation)
* Con: interprocess locking arguably still needed to avoid the "thundering
herd" cache update problem (just between installers rather than runtime
processes)
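
For comparison, a sketch of option 3 where the cache file lives inside the
path entry itself. Since mtimes don't survive relocation, this version
checks freshness against a fingerprint of the metadata directory names
instead. That fingerprint scheme, the cache file name and the function
names are purely my own assumptions for illustration, not something any
installer does today:

```python
import hashlib
import json
import os

CACHE_NAME = "entry_points_cache.json"  # hypothetical file name


def metadata_fingerprint(path_entry):
    # Hash the sorted names of the metadata directories. Unlike an mtime,
    # this is unchanged when the whole tree is archived and unpacked
    # somewhere else.
    names = sorted(
        name for name in os.listdir(path_entry)
        if name.endswith((".dist-info", ".egg-info")))
    return hashlib.sha256("\n".join(names).encode("utf-8")).hexdigest()


def write_install_time_cache(path_entry, groups):
    # Intended to be called by an installer (or an explicit refresh
    # command) that has write access to the path entry.
    payload = {
        "fingerprint": metadata_fingerprint(path_entry),
        "groups": groups,
    }
    tmp_name = os.path.join(path_entry, CACHE_NAME + ".tmp")
    with open(tmp_name, "w", encoding="utf-8") as f:
        json.dump(payload, f)
    os.replace(tmp_name, os.path.join(path_entry, CACHE_NAME))


def read_install_time_cache(path_entry):
    # Returns the cached groups, or None if the cache is missing or stale,
    # in which case the caller falls back to option 1 or 2.
    try:
        with open(os.path.join(path_entry, CACHE_NAME),
                  encoding="utf-8") as f:
            payload = json.load(f)
        if payload["fingerprint"] == metadata_fingerprint(path_entry):
            return payload["groups"]
    except (OSError, ValueError, KeyError, TypeError):
        pass
    return None
```

The catch is that the fingerprint check needs a directory listing, so it
scales with the number of installed distributions and is only cheaper
than a full scan, not O(1). Skipping the check entirely is the "assume
cache freshness" case from the first bullet.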

Given those trade-offs, I think it would probably make the most sense to
start out by exploring a combination of options 1 & 2, and then only
explore option 3 based on demonstrated performance problems with a
per-user-session caching model. My rationale for that is that even in an
image based "immutable infrastructure" deployment model, it's often
entirely feasible to preseed runtime caches as part of the build process,
and in cases where that *isn't* possible, you're likely also going to have
trouble generating per-path-entry caches.


[1] https://en.wikipedia.org/wiki/Thundering_herd_problem

Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia