On 27 October 2017 at 01:45, Thomas Kluyver <thomas@kluyver.me.uk> wrote:

> On Thu, Oct 26, 2017, at 03:57 PM, Daniel Holth wrote:
>> Would something as simple as a file per sys.path with the 'last modified by installer' date be helpful? You could check those to determine whether your cache was out of date.
>
> I wonder if we could use the directory mtime for this? It's only really useful if we can be confident that all installer tools will update it.

There are lots of options for this, and one thing worth keeping in mind is compatibility with the "monolithic system package" model, where the entire preconfigured virtual environment gets archived, and then dropped into place on the target system. In such cases, filesystem level mtimes may change *without* the entry point cache actually being out of date.

In that context, it's worth keeping in mind what the actual goals of the cache will be:

1. The entry point cache should ideally reflect the state of installed components in a given execution environment at the time of access. If this is not true, installing a component may require explicit cache invalidation/rebuilding to get things back to a consistent state (similar to the way a call to importlib.invalidate_caches() is needed to reliably see filesystem changes).
2. Checking for available entry points in a given group should be consistently cheap (ideally O(1)), rather than scaling with the number of packages installed or the number of sys.path entries.
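
To make goal 2 concrete, here is a minimal sketch of what "O(1) per group" could look like once the linear read has been paid for: the INI-style entry_points.txt payloads are parsed once and merged into a dict keyed by group name. The function names (parse_entry_points, build_group_index) are made up for illustration, and using configparser for the parsing is an assumption on my part rather than a description of what any existing tool does.

```python
import configparser
from collections import defaultdict


def parse_entry_points(text):
    """Parse one entry_points.txt payload into {group: {name: target}}."""
    # Only "=" separates names from targets; targets contain ":" themselves.
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str  # keep entry point names case-sensitive
    parser.read_string(text)
    return {group: dict(parser.items(group)) for group in parser.sections()}


def build_group_index(texts):
    """Pay the linear cost once: merge every payload into a per-group index."""
    index = defaultdict(list)
    for text in texts:
        for group, entries in parse_entry_points(text).items():
            index[group].extend(sorted(entries.items()))
    return dict(index)
```

After build_group_index() has run, looking up a group is a single dict access, no matter how many packages contributed to the index.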

Given those goals, there are a number of different points in time where the cache can be generated, each with different trade-offs between how reliably fresh the cache is, and how frequently you have to rebuild the cache.

Option 1: in-memory cache

* Pro: consistent with the way importlib caches work
* Pro: automatically adjusts to sys.path changes
* Pro: will likely be needed regardless to handle per-path-entry caches with other methods
* Con: every process incurs at least one linear DB read (a scan of the installed package metadata)
* Con: zero pay-off if you only query one entry point group
* Con: requires explicit invalidation to pick up filesystem changes (but can hook into importlib.invalidate_caches())
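
As a rough sketch of option 1 (not an existing API): importlib.invalidate_caches() calls invalidate_caches() on every finder in sys.meta_path that defines it, so one way to hook in is to register the cache object as a meta path finder that never finds anything. The EntryPointCache class, the scan callable and the register() helper are all hypothetical names for illustration.

```python
import importlib
import sys


class EntryPointCache:
    """Hypothetical in-memory entry point cache, keyed by sys.path entry."""

    def __init__(self, scan):
        # scan is an assumed callable: path entry -> {group: [entry points]}
        self._scan = scan
        self._by_path = {}

    def get_group(self, group):
        results = []
        for entry in sys.path:  # consult sys.path at query time
            if entry not in self._by_path:
                self._by_path[entry] = self._scan(entry)  # the linear read
            results.extend(self._by_path[entry].get(group, ()))
        return results

    # Meta path finder protocol: never claims a module, so imports are unaffected.
    def find_spec(self, fullname, path=None, target=None):
        return None

    # Called by importlib.invalidate_caches() once the cache is registered.
    def invalidate_caches(self):
        self._by_path.clear()


def register(cache):
    """Append the cache to sys.meta_path so invalidation reaches it."""
    sys.meta_path.append(cache)
```

With something like this, code that already calls importlib.invalidate_caches() after installing a package would also drop the stale entry point data, and the next get_group() call would rescan.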

Option 2: temporary (or persistent) per-user-session cache

* Pro: only the first query per path entry per user session incurs a linear DB read
* Pro: given persistent cache dirs (e.g. XDG_CACHE_HOME, ~/.cache) even that overhead can be avoided
* Pro: sys.path directory mtimes are sufficient for cache invalidation (subject to filesystem timestamp granularity)
* Pro: zero elevated privileges needed (cache would be stored in a per-user directory tree)
* Con: interprocess locking likely needed to avoid the "thundering herd" cache update problem [1]
* Con: if a non-persistent storage location is used, zero benefit over an in-memory cache for throwaway environments (e.g. container startup)
* Con: cost of the cache freshness check will still scale linearly with the number of sys.path entries
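
A minimal sketch of the option 2 idea, assuming a JSON file in a per-user cache directory and directory mtimes as the freshness check. The file location, the "entry-point-cache-demo" name, the JSON layout and the scan callable are all illustrative assumptions. The temp-file-plus-rename step only avoids readers seeing a half-written file; it does not address the thundering herd problem.

```python
import json
import os
import tempfile
from pathlib import Path


def user_cache_file():
    """Hypothetical per-user location: XDG_CACHE_HOME, falling back to ~/.cache."""
    base = os.environ.get("XDG_CACHE_HOME") or os.path.join(
        os.path.expanduser("~"), ".cache")
    return Path(base) / "entry-point-cache-demo" / "cache.json"


def _mtime_ns(entry):
    try:
        return os.stat(entry).st_mtime_ns
    except OSError:
        return None  # missing or unreadable path entry


def cached_groups(entry, scan, cache_file):
    """Return the cached data for one path entry, rescanning if its mtime changed."""
    try:
        cache = json.loads(cache_file.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        cache = {}  # no cache yet, or unreadable: start over
    mtime = _mtime_ns(entry)
    record = cache.get(entry)
    if record is not None and record["mtime_ns"] == mtime:
        return record["groups"]  # fresh: no linear read needed
    groups = scan(entry)  # stale or missing: pay the linear read
    cache[entry] = {"mtime_ns": mtime, "groups": groups}
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=cache_file.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(cache, f)
    os.replace(tmp, cache_file)  # atomic swap into place
    return groups
```

Note that the freshness check is still one stat() call per sys.path entry, which is the linear scaling mentioned in the last con above.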

Option 3: persistent per-path-entry cache

* Pro: if the cache is assumed to be fresh, no runtime query incurs a linear DB read (cache creation becomes an install time cost)
* Con: if you don't assume cache freshness, you need option 1 or 2 anyway, and the install time cache just speeds up that first linear read
* Con: filesystem access control (processes that can read a path entry often can't write to it) means the cache needs either an explicit refresh step or implicit metadata caching support in installers
* Con: sys.path directory mtimes are no longer sufficient for cache invalidation (due to potential for directory relocation)
* Con: interprocess locking arguably still needed to avoid the "thundering herd" cache update problem (just between installers rather than runtime processes)
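
For the interprocess locking concern (which applies to option 2 as well), one simple approach is a lock file created with O_CREAT | O_EXCL, so only one process wins the right to rebuild while the others wait and then reread the result. This is only a sketch: rebuild_lock and its parameters are hypothetical names, and it makes no attempt to recover from stale locks left behind by a crashed process.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def rebuild_lock(lock_path, timeout=10.0, poll=0.05):
    """Yield True if this process should rebuild the cache, False otherwise."""
    try:
        # Atomic "create only if absent": exactly one process succeeds.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        # Someone else is rebuilding: wait for them to finish (or time out).
        deadline = time.monotonic() + timeout
        while os.path.exists(lock_path) and time.monotonic() < deadline:
            time.sleep(poll)
        yield False
        return
    try:
        os.close(fd)
        yield True
    finally:
        os.unlink(lock_path)  # release the lock even if the rebuild fails
```

A caller would wrap the rebuild in "with rebuild_lock(path) as should_rebuild:" and only regenerate the cache when should_rebuild is true, rereading the existing cache file otherwise.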

Given those trade-offs, I think it would probably make the most sense to start out by exploring a combination of options 1 & 2, and then only explore option 3 based on demonstrated performance problems with a per-user-session caching model. My rationale for that is that even in an image based "immutable infrastructure" deployment model, it's often entirely feasible to preseed runtime caches as part of the build process, and in cases where that *isn't* possible, you're likely also going to have trouble generating per-path-entry caches.


[1] https://en.wikipedia.org/wiki/Thundering_herd_problem

Nick Coghlan   |   ncoghlan@gmail.com   |   Brisbane, Australia