On Thu, Oct 26, 2017 at 9:02 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Option 2: temporary (or persistent) per-user-session cache
* Pro: only the first query per path entry per user session incurs a linear DB read * Pro: given persistent cache dirs (e.g. XDG_CACHE_HOME, ~/.cache) even that overhead can be avoided * Pro: sys.path directory mtimes are sufficient for cache invalidation (subject to filesystem timestamp granularity)
Timestamp granularity is a solvable problem. You just have to be careful not to write out the cache unless the directory mtime is sufficiently far in the past, like 10 seconds old, say. (This is an old trick that VCSes use to make commands like 'git status' fast-and-reliable.) This does mean you can get in a weird state where if the directory mtime somehow gets set to the future, then start time starts sucking because caching goes away. Note also that you'll want to explicitly write the observed directory mtime to the cache file, rather than comparing it to the cache file's mtime, to avoid the race condition where the directory gets modified just after we scan it but before we write out the cache.
* Pro: zero elevated privileges needed (cache would be stored in a per-user directory tree) * Con: interprocess locking likely needed to avoid the "thundering herd" cache update problem [1]
Interprocess filesystem locking is going to be far more painful than any problem it might solve. Seriously. At least on Unix, the right approach is to go ahead and regenerate the cache, and then atomically write it to the given place, and if someone else overwrites it a few milliseconds later then oh well. I guess on Windows locking might be OK, given that it has no atomic writes and less gratuitously broken filesystem locking. But you'd still want to make sure you never block when acquiring the lock; if the lock is already taken because someone else is in the middle of updating the cache, then you need to fall back on doing a linear scan. This is explicitly *not* avoiding the thundering herd problem, because it's more important to avoid the "one process got stuck and now everyone else freezes on startup waiting for it" problem.
* Con: if a non-persistent storage location is used, zero benefit over an in-memory cache for throwaway environments (e.g. container startup)
You also have to be careful about whether you have a writeable storage location at all, and if so whether you have the right permissions. (It might be bad if 'sudo somescript.py' leaves me with root-owned cache files in /home/njs/.cache/.) Filesystems are just a barrel of fun.
* Con: cost of the cache freshness check will still scale linearly with the number of sys.path entries
Option 3: persistent per-path-entry cache
* Pro: assuming cache freshness means zero runtime queries incur a linear DB read (cache creation becomes an install time cost) * Con: if you don't assume cache freshness, you need option 1 or 2 anyway, and the install time cache just speeds up that first linear read * Con: filesystem access control requires either explicit cache refresh or implicit metadata caching support in installers * Con: sys.path directory mtimes are no longer sufficient for cache invalidation (due to potential for directory relocation)
Not sure what problem you're thinking of here? In this model we wouldn't be using mtimes for cache invalidation anyway, because it'd be the responsibility of those modifying the directory to update the cache. And if you rename a whole directory, that doesn't affect its mtime anyway?
* Con: interprocess locking arguably still needed to avoid the "thundering herd" cache update problem (just between installers rather than runtime processes)
If two installers are trying to rearrange the same directory at the same time then they can conflict in lots of ways. For the most part people get away with it because doing multiple 'pip install' runs in parallel is generally considered a Bad Idea and unlikely to happen by accident; and if it is a problem then we should add locking anyway (like dpkg and rpm already do), regardless of the cache update part. -n -- Nathaniel J. Smith -- https://vorpus.org