On 27 October 2017 at 18:10, Nathaniel Smith email@example.com wrote:
On Thu, Oct 26, 2017 at 9:02 PM, Nick Coghlan firstname.lastname@example.org wrote:
Option 2: temporary (or persistent) per-user-session cache
- Pro: only the first query per path entry per user session incurs a
- Pro: given persistent cache dirs (e.g. XDG_CACHE_HOME, ~/.cache) even
overhead can be avoided
- Pro: sys.path directory mtimes are sufficient for cache invalidation
(subject to filesystem timestamp granularity)
Timestamp granularity is a solvable problem. You just have to be careful not to write out the cache unless the directory mtime is sufficiently far in the past, like 10 seconds old, say. (This is an old trick that VCSes use to make commands like 'git status' fast-and-reliable.)
Yeah, we just recently fixed a bug related to that in pyc file caching (If you managed to modify and reload a source file multiple times in the same second we could end up missing the later edits. The fix was to check the source timestamp didn't match the current timestamp before actually updating the cached copy on the filesystem)
This does mean you can get in a weird state where if the directory mtime somehow gets set to the future, then start time starts sucking because caching goes away.
For pyc files, we're able to avoid that by looking for cache *inconsistency* without making any assumptions about which direction time moves - as long as the source timestamp recorded in the file pyc doesn't match the source file's mtime, we'll refresh the cache.
This is necessary to cope with things like version controlled directories, where directory mtimes can easily go backwards because you switched branches or reverted to an earlier version.
Note also that you'll want to explicitly write the observed directory mtime to the cache file, rather than comparing it to the cache file's mtime, to avoid the race condition where the directory gets modified just after we scan it but before we write out the cache.
- Pro: zero elevated privileges needed (cache would be stored in a
- Con: interprocess locking likely needed to avoid the "thundering herd"
cache update problem 
Interprocess filesystem locking is going to be far more painful than any problem it might solve. Seriously. At least on Unix, the right approach is to go ahead and regenerate the cache, and then atomically write it to the given place, and if someone else overwrites it a few milliseconds later then oh well.
Aye, limiting the handling for this to the use of atomic writes is likely an entirely reasonable approach to take.
I guess on Windows locking might be OK, given that it has no atomic writes and less gratuitously broken filesystem locking.
The os module has atomic write support on Windows in 3.x now: https://docs.python.org/3/library/os.html#os.replace
So the only problematic case is 2.7 on WIndows, and for that Christian Heimes backported pyosreplace here: https://pypi.org/project/pyosreplace/
(The "may be non-atomic" case is the same situation where it will fail outright on POSIX systems: when you're attempting to do the rename across filesystems. If you stay within the same directory, which you want to do anyway for permissions inheritance and automatic file labeling, it's atomic).
still want to make sure you never block when acquiring the lock; if the lock is already taken because someone else is in the middle of updating the cache, then you need to fall back on doing a linear scan. This is explicitly *not* avoiding the thundering herd problem, because it's more important to avoid the "one process got stuck and now everyone else freezes on startup waiting for it" problem.
- Con: if a non-persistent storage location is used, zero benefit over an
in-memory cache for throwaway environments (e.g. container startup)
You also have to be careful about whether you have a writeable storage location at all, and if so whether you have the right permissions. (It might be bad if 'sudo somescript.py' leaves me with root-owned cache files in /home/njs/.cache/.)
Filesystems are just a barrel of fun.
C'mon, who doesn't enjoy debugging SELinux file labeling problems arising from mounting symlinked host directories into Docker containers running as root internally? :)
- Con: cost of the cache freshness check will still scale linearly with
number of sys.path entries
Option 3: persistent per-path-entry cache
- Pro: assuming cache freshness means zero runtime queries incur a
read (cache creation becomes an install time cost)
- Con: if you don't assume cache freshness, you need option 1 or 2
and the install time cache just speeds up that first linear read
- Con: filesystem access control requires either explicit cache refresh
implicit metadata caching support in installers
- Con: sys.path directory mtimes are no longer sufficient for cache
invalidation (due to potential for directory relocation)
Not sure what problem you're thinking of here? In this model we wouldn't be using mtimes for cache invalidation anyway, because it'd be the responsibility of those modifying the directory to update the cache. And if you rename a whole directory, that doesn't affect its mtime anyway?
Your second sentence is what I meant - whether the cache is still valid or not is less about the mtime, and more about what other actions have been performed. (It's much closer to the locate/updatedb model, where the runtime part just assumes the cache is valid, and it's somebody else's problem to ensure that assumption is reasonably valid)
- Con: interprocess locking arguably still needed to avoid the
herd" cache update problem (just between installers rather than runtime processes)
If two installers are trying to rearrange the same directory at the same time then they can conflict in lots of ways. For the most part people get away with it because doing multiple 'pip install' runs in parallel is generally considered a Bad Idea and unlikely to happen by accident; and if it is a problem then we should add locking anyway (like dpkg and rpm already do), regardless of the cache update part.
Another fair point.