On 27 October 2017 at 18:10, Nathaniel Smith <njs@pobox.com> wrote:
On Thu, Oct 26, 2017 at 9:02 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
> Option 2: temporary (or persistent) per-user-session cache
>
> * Pro: only the first query per path entry per user session incurs a linear
> DB read
> * Pro: given persistent cache dirs (e.g. XDG_CACHE_HOME, ~/.cache) even that
> overhead can be avoided
> * Pro: sys.path directory mtimes are sufficient for cache invalidation
> (subject to filesystem timestamp granularity)

Timestamp granularity is a solvable problem. You just have to be
careful not to write out the cache unless the directory mtime is
sufficiently far in the past, like 10 seconds old, say. (This is an
old trick that VCSes use to make commands like 'git status'
fast-and-reliable.)

Yeah, we just recently fixed a bug related to that in pyc file caching: if you managed to modify and reload a source file multiple times within the same second, we could end up missing the later edits. The fix was to check that the source timestamp didn't match the current timestamp before actually updating the cached copy on the filesystem.
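
For illustration, here's a minimal sketch of that "only trust old enough mtimes" guard (the settle window and helper name are hypothetical, not taken from any actual implementation):

    import os
    import time

    MTIME_SETTLE_SECONDS = 10  # the VCS trick: only trust mtimes old enough to be stable

    def safe_to_cache(dirpath):
        # If the directory changed within the settle window, another
        # write could still land in the same timestamp granule without
        # bumping the mtime, so skip writing the cache this time around.
        age = time.time() - os.stat(dirpath).st_mtime
        return age >= MTIME_SETTLE_SECONDS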
 
This does mean you can get into a weird state where, if the directory
mtime somehow gets set to the future, startup starts sucking because
caching goes away.

For pyc files, we're able to avoid that by looking for cache *inconsistency* without making any assumptions about which direction time moves - as long as the source timestamp recorded in the pyc file doesn't match the source file's mtime, we'll refresh the cache.

This is necessary to cope with things like version controlled directories, where directory mtimes can easily go backwards because you switched branches or reverted to an earlier version.
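
A rough sketch of that equality-based check (the helper name is hypothetical):

    import os

    def cache_is_stale(source_path, recorded_mtime):
        # Direction-agnostic: any mismatch, whether time moved forwards
        # or backwards (branch switch, revert), forces a refresh.
        return os.stat(source_path).st_mtime != recorded_mtime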
 

Note also that you'll want to explicitly write the observed directory
mtime to the cache file, rather than comparing it to the cache file's
mtime, to avoid the race condition where the directory gets modified
just after we scan it but before we write out the cache.
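
In sketch form (hypothetical cache layout), that means stat-ing the directory before scanning it and embedding the observed mtime in the cache payload:

    import os

    def build_cache(dirpath):
        # Stat *before* scanning: if the directory changes mid-scan or
        # just afterwards, the stored mtime won't match on the next
        # freshness check and the cache will be rebuilt.
        observed_mtime = os.stat(dirpath).st_mtime
        return {"dir_mtime": observed_mtime, "entries": sorted(os.listdir(dirpath))}

    def cache_is_fresh(dirpath, cache):
        # Compare against the mtime stored in the payload, never the
        # cache file's own mtime.
        return os.stat(dirpath).st_mtime == cache["dir_mtime"]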

> * Pro: zero elevated privileges needed (cache would be stored in a per-user
> directory tree)
> * Con: interprocess locking likely needed to avoid the "thundering herd"
> cache update problem [1]

Interprocess filesystem locking is going to be far more painful than
any problem it might solve. Seriously. At least on Unix, the right
approach is to go ahead and regenerate the cache, and then atomically
write it to the given place, and if someone else overwrites it a few
milliseconds later then oh well.

Aye, limiting the handling for this to atomic writes is likely an entirely reasonable approach to take.
 
I guess on Windows locking might be OK, given that it has no atomic
writes and less gratuitously broken filesystem locking.

The os module has atomic rename support on Windows in 3.x now (os.replace, added in 3.3): https://docs.python.org/3/library/os.html#os.replace

So the only problematic case is 2.7 on Windows, and for that Christian Heimes backported pyosreplace here: https://pypi.org/project/pyosreplace/

(The "may be non-atomic" case is the same situation where it will fail outright on POSIX systems: when you're attempting to do the rename across filesystems. If you stay within the same directory, which you want to do anyway for permissions inheritance and automatic file labeling, it's atomic).

But you'd
still want to make sure you never block when acquiring the lock; if
the lock is already taken because someone else is in the middle of
updating the cache, then you need to fall back on doing a linear scan.
This is explicitly *not* avoiding the thundering herd problem, because
it's more important to avoid the "one process got stuck and now
everyone else freezes on startup waiting for it" problem.

Fair point.
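
A Unix-flavoured sketch of that "never block" shape (fcntl is purely illustrative here, and linear_scan / refresh_cache / read_cache are hypothetical stand-ins):

    import fcntl

    def load_metadata(lock_path):
        with open(lock_path, "a") as lock_file:
            try:
                # Non-blocking attempt: fails immediately if held.
                fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except BlockingIOError:
                # Someone else is mid-update; don't wait for them (they
                # may be stuck), just fall back to the linear scan.
                return linear_scan()
            try:
                refresh_cache()
                return read_cache()
            finally:
                fcntl.flock(lock_file, fcntl.LOCK_UN)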
 
> * Con: if a non-persistent storage location is used, zero benefit over an
> in-memory cache for throwaway environments (e.g. container startup)

You also have to be careful about whether you have a writeable storage
location at all, and if so whether you have the right permissions. (It
might be bad if 'sudo somescript.py' leaves me with root-owned cache
files in /home/njs/.cache/.)

Filesystems are just a barrel of fun.

C'mon, who doesn't enjoy debugging SELinux file labeling problems arising from mounting symlinked host directories into Docker containers running as root internally? :)
 

> * Con: cost of the cache freshness check will still scale linearly with the
> number of sys.path entries
>
> Option 3: persistent per-path-entry cache
>
> * Pro: assuming cache freshness means zero runtime queries incur a linear DB
> read (cache creation becomes an install time cost)
> * Con: if you don't assume cache freshness, you need option 1 or 2 anyway,
> and the install time cache just speeds up that first linear read
> * Con: filesystem access control requires either explicit cache refresh or
> implicit metadata caching support in installers
> * Con: sys.path directory mtimes are no longer sufficient for cache
> invalidation (due to potential for directory relocation)

Not sure what problem you're thinking of here? In this model we
wouldn't be using mtimes for cache invalidation anyway, because it'd
be the responsibility of those modifying the directory to update the
cache. And if you rename a whole directory, that doesn't affect its
mtime anyway?

Your second sentence is what I meant - whether the cache is still valid or not is less about the mtime and more about what other actions have been performed. (It's much closer to the locate/updatedb model, where the runtime part just assumes the cache is valid, and it's somebody else's problem to ensure that assumption is reasonably sound.)
 
> * Con: interprocess locking arguably still needed to avoid the "thundering
> herd" cache update problem (just between installers rather than runtime
> processes)

If two installers are trying to rearrange the same directory at the
same time then they can conflict in lots of ways. For the most part
people get away with it because doing multiple 'pip install' runs in
parallel is generally considered a Bad Idea and unlikely to happen by
accident; and if it is a problem then we should add locking anyway
(like dpkg and rpm already do), regardless of the cache update part.

Another fair point.

Cheers,
Nick.

--
Nick Coghlan   |   ncoghlan@gmail.com   |   Brisbane, Australia