[Distutils] Entry points: specifying and caching

Nick Coghlan ncoghlan at gmail.com
Fri Oct 27 08:34:12 EDT 2017


On 27 October 2017 at 18:10, Nathaniel Smith <njs at pobox.com> wrote:

> On Thu, Oct 26, 2017 at 9:02 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> > Option 2: temporary (or persistent) per-user-session cache
> >
> > * Pro: only the first query per path entry per user session incurs a
> linear
> > DB read
> > * Pro: given persistent cache dirs (e.g. XDG_CACHE_HOME, ~/.cache) even
> that
> > overhead can be avoided
> > * Pro: sys.path directory mtimes are sufficient for cache invalidation
> > (subject to filesystem timestamp granularity)
>
> Timestamp granularity is a solvable problem. You just have to be
> careful not to write out the cache unless the directory mtime is
> sufficiently far in the past, like 10 seconds old, say. (This is an
> old trick that VCSes use to make commands like 'git status'
> fast-and-reliable.)
>

Yeah, we just recently fixed a bug related to that in pyc file caching: if
you managed to modify and reload a source file multiple times in the same
second, we could end up missing the later edits. The fix was to check that
the source timestamp didn't match the current timestamp before actually
updating the cached copy on the filesystem.
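
For the entry point cache, a guard along those lines would be enough. A
rough sketch only - the threshold and helper name are invented for
illustration, not taken from any actual implementation:

    import os
    import time

    MTIME_SETTLE_SECONDS = 10  # assumed threshold, per the "git status" trick

    def safe_to_cache(path_entry):
        """Only persist a directory snapshot if its mtime is old enough that
        a same-second modification can't slip past the timestamp check."""
        dir_mtime = os.stat(path_entry).st_mtime
        return (time.time() - dir_mtime) >= MTIME_SETTLE_SECONDS

If the directory was touched too recently, the scan result would still be
used for the current process, just not written back to the cache.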


> This does mean you can get in a weird state where if the directory
> mtime somehow gets set to the future, then start time starts sucking
> because caching goes away.
>

For pyc files, we're able to avoid that by looking for cache
*inconsistency* without making any assumptions about which direction time
moves - as long as the source timestamp recorded in the pyc file doesn't
match the source file's mtime, we'll refresh the cache.

This is necessary to cope with things like version controlled directories,
where directory mtimes can easily go backwards because you switched
branches or reverted to an earlier version.
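
The equivalent check for an entry point cache would be just as
direction-agnostic. A minimal sketch, assuming a hypothetical cache layout
that records the timestamp it saw at write time (this is not the pyc
format, just an illustration):

    import os

    def cache_is_stale(source_path, recorded_mtime):
        """Refresh whenever the recorded timestamp differs from the current
        one, in either direction, so time going backwards (branch switches,
        reverts) still invalidates the cache."""
        return os.stat(source_path).st_mtime != recorded_mtime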


>
> Note also that you'll want to explicitly write the observed directory
> mtime to the cache file, rather than comparing it to the cache file's
> mtime, to avoid the race condition where the directory gets modified
> just after we scan it but before we write out the cache.
>
> > * Pro: zero elevated privileges needed (cache would be stored in a
> per-user
> > directory tree)
> > * Con: interprocess locking likely needed to avoid the "thundering herd"
> > cache update problem [1]
>
> Interprocess filesystem locking is going to be far more painful than
> any problem it might solve. Seriously. At least on Unix, the right
> approach is to go ahead and regenerate the cache, and then atomically
> write it to the given place, and if someone else overwrites it a few
> milliseconds later then oh well.
>

Aye, limiting the handling for this to the use of atomic writes is likely
an entirely reasonable approach to take.


> I guess on Windows locking might be OK, given that it has no atomic
> writes and less gratuitously broken filesystem locking.


The os module has atomic write support on Windows in 3.x now:
https://docs.python.org/3/library/os.html#os.replace

So the only problematic case is 2.7 on Windows, and for that Christian
Heimes backported pyosreplace here: https://pypi.org/project/pyosreplace/

(The "may be non-atomic" case is the same situation where it will fail
outright on POSIX systems: when you're attempting to do the rename across
filesystems. If you stay within the same directory, which you want to do
anyway for permissions inheritance and automatic file labeling, it's
atomic).
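
Combining that with your earlier point about storing the observed
directory mtime *in* the cache payload (rather than relying on the cache
file's own mtime), the write path might look roughly like this. The cache
layout and helper name are made up for illustration - only tempfile and
os.replace are real APIs:

    import json
    import os
    import tempfile

    def write_cache(cache_path, path_entry, entry_points):
        """Record the directory mtime we actually scanned inside the payload
        (not as the cache file's own mtime), then publish atomically with
        os.replace so readers never see a partially written file."""
        payload = {
            "path_entry": path_entry,
            "dir_mtime": os.stat(path_entry).st_mtime,
            "entry_points": entry_points,
        }
        cache_dir = os.path.dirname(cache_path)
        # Temp file in the same directory => same filesystem => atomic rename
        fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
        try:
            with os.fdopen(fd, "w") as f:
                json.dump(payload, f)
            os.replace(tmp_path, cache_path)  # atomic on POSIX and 3.3+ Windows
        except Exception:
            os.unlink(tmp_path)
            raise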

But you'd
> still want to make sure you never block when acquiring the lock; if
> the lock is already taken because someone else is in the middle of
> updating the cache, then you need to fall back on doing a linear scan.
> This is explicitly *not* avoiding the thundering herd problem, because
> it's more important to avoid the "one process got stuck and now
> everyone else freezes on startup waiting for it" problem.
>

Fair point.
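
Concretely, I'd expect the read side to look something like the following
on POSIX (illustrative only - linear_scan and refresh_and_read_cache are
hypothetical placeholders for the slow scan and the cache update):

    import fcntl

    def load_entry_points(path_entry, lock_path, cache_path):
        """Try to take the cache lock without blocking; if someone else
        already holds it, skip the cache entirely and do the linear metadata
        scan so startup never stalls behind another process."""
        with open(lock_path, "w") as lock_file:
            try:
                fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except BlockingIOError:
                return linear_scan(path_entry)  # hypothetical fallback scanner
            try:
                return refresh_and_read_cache(path_entry, cache_path)  # hypothetical
            finally:
                fcntl.flock(lock_file, fcntl.LOCK_UN)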


> > * Con: if a non-persistent storage location is used, zero benefit over an
> > in-memory cache for throwaway environments (e.g. container startup)
>
> You also have to be careful about whether you have a writeable storage
> location at all, and if so whether you have the right permissions. (It
> might be bad if 'sudo somescript.py' leaves me with root-owned cache
> files in /home/njs/.cache/.)
>
> Filesystems are just a barrel of fun.
>

C'mon, who doesn't enjoy debugging SELinux file labeling problems arising
from mounting symlinked host directories into Docker containers running as
root internally? :)
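
More practically, the runtime could just refuse to cache when the target
directory isn't safely writable by the user doing the writing. A
POSIX-only sketch of that check (helper name invented for illustration):

    import os

    def cache_dir_is_usable(cache_dir):
        """Skip caching rather than risk littering someone else's home
        directory with files they can't remove (e.g. under sudo): only use
        the cache dir if it exists, is writable, and is owned by the
        effective uid doing the writing."""
        try:
            st = os.stat(cache_dir)
        except FileNotFoundError:
            return False
        return st.st_uid == os.geteuid() and os.access(cache_dir, os.W_OK)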


>
> > * Con: cost of the cache freshness check will still scale linearly with
> the
> > number of sys.path entries
> >
> > Option 3: persistent per-path-entry cache
> >
> > * Pro: assuming cache freshness means zero runtime queries incur a
> linear DB
> > read (cache creation becomes an install time cost)
> > * Con: if you don't assume cache freshness, you need option 1 or 2
> anyway,
> > and the install time cache just speeds up that first linear read
> > * Con: filesystem access control requires either explicit cache refresh
> or
> > implicit metadata caching support in installers
> > * Con: sys.path directory mtimes are no longer sufficient for cache
> > invalidation (due to potential for directory relocation)
>
> Not sure what problem you're thinking of here? In this model we
> wouldn't be using mtimes for cache invalidation anyway, because it'd
> be the responsibility of those modifying the directory to update the
> cache. And if you rename a whole directory, that doesn't affect its
> mtime anyway?
>

Your second sentence is what I meant - whether the cache is still valid or
not is less about the mtime, and more about what other actions have been
performed. (It's much closer to the locate/updatedb model, where the
runtime part just assumes the cache is valid, and it's somebody else's
problem to ensure that assumption is reasonably accurate.)
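
The division of labour would be roughly as follows (file name and layout
invented purely for illustration, and the atomic-write handling from the
earlier sketch omitted for brevity):

    import json
    import os

    CACHE_NAME = "entry_points_cache.json"  # hypothetical per-path-entry cache

    def refresh_cache(path_entry, scan_metadata):
        """Installer-side: rebuild the cache after modifying the directory,
        analogous to running updatedb."""
        entry_points = scan_metadata(path_entry)  # caller-supplied linear scan
        with open(os.path.join(path_entry, CACHE_NAME), "w") as f:
            json.dump(entry_points, f)

    def read_cache(path_entry):
        """Runtime-side: assume the cache is valid, the way locate does;
        staleness is the installer's problem and isn't checked here."""
        with open(os.path.join(path_entry, CACHE_NAME)) as f:
            return json.load(f)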


> > * Con: interprocess locking arguably still needed to avoid the
> "thundering
> > herd" cache update problem (just between installers rather than runtime
> > processes)
>
> If two installers are trying to rearrange the same directory at the
> same time then they can conflict in lots of ways. For the most part
> people get away with it because doing multiple 'pip install' runs in
> parallel is generally considered a Bad Idea and unlikely to happen by
> accident; and if it is a problem then we should add locking anyway
> (like dpkg and rpm already do), regardless of the cache update part.
>

Another fair point.
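
And if installer-level locking did get added, a plain blocking lock would
be fine there, since installers aren't latency sensitive the way
interpreter startup is. A minimal POSIX sketch (lock file location is
purely illustrative):

    import fcntl

    def acquire_installer_lock(lock_path):
        """Installer-side lock (dpkg/rpm style): blocking is acceptable here,
        since one install can reasonably wait for a concurrent install to
        finish before touching the directory and its cache."""
        lock_file = open(lock_path, "w")
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until the other installer is done
        return lock_file  # caller closes the file to release the lock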

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia