Re: [Distutils] Entry points: specifying and caching

27 Oct 2017

On Fri, Oct 27, 2017 at 5:34 AM, Nick Coghlan wrote:
...
On 27 October 2017 at 18:10, Nathaniel Smith wrote:
...
On Thu, Oct 26, 2017 at 9:02 PM, Nick Coghlan wrote:
...
Option 2: temporary (or persistent) per-user-session cache
* Pro: only the first query per path entry per user session incurs a
linear DB read
* Pro: given persistent cache dirs (e.g. XDG_CACHE_HOME, ~/.cache) even
that overhead can be avoided
* Pro: sys.path directory mtimes are sufficient for cache invalidation
(subject to filesystem timestamp granularity)
Timestamp granularity is a solvable problem. You just have to be
careful not to write out the cache unless the directory mtime is
sufficiently far in the past, like 10 seconds old, say. (This is an
old trick that VCSes use to make commands like 'git status'
fast-and-reliable.)
Yeah, we just recently fixed a bug related to that in pyc file caching (if
you managed to modify and reload a source file multiple times in the same
second, we could end up missing the later edits. The fix was to check that
the source timestamp didn't match the current timestamp before actually
updating the cached copy on the filesystem).
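
To make the "don't write the cache unless the mtime is old enough" trick
concrete, here is a minimal sketch. The names (SAFETY_MARGIN_SECONDS,
is_safe_to_cache, maybe_write_cache) and the write_cache callback are
hypothetical, and the 10 second margin is just the figure floated above,
not a value taken from any real tool.

```python
import os
import time

# Hypothetical margin, taken from the "10 seconds old, say" example above.
SAFETY_MARGIN_SECONDS = 10.0


def is_safe_to_cache(dir_mtime, now, margin=SAFETY_MARGIN_SECONDS):
    """Return True if dir_mtime is far enough in the past to be trusted.

    If the directory changed very recently, a second change could land
    within the same timestamp tick and be indistinguishable by mtime alone.
    An mtime in the future also fails this check, so nothing gets cached.
    """
    return now - dir_mtime >= margin


def maybe_write_cache(directory, write_cache):
    """Call write_cache(directory, mtime) only when the mtime is old enough.

    write_cache is a caller-supplied (hypothetical) function that persists
    the scan results together with the mtime they were computed against.
    """
    mtime = os.stat(directory).st_mtime
    if is_safe_to_cache(mtime, time.time()):
        write_cache(directory, mtime)
        return True
    return False
```

Skipping the write only costs a rescan on the next query; it never makes
the cache wrong.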
...
This does mean you can get in a weird state where if the directory
mtime somehow gets set to the future, then startup time starts sucking
because caching goes away.
For pyc files, we're able to avoid that by looking for cache *inconsistency*
without making any assumptions about which direction time moves - as long as
the source timestamp recorded in the pyc file doesn't match the source
file's mtime, we'll refresh the cache.
This is necessary to cope with things like version controlled directories,
where directory mtimes can easily go backwards because you switched branches
or reverted to an earlier version.
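
A minimal sketch of that inconsistency check, applied to a directory
rather than a source file. The function name cache_is_fresh is
hypothetical; the point is only that the comparison is for equality, so an
mtime that moves backwards invalidates the cache just like one that moves
forwards.

```python
import os


def cache_is_fresh(recorded_mtime_ns, path):
    """Return True only if the recorded mtime exactly matches the current one.

    No "newer than" comparison is made, so it does not matter in which
    direction the timestamp moved.
    """
    return recorded_mtime_ns == os.stat(path).st_mtime_ns
```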
Yeah, this is a good idea, but it doesn't address the reason why some
systems refuse to update their caches when they see mtimes in the
future. The motivation there is that if the mtime is in the future,
then it's possible that at some point in the future, the mtime will
match the current time, and then if the directory is modified at that
moment, the cache will become silently invalid.

It's not clear how important this really is; you have to get somewhat
unlucky, and if you're seeing timestamps from the future then
timekeeping has obviously broken down somehow and nothing based on
mtimes can be reliable without reliable timekeeping. (For example,
even if the mtime seems to be in the past, the clock could get set
backwards and now the same mtime is in the future after all.) But
that's the reasoning I've seen.
...
The os module has atomic write support on Windows in 3.x now:
https://docs.python.org/3/library/os.html#os.replace
So the only problematic case is 2.7 on Windows, and for that Christian
Heimes backported pyosreplace here: https://pypi.org/project/pyosreplace/
(The "may be non-atomic" case is the same situation where it will fail
outright on POSIX systems: when you're attempting to do the rename across
filesystems. If you stay within the same directory, which you want to do
anyway for permissions inheritance and automatic file labeling, it's
atomic).
I've never been able to tell whether this is trustworthy or not; MS
documents the rename-across-filesystems case as an *example* of a case
where it's non-atomic, and doesn't document any atomicity guarantees
either way. Is it really atomic on FAT filesystems? On network
filesystems? (Do all versions of CIFS even give a way to express file
replacement as a single operation?) But there's folklore saying it's
OK...

I guess in this case atomicity wouldn't be that crucial anyway though.
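
For reference, the write-then-rename pattern under discussion can be
sketched as follows on Python 3. The helper name atomic_write_bytes and the
temp file prefix are hypothetical. The sketch keeps the temporary file in
the same directory as the target, per the advice above; whether the final
os.replace is truly atomic on every filesystem is exactly the open question
raised here, and the code does not settle it.

```python
import os
import tempfile


def atomic_write_bytes(path, data):
    """Write data to path by writing a sibling temp file and renaming it.

    The temp file lives in the target's directory so the rename never
    crosses filesystems. Note that mkstemp creates the file with
    owner-only permissions, which a real implementation may want to adjust.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-cache-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        # Replaces any existing file at path in a single call.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```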
...
...
...
Option 3: persistent per-path-entry cache
* Pro: assuming cache freshness means zero runtime queries incur a
linear DB read (cache creation becomes an install time cost)
* Con: if you don't assume cache freshness, you need option 1 or 2
anyway, and the install time cache just speeds up that first linear read
* Con: filesystem access control requires either explicit cache refresh
or implicit metadata caching support in installers
* Con: sys.path directory mtimes are no longer sufficient for cache
invalidation (due to potential for directory relocation)
Not sure what problem you're thinking of here? In this model we
wouldn't be using mtimes for cache invalidation anyway, because it'd
be the responsibility of those modifying the directory to update the
cache. And if you rename a whole directory, that doesn't affect its
mtime anyway?
Your second sentence is what I meant - whether the cache is still valid or
not is less about the mtime, and more about what other actions have been
performed. (It's much closer to the locate/updatedb model, where the runtime
part just assumes the cache is valid, and it's somebody else's problem to
ensure that assumption is reasonably valid)
Yeah. Which is probably the big issue with your third approach: it'll
probably work great if all installers are updated to properly manage
the cache. Explicit cache invalidation is fast and reliable and avoids
all these mtime shenanigans... if everyone implements it properly. But
currently there's lots of software that feels free to randomly dump
stuff into sys.path and doesn't know about the cache invalidation
thing (e.g. old versions of pip and setuptools), and that's a disaster
in a pure explicit invalidation model.

I guess in practice the solution for the transition period would be to
also store the mtime in the cache, so you can at least detect with
high probability when someone has used a legacy installer, and yell at
them to stop doing that. Though this might then cause problems if
people do stuff like zip up their site-packages and then unzip it
somewhere else, updating the mtimes in the process.
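
A rough sketch of that transition-period check, assuming a cache record
that carries the directory mtime alongside the cached data. The record
layout, the function names, and the shape of entry_points are all
hypothetical. Where the record is stored is left open: writing it into the
directory itself would bump the directory mtime, so a real implementation
would have to account for that.

```python
import os


def make_cache_record(directory, entry_points):
    """Build a cache record tagged with the directory's current mtime.

    A cache-aware installer would call this after it finishes modifying
    the directory. entry_points is whatever data is being cached.
    """
    return {
        "dir_mtime_ns": os.stat(directory).st_mtime_ns,
        "entry_points": entry_points,
    }


def record_looks_current(directory, record):
    """Return False if the directory changed since the record was made.

    A mismatch suggests that something which does not maintain the cache
    (e.g. a legacy installer) touched the directory. It is only a
    heuristic: unpacking an archive elsewhere also changes mtimes.
    """
    return record["dir_mtime_ns"] == os.stat(directory).st_mtime_ns
```

On a mismatch the runtime could warn and fall back to a full scan rather
than trusting the record.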

-n

-- 
Nathaniel J. Smith -- https://vorpus.org