[Distutils] Entry points: specifying and caching

Nathaniel Smith njs at pobox.com
Fri Oct 27 17:56:15 EDT 2017


On Fri, Oct 27, 2017 at 5:34 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:
> On 27 October 2017 at 18:10, Nathaniel Smith <njs at pobox.com> wrote:
>>
>> On Thu, Oct 26, 2017 at 9:02 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:
>> > Option 2: temporary (or persistent) per-user-session cache
>> >
>> > * Pro: only the first query per path entry per user session incurs a
>> >   linear DB read
>> > * Pro: given persistent cache dirs (e.g. XDG_CACHE_HOME, ~/.cache) even
>> >   that overhead can be avoided
>> > * Pro: sys.path directory mtimes are sufficient for cache invalidation
>> >   (subject to filesystem timestamp granularity)
>>
>> Timestamp granularity is a solvable problem. You just have to be
>> careful not to write out the cache unless the directory mtime is
>> sufficiently far in the past, like 10 seconds old, say. (This is an
>> old trick that VCSes use to make commands like 'git status'
>> fast-and-reliable.)
>
>
> Yeah, we just recently fixed a bug related to that in pyc file caching (If
> you managed to modify and reload a source file multiple times in the same
> second we could end up missing the later edits. The fix was to check the
> source timestamp didn't match the current timestamp before actually updating
> the cached copy on the filesystem)
>
>>
>> This does mean you can get in a weird state where if the directory
>> mtime somehow gets set to the future, then startup time starts sucking
>> because caching goes away.
>
>
> For pyc files, we're able to avoid that by looking for cache *inconsistency*
> without making any assumptions about which direction time moves - as long as
> the source timestamp recorded in the pyc file doesn't match the source
> file's mtime, we'll refresh the cache.
>
> This is necessary to cope with things like version controlled directories,
> where directory mtimes can easily go backwards because you switched branches
> or reverted to an earlier version.

Yeah, this is a good idea, but it doesn't address the reason why some
systems refuse to update their caches when they see mtimes in the
future. The motivation there is that if the mtime is in the future,
then it's possible that at some point in the future, the mtime will
match the current time, and then if the directory is modified at that
moment, the cache will become silently invalid.

It's not clear how important this really is; you have to get somewhat
unlucky, and if you're seeing timestamps from the future then
timekeeping has obviously broken down somehow and nothing based on
mtimes can be reliable without reliable timekeeping. (For example,
even if the mtime seems to be in the past, the clock could get set
backwards and now the same mtime is in the future after all.) But
that's the reasoning I've seen.
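
To make the two checks concrete, here is a minimal sketch of how they
could be combined. The function names and the 10 second margin are
assumptions made up for illustration (the margin is just the "10
seconds old, say" figure from earlier in the thread); this is not code
from pip, setuptools, or any existing tool. Freshness is tested by
mismatch in either direction, while the decision to write the cache at
all is gated on the mtime being safely in the past, which also refuses
mtimes from the future.

```python
import os
import time

# Assumed safety margin, taken from the "10 seconds old, say" remark.
SAFETY_MARGIN_S = 10.0


def mtime_is_safe_to_cache(dir_mtime, now, margin=SAFETY_MARGIN_S):
    """True if dir_mtime is far enough in the past to be trusted.

    An mtime in the future (dir_mtime > now) is never safe.
    """
    return dir_mtime <= now - margin


def cache_is_fresh(recorded_mtime, dir_mtime):
    """Any mismatch, forwards or backwards in time, means stale."""
    return recorded_mtime == dir_mtime


def maybe_record(path, now=None):
    """Return the mtime to store with the cache, or None to skip writing."""
    now = time.time() if now is None else now
    dir_mtime = os.stat(path).st_mtime
    return dir_mtime if mtime_is_safe_to_cache(dir_mtime, now) else None
```

When maybe_record() returns None the caller would just do the linear
read and not persist anything, so the cost is speed, not correctness.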

> The os module has atomic write support on Windows in 3.x now:
> https://docs.python.org/3/library/os.html#os.replace
>
> So the only problematic case is 2.7 on Windows, and for that Christian
> Heimes backported pyosreplace here: https://pypi.org/project/pyosreplace/
>
> (The "may be non-atomic" case is the same situation where it will fail
> outright on POSIX systems: when you're attempting to do the rename across
> filesystems. If you stay within the same directory, which you want to do
> anyway for permissions inheritance and automatic file labeling, it's
> atomic).

I've never been able to tell whether this is trustworthy or not; MS
documents the rename-across-filesystems case as an *example* of a case
where it's non-atomic, and doesn't document any atomicity guarantees
either way. Is it really atomic on FAT filesystems? On network
filesystems? (Do all versions of CIFS even give a way to express file
replacement as a single operation?) But there's folklore saying it's
OK...

I guess in this case atomicity wouldn't be that crucial anyway though.
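
For reference, the pattern under discussion looks roughly like the
sketch below: write to a temporary file in the same directory as the
target, then os.replace() it into place. The function name and the
temp file prefix are hypothetical. Whether the final step is truly
atomic on every filesystem is exactly the open question above, so
treat this as "as atomic as the platform allows", not a guarantee.

```python
import os
import tempfile


def write_cache_atomically(cache_path, data):
    """Write bytes to cache_path via a temp file in the same directory."""
    directory = os.path.dirname(os.path.abspath(cache_path))
    # Same directory as the target, so the rename never crosses filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".cache-tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        # Replaces any existing file at cache_path.
        os.replace(tmp_path, cache_path)
    except BaseException:
        # Best-effort cleanup of the temp file, then re-raise.
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise
```

Note that creating the temp file changes the mtime of the directory it
lives in, which is harmless for a per-user cache directory but would
matter if the cache were stored inside the sys.path entry itself.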

>> > Option 3: persistent per-path-entry cache
>> >
>> > * Pro: assuming cache freshness means zero runtime queries incur a
>> >   linear DB read (cache creation becomes an install time cost)
>> > * Con: if you don't assume cache freshness, you need option 1 or 2
>> >   anyway, and the install time cache just speeds up that first linear read
>> > * Con: filesystem access control requires either explicit cache refresh
>> >   or implicit metadata caching support in installers
>> > * Con: sys.path directory mtimes are no longer sufficient for cache
>> >   invalidation (due to potential for directory relocation)
>>
>> Not sure what problem you're thinking of here? In this model we
>> wouldn't be using mtimes for cache invalidation anyway, because it'd
>> be the responsibility of those modifying the directory to update the
>> cache. And if you rename a whole directory, that doesn't affect its
>> mtime anyway?
>
>
> Your second sentence is what I meant - whether the cache is still valid or
> not is less about the mtime, and more about what other actions have been
> performed. (It's much closer to the locate/updatedb model, where the runtime
> part just assumes the cache is valid, and it's somebody else's problem to
> ensure that assumption is reasonably valid)

Yeah. Which is probably the big issue with your third approach: it'll
probably work great if all installers are updated to properly manage
the cache. Explicit cache invalidation is fast and reliable and avoids
all these mtime shenanigans... if everyone implements it properly. But
currently there's lots of software that feels free to randomly dump
stuff into sys.path and doesn't know about the cache invalidation
thing (e.g. old versions of pip and setuptools), and that's a disaster
in a pure explicit invalidation model.

I guess in practice the solution for the transition period would be to
also store the mtime in the cache, so you can at least detect with
high probability when someone has used a legacy installer, and yell at
them to stop doing that. Though this might then cause problems if
people do stuff like zip up their site-packages and then unzip it
somewhere else, updating the mtimes in the process.
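
A rough sketch of that transition-period check, with a made-up cache
layout (the dict keys and function names are assumptions, not any
agreed format): the installer records the directory mtime when it
writes the cache, and the reader treats a mismatch as "someone
probably modified this directory without updating the cache".

```python
import os


def build_cache(sys_path_dir, entry_points):
    """Cache payload an updated installer might write (hypothetical layout)."""
    return {
        "dir_mtime_ns": os.stat(sys_path_dir).st_mtime_ns,
        "entry_points": entry_points,
    }


def legacy_install_suspected(sys_path_dir, cache):
    """True if the directory changed since the cache was written.

    This is only a heuristic: relocating the directory (e.g. unzipping
    a copy of site-packages elsewhere) also changes the mtime.
    """
    return os.stat(sys_path_dir).st_mtime_ns != cache["dir_mtime_ns"]
```

If the cache file itself lives inside the directory, the installer
would need to stat the directory after writing the cache, since the
write bumps the mtime too.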

-n

-- 
Nathaniel J. Smith -- https://vorpus.org

