On Fri, Oct 27, 2017 at 5:34 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 27 October 2017 at 18:10, Nathaniel Smith <njs@pobox.com> wrote:
On Thu, Oct 26, 2017 at 9:02 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Option 2: temporary (or persistent) per-user-session cache
* Pro: only the first query per path entry per user session incurs a linear DB read
* Pro: given persistent cache dirs (e.g. XDG_CACHE_HOME, ~/.cache) even that overhead can be avoided
* Pro: sys.path directory mtimes are sufficient for cache invalidation (subject to filesystem timestamp granularity)
Timestamp granularity is a solvable problem. You just have to be careful not to write out the cache unless the directory mtime is sufficiently far in the past, like 10 seconds old, say. (This is an old trick that VCSes use to make commands like 'git status' fast and reliable.)
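In code, that guard is roughly the following (an untested sketch with made-up names; the 10-second window is arbitrary and just needs to comfortably exceed the filesystem's timestamp granularity):

    import os
    import time

    SAFETY_WINDOW = 10  # seconds; must exceed filesystem timestamp granularity

    def safe_to_write_cache(path_entry):
        # Only persist a cache built from path_entry if the directory's mtime
        # is safely in the past, so a write landing in the same timestamp
        # granularity window can't be silently missed.
        dir_mtime = os.stat(path_entry).st_mtime
        return (time.time() - dir_mtime) > SAFETY_WINDOW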
Yeah, we just recently fixed a bug related to that in pyc file caching: if you managed to modify and reload a source file multiple times in the same second, we could end up missing the later edits. The fix was to check that the source timestamp didn't match the current timestamp before actually updating the cached copy on the filesystem.
This does mean you can get into a weird state where, if the directory mtime somehow gets set to the future, startup time starts sucking because caching goes away.
For pyc files, we're able to avoid that by looking for cache *inconsistency* without making any assumptions about which direction time moves - as long as the source timestamp recorded in the pyc file doesn't match the source file's mtime, we'll refresh the cache.
This is necessary to cope with things like version controlled directories, where directory mtimes can easily go backwards because you switched branches or reverted to an earlier version.
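The shape of that check is roughly this (hypothetical helper, not the actual importlib code):

    import os

    def cache_is_stale(recorded_source_mtime, source_path):
        # The cache records the source mtime it was built from; *any* mismatch,
        # forwards or backwards, triggers a refresh, so mtimes that went
        # "backwards" (branch switches, reverts) are handled too.
        return os.stat(source_path).st_mtime != recorded_source_mtime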
Yeah, this is a good idea, but it doesn't address the reason why some systems refuse to update their caches when they see mtimes in the future. The motivation there is that if the mtime is in the future, then it's possible that at some point in the future, the mtime will match the current time, and then if the directory is modified at that moment, the cache will become silently invalid. It's not clear how important this really is; you have to get somewhat unlucky, and if you're seeing timestamps from the future then timekeeping has obviously broken down somehow and nothing based on mtimes can be reliable without reliable timekeeping. (For example, even if the mtime seems to be in the past, the clock could get set backwards and now the same mtime is in the future after all.) But that's the reasoning I've seen.
The os module has atomic file replacement support on Windows in 3.x now: https://docs.python.org/3/library/os.html#os.replace
So the only problematic case is 2.7 on Windows, and for that Christian Heimes backported pyosreplace here: https://pypi.org/project/pyosreplace/
(The "may be non-atomic" case is the same situation where it will fail outright on POSIX systems: when you're attempting to do the rename across filesystems. If you stay within the same directory, which you want to do anyway for permissions inheritance and automatic file labeling, it's atomic).
I've never been able to tell whether this is trustworthy or not; MS documents the rename-across-filesystems case as an *example* of a case where it's non-atomic, and doesn't document any atomicity guarantees either way. Is it really atomic on FAT filesystems? On network filesystems? (Do all versions of CIFS even give a way to express file replacement as a single operation?) But there's folklore saying it's OK... I guess in this case atomicity wouldn't be that crucial anyway though.
Option 3: persistent per-path-entry cache
* Pro: assuming cache freshness means zero runtime queries incur a linear DB read (cache creation becomes an install time cost)
* Con: if you don't assume cache freshness, you need option 1 or 2 anyway, and the install time cache just speeds up that first linear read
* Con: filesystem access control requires either explicit cache refresh or implicit metadata caching support in installers
* Con: sys.path directory mtimes are no longer sufficient for cache invalidation (due to potential for directory relocation)
Not sure what problem you're thinking of here? In this model we wouldn't be using mtimes for cache invalidation anyway, because it'd be the responsibility of those modifying the directory to update the cache. And if you rename a whole directory, that doesn't affect its mtime anyway?
Your second sentence is what I meant - whether the cache is still valid or not is less about the mtime, and more about what other actions have been performed. (It's much closer to the locate/updatedb model, where the runtime part just assumes the cache is valid, and it's somebody else's problem to ensure that assumption is reasonably valid)
Yeah. Which is probably the big issue with your third approach: it'll probably work great if all installers are updated to properly manage the cache. Explicit cache invalidation is fast and reliable and avoids all these mtime shenanigans... if everyone implements it properly. But currently there's lots of software that feels free to randomly dump stuff into sys.path and doesn't know about the cache invalidation thing (e.g. old versions of pip and setuptools), and that's a disaster in a pure explicit invalidation model.

I guess in practice the solution for the transition period would be to also store the mtime in the cache, so you can at least detect with high probability when someone has used a legacy installer, and yell at them to stop doing that. Though this might then cause problems if people do stuff like zip up their site-packages and then unzip it somewhere else, updating the mtimes in the process.

-n

--
Nathaniel J. Smith -- https://vorpus.org
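A rough sketch of that transition-period check (hypothetical names; read_cache() is an assumed helper that returns the cache contents for a path entry, or None):

    import os

    def load_path_entry_cache(path_entry, read_cache):
        # The installer-maintained cache also records the directory mtime it
        # last saw, so a legacy tool that dumped files into the directory
        # without updating the cache can be detected (with high probability).
        cache = read_cache(path_entry)
        if cache is None:
            return None  # no cache at all; fall back to scanning the directory
        if os.stat(path_entry).st_mtime != cache.get("dir_mtime"):
            # Directory changed behind the cache's back (e.g. an old
            # pip/setuptools run): treat the cache as stale.
            return None
        return cache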