On 20 October 2017 at 02:14, Thomas Kluyver <thomas@kluyver.me.uk> wrote:
> On Thu, Oct 19, 2017, at 04:10 PM, Donald Stufft wrote:
>> I’m in favor, although one question I guess is whether it should be a
>> PEP or an ad hoc spec. Given (2) it should *probably* be a PEP (since
>> without (2), it’s just another file in the .dist-info directory and that
>> doesn’t actually need to be standardized at all). I don’t think this
>> will be a very controversial PEP, though, and it should be pretty easy.

> I have opened a PR to document what is already there, without adding any
> new features. I think this is worth doing even if we don't change
> anything, since it's a de-facto standard that different tools use to
> interoperate.

> https://github.com/pypa/python-packaging-user-guide/pull/390

> We can still write a PEP for caching if necessary.

+1 from me for that approach (a PR documenting the status quo, and a PEP for any shared metadata caching design).

Making the status quo more discoverable is valuable in its own right, and the only decisions we'll need to make for that are about terminology clarification, not interoperability (this isn't like PEP 440 or PEP 508, where we actually thought some of the default setuptools behaviour was slightly incorrect and wanted to change it).

Figuring out a robust, cross-platform, network-file-system-tolerant metadata caching design, on the other hand, is going to be hard, and as Donald suggests, the right ecosystem-level solution might be to define install-time hooks for package installation operations.
 
>> I’m also in favor of this, although I would suggest SQLite rather than
>> a JSON file, primarily because a JSON file isn’t multiprocess-safe
>> without being careful (and possibly introducing locking), whereas
>> SQLite has already solved that problem.

> SQLite was actually my first thought, but from experience in Jupyter &
> IPython I'm wary of it - its built-in locking does not work well over
> NFS, and it's easy to corrupt the database. I think careful use of
> atomic writing can be more reliable (though that has given us some
> problems too).

> That may be easier if there's one cache per user, though - we can
> perhaps try to store it somewhere that's not NFS.
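
As a point of reference, the "careful use of atomic writing" mentioned there usually means writing to a temporary file in the same directory and then renaming it over the old file, so readers only ever see either the old cache or the new one, never a partial write. A minimal sketch of that pattern (the helper name is mine):

    import json
    import os
    import tempfile

    def atomic_write_json(data, path):
        # Write to a sibling temp file first; os.replace() is then atomic
        # on both POSIX and Windows, so a concurrent reader sees either
        # the old file or the new one.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path), suffix=".tmp")
        try:
            with os.fdopen(fd, "w") as f:
                json.dump(data, f)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, path)
        except BaseException:
            os.unlink(tmp)  # clean up the temp file on any failure
            raise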

I'm wondering whether, rather than jumping straight to a PEP, it may make sense to initially pursue this idea as a *non*-standard, implementation-dependent thing specific to the "entrypoints" project. There are a *lot* of challenges to be taken into account for a truly universal metadata caching design, and it would be easy to fall into the trap of coming up with a design so complex that nobody can realistically implement it.

Specifically, I'm thinking of a usage model along the lines of the updatedb/locate pair on *nix systems: `locate` gives you access to very fast searches of your filesystem, but it *doesn't* try to automagically keep its indexes up to date. Instead, refreshing the indexes is handled by `updatedb`, and you can either rely on that being run automatically in a cron job, or else force an update with `sudo updatedb` when you want to use `locate`.

For a project like entrypoints, what that might look like is a reasonably fast *runtime* "cache freshness check": scan the mtimes of all the sys.path entries and compare them to the mtime of the cache. If the cache looks up to date, then cool; otherwise, emit a warning about the stale metadata cache and then bypass it.
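
Something along those lines might look like the following sketch (the cache location is illustrative, and a real implementation would likely want finer-grained checks than a single mtime comparison per path entry):

    import os
    import sys
    import warnings

    CACHE_PATH = os.path.expanduser("~/.cache/entrypoints.json")  # illustrative

    def cache_is_fresh():
        # The cache is usable if it is at least as new as every directory
        # on sys.path (a directory's mtime changes when entries are added
        # or removed, e.g. by installing or uninstalling a package).
        try:
            cache_mtime = os.stat(CACHE_PATH).st_mtime
        except OSError:
            return False
        for entry in sys.path:
            try:
                if os.stat(entry).st_mtime > cache_mtime:
                    return False
            except OSError:
                continue  # nonexistent sys.path entry, nothing to check
        return True

    # At lookup time: trust the cache only when it looks fresh.
    if not cache_is_fresh():
        warnings.warn("entry points cache looks stale; falling back to a full scan")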

The entrypoints project itself could then expose a `refresh-entrypoints-cache` command that starts out supporting only virtual environments, then extends to per-user caching, and only later (maybe) considers whether it wants to support installation-wide caches (with the extra permissions management and cross-process and cross-system coordination that would imply).
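
The rebuild side of that command could be as simple as scanning sys.path for entry_points.txt files and dumping the merged result (a stdlib-only sketch; a real version would also need to handle egg-info layouts and name conflicts):

    import configparser
    import glob
    import os
    import sys

    def build_entry_points_index(paths=None):
        index = {}
        for entry in paths if paths is not None else sys.path:
            pattern = os.path.join(entry, "*.dist-info", "entry_points.txt")
            for ep_file in glob.glob(pattern):
                # entry_points.txt is INI-like; values contain ":", so only
                # treat "=" as a delimiter, and preserve name case.
                parser = configparser.ConfigParser(delimiters=("=",))
                parser.optionxform = str
                parser.read(ep_file)
                for group in parser.sections():
                    index.setdefault(group, {}).update(parser.items(group))
        return index

    # refresh-entrypoints-cache would then boil down to:
    #     atomic_write_json(build_entry_points_index(), CACHE_PATH)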

Such an approach would also tie in nicely with Donald's suggestion of reframing the ecosystem-level question as "How should the entrypoints project request that `refresh-entrypoints-cache` be run after every package installation or removal operation?", which in turn would integrate well with things like RPM file triggers (where the system `pip` package could set a file trigger that arranged for any properly registered Python package installation plugins to be run for every modification to site-packages, while still appropriately managing the risk of running arbitrary code with elevated privileges).

Cheers,
Nick.

--
Nick Coghlan   |   ncoghlan@gmail.com   |   Brisbane, Australia