[Distutils] Entry points: specifying and caching

Doug Hellmann doug at doughellmann.com
Wed Oct 18 12:48:04 EDT 2017


Excerpts from Thomas Kluyver's message of 2017-10-18 15:52:00 +0100:
> We're increasingly using entry points in Jupyter to help integrate
> third-party components. This brings up a couple of things that I'd like
> to do:
> 
> 1. Specification
> 
> As far as I know, there's no document describing the details of entry
> points; it's a de-facto standard established by setuptools. It seems to
> work quite well, but it's worth writing down what is unofficially
> standardised. I would like to see a document on
> https://packaging.python.org/specifications/ saying:
> 
> - Where build tools should put entry points in wheels
> - Where entry points live in installed distributions
> - The file format (including allowed characters, case sensitivity...)
> 
> I guess I'm volunteering to write this, although if someone else wants
> to, don't let me stop you. ;-)
> 
> I'd also be happy to hear that I'm wrong, that this specification
> already exists somewhere. If it does, can we add a link from
> https://packaging.python.org/specifications/ ?

I've always used the setuptools documentation as a reference. Are you
suggesting moving that information to a different location to
allow/encourage other tools to implement it as a standard?

> 2. Caching
> 
> "There are only two hard problems in computer science: cache
> invalidation, naming things, and off-by-one errors"
> 
> I know that caching is going to make things more complex, but at present
> a scan of available entry points requires a stat() for every installed
> package, plus open()+read()+parse for every installed package that
> provides entry points. This doesn't scale well, especially on spinning
> hard drives. By eliminating a call to pygments that triggered an entry
> points scan, we cut the cold-start time of IPython almost in half on one
> HDD system (11s -> 6s; PR 10859).
> 
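
For concreteness, the uncached scan described above amounts to roughly
the following. This is only a sketch, not what pkg_resources actually
does: the function name scan_entry_points is made up for illustration,
and it assumes each distribution keeps its entry points in an INI-style
entry_points.txt file inside its .dist-info or .egg-info directory.

```python
import configparser
import os


def scan_entry_points(path_dirs):
    """Collect entry points from every installed distribution on path_dirs.

    Returns {group: {name: object_reference}}. Hypothetical helper.
    """
    found = {}
    for directory in path_dirs:
        try:
            entries = os.listdir(directory)
        except OSError:
            continue  # sys.path may hold missing directories or zip files
        for entry in sorted(entries):
            if not entry.endswith((".dist-info", ".egg-info")):
                continue
            ep_file = os.path.join(directory, entry, "entry_points.txt")
            # One filesystem check per installed package...
            if not os.path.isfile(ep_file):
                continue
            # ...plus open + read + parse for each one with entry points.
            parser = configparser.ConfigParser(
                delimiters=("=",), interpolation=None
            )
            parser.optionxform = str  # preserve the case of names
            parser.read(ep_file, encoding="utf-8")
            for group in parser.sections():
                found.setdefault(group, {}).update(parser[group])
    return found
```

Calling scan_entry_points(sys.path) touches every installed package,
which is the per-package cost being discussed.
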
> As packaging improves, the trend is to break functionality into more,
> smaller packages, which is only going to make this worse (though I hope
> we never end up with a left-pad package ;-). Caching could allow entry
> points to be used in places where the current performance penalty is too
> much.
> 
> I envisage a cache working something like this:
> - Each directory on sys.path can have a cache file, e.g.
> 'entry-points.json'
> - I suggest JSON because Python can parse it efficiently, and it's not
> intended to be directly edited by humans. Other options? SQLite? Does
> someone want to do performance comparisons?
> - There is a command to scan all packages in a directory and build the
> cache file
> - After an install tool (e.g. pip) has added/removed packages from a
> directory, it should call that command to rebuild the cache.
> - A second command goes through all directories on sys.path and rebuilds
> their cache files - this lets the user rebuild caches if something has
> gone wrong.
> - Applications looking for entry points can choose from a range of
> behaviours depending on how important accuracy and performance are. E.g.
> ignore all caches, only use caches, use caches for directories where
> they exist, or try caches first and then scan packages if a key is
> missing.
> 
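
The two rebuild commands in that list could be sketched along these
lines. Everything here is an assumption for the sake of discussion: the
function names build_cache and rebuild_all, the {group: {name: target}}
layout of the JSON, and the write-then-rename step are not part of the
proposal. Only the 'entry-points.json' file name comes from it.

```python
import configparser
import json
import os

CACHE_NAME = "entry-points.json"  # file name suggested in the proposal


def build_cache(directory):
    """Scan one sys.path directory and write its cache file (hypothetical)."""
    cache = {}
    for entry in sorted(os.listdir(directory)):
        if not entry.endswith((".dist-info", ".egg-info")):
            continue
        ep_file = os.path.join(directory, entry, "entry_points.txt")
        if not os.path.isfile(ep_file):
            continue
        parser = configparser.ConfigParser(
            delimiters=("=",), interpolation=None
        )
        parser.optionxform = str  # preserve the case of names
        parser.read(ep_file, encoding="utf-8")
        for group in parser.sections():
            cache.setdefault(group, {}).update(parser[group])
    # Assumed design choice: write to a temporary file and rename it, so a
    # reader never sees a half-written cache.
    tmp = os.path.join(directory, CACHE_NAME + ".tmp")
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(cache, f, indent=1, sort_keys=True)
    os.replace(tmp, os.path.join(directory, CACHE_NAME))
    return cache


def rebuild_all(path_dirs):
    """Rebuild the cache in every writable directory; return those rebuilt."""
    rebuilt = []
    for directory in path_dirs:
        if os.path.isdir(directory) and os.access(directory, os.W_OK):
            build_cache(directory)
            rebuilt.append(directory)
    return rebuilt
```

An installer would call something like build_cache(site_dir) after
adding or removing packages, and rebuild_all(sys.path) would be the
user-facing repair command.
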
> In the best case, when the caches exist and you trust them, loading them
> would cost one set of filesystem operations per sys.path entry, rather
> than per package.
> 
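
On the reading side, a lookup that costs one open() per sys.path entry
might look like the sketch below. The name find_group, the optional
scan_directory fallback, and the rule that earlier sys.path entries win
on name clashes are all assumptions of mine, as is the cache layout
({group: {name: target}}).

```python
import json
import os

CACHE_NAME = "entry-points.json"  # file name suggested in the proposal


def find_group(group, path_dirs, scan_directory=None):
    """Return {name: object_reference} for one entry point group.

    A directory with a cache file costs a single open + read. For a
    directory without a usable cache, fall back to
    scan_directory(directory) if one is given, otherwise skip it
    (the "only use caches" behaviour).
    """
    result = {}
    # Walk in reverse so that earlier path entries overwrite later ones.
    for directory in reversed(list(path_dirs)):
        cache_file = os.path.join(directory, CACHE_NAME)
        try:
            with open(cache_file, encoding="utf-8") as f:
                data = json.load(f)
        except (OSError, ValueError):
            # Missing, unreadable, or corrupt cache file.
            data = scan_directory(directory) if scan_directory else {}
        result.update(data.get(group, {}))
    return result
```

Passing scan_directory=None gives the "only use caches" mode; passing a
scanner gives "use caches for directories where they exist".
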
> Thanks,
> Thomas

We've run into similar issues in some applications I work on. I had
intended to implement a caching layer within stevedore
(https://docs.openstack.org/stevedore/latest/) as a first step for
experimenting with approaches, but I would be happy to collaborate on
something further upstream if there's interest.

Doug

