Excerpts from Thomas Kluyver's message of 2017-10-18 15:52:00 +0100:
We're increasingly using entry points in Jupyter to help integrate third-party components. This brings up a couple of things that I'd like to do:
1. Specification
As far as I know, there's no document describing the details of entry points; it's a de-facto standard established by setuptools. It seems to work quite well, but it's worth writing down what is unofficially standardised. I would like to see a document on https://packaging.python.org/specifications/ saying:
- Where build tools should put entry points in wheels
- Where entry points live in installed distributions
- The file format (including allowed characters, case sensitivity...)
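To make the de-facto format concrete: setuptools writes an entry_points.txt file into each distribution's .dist-info directory, in an INI-like layout that Python's configparser can read. A minimal sketch, with made-up package and plugin names, might look like:

```python
import configparser

# Hypothetical entry_points.txt contents as setuptools writes them into a
# <package>.dist-info/ directory (all names here are invented for illustration).
SAMPLE = """\
[console_scripts]
mytool = mypkg.cli:main

[myapp.plugins]
json = mypkg.formats:JSONFormat
yaml = mypkg.formats:YAMLFormat
"""

parser = configparser.ConfigParser(delimiters=('=',))
parser.read_string(SAMPLE)

# Each section is an entry point group; each key maps a name to
# a "module:attribute" reference.
for group in parser.sections():
    for name, target in parser.items(group):
        module, _, attr = target.partition(':')
        print(group, name, module.strip(), attr.strip())
```

This is exactly the kind of detail (delimiters, section naming, what counts as a valid name) a specification document would need to pin down.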
I guess I'm volunteering to write this, although if someone else wants to, don't let me stop you. ;-)
I'd also be happy to hear that I'm wrong, that this specification already exists somewhere. If it does, can we add a link from https://packaging.python.org/specifications/ ?
I've always used the setuptools documentation as a reference. Are you suggesting moving that information to a different location to allow/encourage other tools to implement it as a standard?
2. Caching
"There are only two hard problems in computer science: cache invalidation, naming things, and off-by-one errors"
I know that caching is going to make things more complex, but at present a scan of available entry points requires a stat() for every installed package, plus open()+read()+parse for every installed package that provides entry points. This doesn't scale well, especially on spinning hard drives. By eliminating a call to Pygments that triggered an entry points scan, we cut the cold-start time of IPython almost in half on one HDD system (11s -> 6s; PR 10859).
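To illustrate where those filesystem calls come from, here is a naive sketch of a per-directory scan (not how pkg_resources is actually implemented, just the cost pattern described above):

```python
import os

def scan_entry_points(path_entry):
    """Naive scan of one sys.path directory: look at every installed
    package's metadata directory, then read entry_points.txt where it
    exists -- one set of filesystem operations per installed package."""
    results = {}
    try:
        names = os.listdir(path_entry)
    except OSError:
        return results
    for name in names:
        if not (name.endswith('.dist-info') or name.endswith('.egg-info')):
            continue
        ep_file = os.path.join(path_entry, name, 'entry_points.txt')
        if os.path.exists(ep_file):        # a stat() per package
            with open(ep_file) as f:       # open()+read() per provider
                results[name] = f.read()
    return results
```

With hundreds of installed packages, those per-package stat() and open() calls are exactly what dominates a cold start on a spinning disk.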
As packaging improves, the trend is to break functionality into more, smaller packages, which is only going to make this worse (though I hope we never end up with a left-pad package ;-). Caching could allow entry points to be used in places where the current performance penalty is too much.
I envisage a cache working something like this:
- Each directory on sys.path can have a cache file, e.g. 'entry-points.json'
- I suggest JSON because Python can parse it efficiently, and it's not intended to be directly edited by humans. Other options? SQLite? Does someone want to do performance comparisons?
- There is a command to scan all packages in a directory and build the cache file
- After an install tool (e.g. pip) has added/removed packages from a directory, it should call that command to rebuild the cache.
- A second command goes through all directories on sys.path and rebuilds their cache files - this lets the user rebuild caches if something has gone wrong.
- Applications looking for entry points can choose from a range of behaviours depending on how important accuracy and performance are. E.g. ignore all caches, only use caches, use caches for directories where they exist, or try caches first and then scan packages if a key is missing.
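A possible shape for such a cache file (entirely hypothetical; the proposal above doesn't fix a schema) would be group -> name -> "module:attribute", e.g.:

```json
{
  "console_scripts": {
    "mytool": "mypkg.cli:main"
  },
  "myapp.plugins": {
    "json": "mypkg.formats:JSONFormat",
    "yaml": "mypkg.formats:YAMLFormat"
  }
}
```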
In the best case, when the caches exist and you trust them, loading them would cost one set of filesystem operations per sys.path entry, rather than per package.
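A minimal sketch of that best-case lookup, assuming the hypothetical 'entry-points.json' name and group->name->target schema, with an optional fallback scan for directories that lack a cache (one of the behaviours listed above):

```python
import json
import os
import sys

CACHE_NAME = 'entry-points.json'  # hypothetical filename from the proposal

def load_entry_points(group, scan_fallback=None):
    """Look up one entry point group across sys.path.

    Tries the per-directory cache file first (one open()+read() per
    sys.path entry, rather than per package); if a directory has no
    usable cache, optionally falls back to a full per-package scan.
    """
    found = {}
    for path_entry in sys.path:
        cache_file = os.path.join(path_entry, CACHE_NAME)
        try:
            with open(cache_file) as f:   # one read per sys.path entry
                cache = json.load(f)
        except (OSError, ValueError):
            if scan_fallback is not None:
                cache = scan_fallback(path_entry)
            else:
                continue  # "only use caches" behaviour
        found.update(cache.get(group, {}))
    return found
```

Later sys.path entries override earlier ones here; a real implementation would need to decide the precedence rules explicitly.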
Thanks, Thomas
We've run into similar issues in some applications I work on. I had intended to implement a caching layer within stevedore (https://docs.openstack.org/stevedore/latest/) as a first step for experimenting with approaches, but I would be happy to collaborate on something further upstream if there's interest. Doug