Entry points: specifying and caching
We're increasingly using entry points in Jupyter to help integrate third-party components. This brings up a couple of things that I'd like to do:

1. Specification

As far as I know, there's no document describing the details of entry points; it's a de-facto standard established by setuptools. It seems to work quite well, but it's worth writing down what is unofficially standardised. I would like to see a document on https://packaging.python.org/specifications/ saying:

- Where build tools should put entry points in wheels
- Where entry points live in installed distributions
- The file format (including allowed characters, case sensitivity...)

I guess I'm volunteering to write this, although if someone else wants to, don't let me stop you. ;-)

I'd also be happy to hear that I'm wrong, that this specification already exists somewhere. If it does, can we add a link from https://packaging.python.org/specifications/ ?

2. Caching

"There are only two hard problems in computer science: cache invalidation, naming things, and off-by-one errors"

I know that caching is going to make things more complex, but at present a scan of available entry points requires a stat() for every installed package, plus open()+read()+parse for every installed package that provides entry points. This doesn't scale well, especially on spinning hard drives. By eliminating a call to pygments which caused an entry points scan, we cut the cold-start time of IPython almost in half on one HDD system (11s -> 6s; PR 10859).

As packaging improves, the trend is to break functionality into more, smaller packages, which is only going to make this worse (though I hope we never end up with a left-pad package ;-). Caching could allow entry points to be used in places where the current performance penalty is too much.

I envisage a cache working something like this:

- Each directory on sys.path can have a cache file, e.g. 'entry-points.json'
- I suggest JSON because Python can parse it efficiently, and it's not intended to be directly edited by humans. Other options? SQLite? Does someone want to do performance comparisons?
- There is a command to scan all packages in a directory and build the cache file
- After an install tool (e.g. pip) has added/removed packages from a directory, it should call that command to rebuild the cache.
- A second command goes through all directories on sys.path and rebuilds their cache files - this lets the user rebuild caches if something has gone wrong.
- Applications looking for entry points can choose from a range of behaviours depending on how important accuracy and performance are. E.g. ignore all caches, only use caches, use caches for directories where they exist, or try caches first and then scan packages if a key is missing.

In the best case, when the caches exist and you trust them, loading them would cost one set of filesystem operations per sys.path entry, rather than per package.

Thanks,
Thomas
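For concreteness, the file under discussion is the entry_points.txt that setuptools writes into a distribution's metadata directory (and into the .dist-info directory of a wheel). Section names are entry point groups; each line maps a name to a 'module:object' reference, optionally followed by required extras in square brackets. A small illustration - every name here is invented:

    [console_scripts]
    mytool = mypkg.cli:main

    [mypkg.plugins]
    frob = mypkg.plugins.frob:Frobnicator [fancy]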
Excerpts from Thomas Kluyver's message of 2017-10-18 15:52:00 +0100:
> As far as I know, there's no document describing the details of entry points; it's a de-facto standard established by setuptools.
I've always used the setuptools documentation as a reference. Are you suggesting moving that information to a different location to allow/encourage other tools to implement it as a standard?
> I know that caching is going to make things more complex, but at present a scan of available entry points requires a stat() for every installed package.
We've run into similar issues in some applications I work on. I had intended to implement a caching layer within stevedore (https://docs.openstack.org/stevedore/latest/) as a first step for experimenting with approaches, but I would be happy to collaborate on something further upstream if there's interest. Doug
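For anyone unfamiliar with stevedore: it wraps the pkg_resources entry point scan behind a manager API, so a caching layer there would be transparent to callers. A minimal sketch of loading one group of plugins (the namespace is invented):

    from stevedore import extension

    # Scan installed packages for entry points in a hypothetical group,
    # loading each plugin object without calling it.
    mgr = extension.ExtensionManager(
        namespace='example.formatters',
        invoke_on_load=False,
    )
    for ext in mgr:
        print(ext.name, '->', ext.plugin)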
On 18 October 2017 at 17:48, Doug Hellmann <doug@doughellmann.com> wrote:
> I've always used the setuptools documentation as a reference. Are you suggesting moving that information to a different location to allow/encourage other tools to implement it as a standard?
I've never used entry points myself (other than the console script entry points supported by packaging) but a quick Google search found http://setuptools.readthedocs.io/en/latest/setuptools.html#dynamic-discovery... as the only obvious candidate for documentation (and a bit later I thought of looking under pkg_resources and found http://setuptools.readthedocs.io/en/latest/pkg_resources.html#entry-points). This doesn't really say how the entry point data is stored in the project metadata, so it's not clear how I'd read that data in my own code (the answer is of course to use pkg_resources, but the point of documenting it as a standard is to allow alternative implementations). Also, it's not clear how a tool like flit might implement entry points - again, because the specifications don't describe how the metadata is stored.

+1 from me on moving the entry point specification to https://packaging.python.org/specifications/

Paul
http://setuptools.readthedocs.io/en/latest/formats.html?highlight=entry_poin...
http://setuptools.readthedocs.io/en/latest/pkg_resources.html?highlight=pkg_...

It is not very complicated. It looks like the characters are mostly 'python identifier' rules with a little bit of 'package name' rules.

I am also concerned about the amount of parsing on startup. A hard problem for certain, since no one likes outdated cache problems either. It is also unpleasant to have too much code with a runtime dependency on 'packaging'.
On 18.10.2017 at 21:06, Daniel Holth wrote:
> I am also concerned about the amount of parsing on startup. A hard problem for certain, since no one likes outdated cache problems either. It is also unpleasant to have too much code with a runtime dependency on 'packaging'.

Wasn't someone working on implementing pkg_resources in the standard library at some point?
On Wed, Oct 18, 2017, at 05:59 PM, Paul Moore wrote:
> This doesn't really say how the entry point data is stored in the project metadata, so it's not clear how I'd read that data in my own code (the answer is of course to use pkg_resources, but the point of documenting it as a standard is to allow alternative implementations).
I have in fact made an alternative implementation (PyPI package entrypoints) by 'reverse engineering' the format. A simple text-based format doesn't really justify the term 'reverse engineering', but for instance it wasn't obvious to me that the names were case sensitive, whereas Python's standard config parser treats keys as case-insensitive. Daniel:
http://setuptools.readthedocs.io/en/latest/formats.html?highlight=entry_poin...
Thanks, this link is closer than any I found to a specification. There are docs on how to create entry points in setup.py and how to use them with pkg_resources, but that's the only bit I've seen that describes the interchange file format. I think we can probably expand on it a bit, though! I'll try to put together something for packaging.python.org. Thomas
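The case-sensitivity point above is easy to trip over with the standard library: configparser lower-cases keys by default, so a naive parser would mangle entry point names. A minimal sketch (the entry point name is invented):

    import configparser

    cp = configparser.ConfigParser()
    cp.optionxform = str  # preserve case; the default transform lower-cases keys
    cp.read_string('[console_scripts]\nMyTool = mypkg.cli:main\n')
    print(list(cp['console_scripts']))  # ['MyTool'], not ['mytool']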
On 18 October 2017 at 19:42, Thomas Kluyver <thomas@kluyver.me.uk> wrote:
> Thanks, this link is closer than any I found to a specification. There are docs on how to create entry points in setup.py and how to use them with pkg_resources, but that's the only bit I've seen that describes the interchange file format.
Agreed, I hadn't found that, either.
I think we can probably expand on it a bit, though! I'll try to put together something for packaging.python.org.
One thing that immediately strikes me is that the encoding of the file is unspecified... Paul
On Wed, Oct 18, 2017 at 2:18 PM Alex Grönholm <alex.gronholm@nextday.fi> wrote:
> Wasn't someone working on implementing pkg_resources in the standard library at some point?
I'm just saying it is good to avoid importing it unless you really need to. Same reason we removed it from entry point script wrappers.
On Wed, Oct 18, 2017 at 2:57 PM Paul Moore <p.f.moore@gmail.com> wrote:
> One thing that immediately strikes me is that the encoding of the file is unspecified...
Now that's an easy one to clear up, since there is only one worthwhile encoding.
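Presumably that one encoding is UTF-8; in code terms, a consumer would decode the file explicitly rather than relying on the locale default:

    # Hedged sketch: always read entry_points.txt as UTF-8.
    with open('entry_points.txt', encoding='utf-8') as f:
        contents = f.read()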
On 19 October 2017 at 04:18, Alex Grönholm <alex.gronholm@nextday.fi> wrote:
> Wasn't someone working on implementing pkg_resources in the standard library at some point?
The idea has been raised, but we've been hesitant for the same reason we're inclined to take distutils out: packaging APIs need to be free to evolve in line with packaging interoperability standards, rather than with the Python language definition.

Barry Warsaw & Brett Cannon recently mentioned something to me about working on a potential runtime alternative to pkg_resources that could be installed without also installing setuptools, but I don't know any of the specifics (and I'm not sure either of them follows distutils-sig).

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
We said "you won't have to install setuptools" but actually "you don't have to use it" is good enough. If you had 2 pkg-resources implementations running you might wind up scanning sys.path extra times...
On 19 October 2017 at 12:16, Daniel Holth <dholth@gmail.com> wrote:
We said "you won't have to install setuptools" but actually "you don't have to use it" is good enough. If you had 2 pkg-resources implementations running you might wind up scanning sys.path extra times...
True, but that's where Thomas's suggestion of attempting to define a standardised caching convention comes in: right now, there's no middle ground between "you must use pkg_resources" and "every helper library must scan for the raw entry-point metadata itself".

If there's a defined common caching mechanism, and support for it is added to new versions of pkg_resources, then the design constraint becomes "If you end up using multiple entry-point scanners, you'll want a recent setuptools/pkg_resources, so you don't waste too much time on repeated metadata scans".

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
    import json
    import os
    import sys

    ENV_JSON_FILENAME = 'entry-points.json'

    def get_env_json_path():
        directory = os.environ.get('VIRTUAL_ENV', sys.prefix)
        return os.path.join(directory, ENV_JSON_FILENAME)

    def on_install(pkg_name, pkg_json):
        # Merge one package's metadata into the environment-wide cache.
        env_json_path = get_env_json_path()
        with open(env_json_path) as f:
            env_json = json.load(f)
        env_json['pkgs'][pkg_name] = pkg_json
        with open(env_json_path, 'w') as f:
            json.dump(env_json, f)

    def read_cached_entry_points():
        # Flatten the per-package entry point mappings into one dict.
        with open(get_env_json_path()) as f:
            env_json = json.load(f)
        return {name: ep
                for pkg in env_json['pkgs'].values()
                for name, ep in pkg['entry_points'].items()}

Would this introduce a need for a new and confusing rescan_metadata() (pkg.on_install() for pkg in pkgs)?
On Oct 18, 2017, at 10:52 AM, Thomas Kluyver <thomas@kluyver.me.uk> wrote:
1. Specification
I'm in favor, although one question I guess is whether it should be a PEP or an ad hoc spec. Given (2) it should *probably* be a PEP (since without (2), it's just another file in the .dist-info directory and that doesn't actually need to be standardized at all). I don't think that this will be a very controversial PEP though, and should be pretty easy.
2. Caching
I'm also in favor of this. Although I would suggest SQLite rather than a JSON file, for the primary reason that a JSON file isn't multiprocess safe without being careful (and possibly introducing locking) whereas SQLite has already solved that problem.

One possible further enhancement to your proposal is to try and think of a way to have a singular cache, since we can include the sys.path entry as part of the data inside the cache; having a singular cache means we can reduce the number of files we have to open down to a single file. The biggest problem I see with this is that it opens up questions about how we handle things like user installs... so maybe a cache DB per sys.path entry is the best way. I think we could use something like SQLite's ATTACH DATABASE command to add multiple DBs to the same SQLite connection, to be able to query across all of the entries with a single query.

One downside to this is that SQLite is an optional module in Python, so it may not exist, although we could implement it so that we just bypass the cache always in that case (and probably raise a warning?) so things continue to work, they will just be slower.

I know that Twisted has used a cache file for a while for plugins (a similar use case), so I wonder if they would have any opinions or insight into this as well.
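To make the ATTACH DATABASE idea concrete, here is a hedged sketch; the cache file name and table schema are invented, and each sys.path entry is assumed to carry its own cache DB:

    import os
    import sqlite3
    import sys

    conn = sqlite3.connect(':memory:')
    attached = []
    for i, entry in enumerate(sys.path):
        db = os.path.join(entry, 'entry-points.db')  # hypothetical cache file
        if os.path.isfile(db):
            schema = 'p%d' % i
            conn.execute('ATTACH DATABASE ? AS %s' % schema, (db,))
            attached.append(schema)

    def find_entry_points(group):
        # Assumes each cache DB contains a table:
        #   entry_points(grp TEXT, name TEXT, objref TEXT)
        for schema in attached:
            yield from conn.execute(
                'SELECT name, objref FROM %s.entry_points WHERE grp = ?' % schema,
                (group,))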
On Thu, Oct 19, 2017, at 04:10 PM, Donald Stufft wrote:
> I'm in favor, although one question I guess is whether it should be a PEP or an ad hoc spec.
I have opened a PR to document what is already there, without adding any new features. I think this is worth doing even if we don't change anything, since it's a de-facto standard used for different tools to interact. https://github.com/pypa/python-packaging-user-guide/pull/390 We can still write a PEP for caching if necessary.
I’m also in favor of this. Although I would suggest SQLite rather than a JSON file for the primary reason being that a JSON file isn’t multiprocess safe without being careful (and possibly introducing locking) whereas SQLite has already solved that problem.
SQLite was actually my first thought, but from experience in Jupyter & IPython I'm wary of it - its built-in locking does not work well over NFS, and it's easy to corrupt the database. I think careful use of atomic writing can be more reliable (though that has given us some problems too). That may be easier if there's one cache per user, though - we can perhaps try to store it somewhere that's not NFS. Thomas
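A minimal sketch of the careful atomic writing mentioned above, assuming one JSON cache file: write the new data to a temporary file in the same directory, then os.replace() it over the old one (atomic on both POSIX and Windows), so readers never observe a partial file. File names are illustrative:

    import json
    import os
    import tempfile

    def write_cache(path, data):
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or '.')
        try:
            with os.fdopen(fd, 'w') as f:
                json.dump(data, f)
            os.replace(tmp, path)  # atomically swap the new cache into place
        except BaseException:
            os.unlink(tmp)  # clean up the temp file on any failure
            raise

    write_cache('entry-points.json', {'pkgs': {}})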
On Oct 19, 2017, at 12:14 PM, Thomas Kluyver <thomas@kluyver.me.uk> wrote:
> I have opened a PR to document what is already there, without adding any new features. We can still write a PEP for caching if necessary.
I think documenting what's there is a reasonable goal, but if we're going to add caching we should just PEP the whole thing, changing it from a de-facto standard to an actual standard + caching. Generally we should only use non-PEP "specs" in places where we're just trying to document what exists already, but where we're not really happy with the current solution or we plan to alter it eventually.

For this, I think the entry points solution is generally a good one with some alterations (namely, the addition of caching)... Although now that I think about it, maybe this isn't really a packaging problem at all and I'm not sure that it benefits from standardization at all.

So stepping back a second, here's what entry points provides today:

1. A way to implement an interface that some other package can provide implementations for.
2. A way to specify script wrappers that will be automatically generated.
3. A way to define extras that must be installed in order for a particular entry point to be available.

Off the bat I'm going to say we don't need to worry about (2) in this hypothetical system, because I think the fact it is implemented currently via this system is mostly a historic accident, and it's not something we should be looking at in the future. Script wrappers should have some dedicated metadata, not piggybacking off of the plugin system.

For (3) I don't believe that what extras were installed is recorded anywhere, so I'm going to guess that this works by looking up what extras are *available* for a particular package and then seeing if all of the requirements of that distribution are satisfied. Assuming that's the case, then that's not really something that requires deep integration with the packaging toolchain; it just needs the APIs to look those things up.

Finally we come to (1), which is in my opinion the meat of what you're hoping to achieve here (and what most people are using entry points for outside of console scripts). What I notice about (1) is that it really has absolutely nothing to do with packaging at all. It would likely use some of the APIs provided by the packaging toolchain (for instance, the ability to add custom files to a .dist-info directory, the ability to iterate over installed packages, etc.) but as a whole pip, setuptools, twine, PyPI, etc. none of these things need to know anything about it.

EXCEPT, for the fact that with the desire to cache things, it would be beneficial to "hook" into the lifecycle of a package install. However I know that there are other plugin systems out there that would like to also be able to do that (Twisted plugins come to mind) and I think that outside of plugin systems, such a mechanism is likely to be useful in general for other cases.

So here's a different idea that is a bit more ambitious, but that I think is a better overall idea. Let entry points be a setuptools thing, and let's define some key lifecycle hooks during the installation of a package and some mechanism in the metadata to let other tools subscribe to those hooks. Then a caching layer could be written for setuptools entry points to make that faster without requiring standardization, but also a whole new, better plugin system could be too, Twisted plugins could benefit, etc. [1]

One thing that I like about all of our work recently in packaging is a lot of it has been about making it so there isn't just one standard set of tools, and I think that providing lifecycle hooks is another step along that path.
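No such hooks exist today; purely to give the idea a shape, a subscribing package might ship a (wholly invented) install-hooks.txt in its .dist-info, and an installer might dispatch to every subscriber after each install. Every name in this sketch is hypothetical:

    import importlib

    def run_post_install_hooks(installed_dist_name, hooks):
        # 'hooks' maps subscriber names to 'module:callable' references,
        # gathered from each installed package's (invented) install-hooks.txt.
        for name, ref in hooks.items():
            modname, _, attr = ref.partition(':')
            callback = getattr(importlib.import_module(modname), attr)
            callback(installed_dist_name)

    # An entry point cache could then subscribe something like
    # 'mycache.hooks:rebuild' and refresh itself after every install.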
> SQLite was actually my first thought, but from experience in Jupyter & IPython I'm wary of it - its built-in locking does not work well over NFS, and it's easy to corrupt the database.
I don't have a lot of experience using SQLite in this way so it's entirely possible it's not as robust as we want/need it to be. I'm not wedded to this idea (but then if we do what I said above, this idea becomes something for any individual implementation of plugins to decide and we don't need to pick a standard here at all!).

[1] I realize the irony in saying a plugin system isn't a packaging problem, so let's define a plugin system for packaging hooks, but I think it can be very simple and not something designed to be reusable outside of that context, and speed is less of a concern, etc.
On 19 October 2017 at 19:09, Donald Stufft <donald@stufft.io> wrote:
So heres a different idea that is a bit more ambitious but that I think is a better overall idea. Let entrypoints be a setuptools thing, and lets define some key lifecycle hooks during the installation of a package and some mechanism in the metadata to let other tools subscribe to those hooks. Then a caching layer could be written for setuptools entrypoints to make that faster without requiring standardization, but also a whole new, better plugin system could to, Twisted plugins could benefit, etc [1].
I think this is a nice idea, and like you say could likely enable a number of interesting use cases. However...
One thing that I like about all of our work recently in packaging is a lot of it has been about making it so there isn’t just one standard set of tools, and I think that providing lifecycle hooks is another step along that path.
While I agree with this, one thing I have noticed with recent work is that standardising existing things has typically been relatively painless and stress-free. But designing new mechanisms generally ends up with huge threads, heated debates, and people burning out on the whole thing. We've had a couple of cases of that recently, and in particular Thomas has endured the big PEP 517 debate, so I'm inclined to say we should take a rest from new designs for a while, and keep the scope here limited.

We can go back and hit packaging system hooks later; it's not like the idea will go away. And the breathing space will also give people time to actually implement the recent PEPs, and consolidate the gains we've already made.

Paul
I prefer a single more generic mechanism that packaging happens to use, instead of making special mechanisms for scripts or other callables that packaging might some day be interested in. With one API, I can type pkg_resources.iter_entry_points('console_scripts') to enumerate the scripts and perhaps invoke them without the wrappers, or I can look up other plugins.

+1 on simply documenting what we have first.

How long does pkg_resources take to import for you folks?
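The API Daniel mentions, in use - enumerating one group and loading a callable without its generated wrapper (this assumes pip happens to be installed, purely to have something to look up):

    import pkg_resources

    # List every console script registered in this environment.
    for ep in pkg_resources.iter_entry_points('console_scripts'):
        print(ep.name, '->', ep.module_name)

    # Load the callable behind one script wrapper directly.
    ep = pkg_resources.get_entry_info('pip', 'console_scripts', 'pip')
    if ep is not None:
        main = ep.load()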
On Oct 19, 2017, at 2:28 PM, Paul Moore <p.f.moore@gmail.com> wrote:
While I agree with this, one thing I have noticed with recent work is that standardising existing things has typically been relatively painless and stress-free. But designing new mechanisms generally ends up with huge threads, heated debates, and people burning out on the whole thing. We've had a couple of cases of that recently, and in particular Thomas has endured the big PEP 517 debate, so I'm inclined to say we should take a rest from new designs for a while, and keep the scope here limited.
So I'm generally fine with keeping the scope limited, but for the same reasons that I think the real solution is what I defined above, I think this isn't/shouldn't be a packaging standard; it is a setuptools feature and should be documented/live there. If setuptools wants to enable people to directly manipulate those files, they can document the standard of those files; if they want to treat it as internal and you're expected to use their APIs, then they can.

Essentially, I don't think that a plugin system should be within the domain of distutils-sig or the PyPA, and the only reason we're even thinking of it as one is because (a) historically setuptools _had_ a plugin system and (b) we lack lifecycle hooks. I'm loath to move the documentation for a setuptools-specific feature out of their documentation because I think it muddies the water further.
On Thu, Oct 19, 2017, at 07:09 PM, Donald Stufft wrote:
So heres a different idea that is a bit more ambitious but that I think is a better overall idea. Let entrypoints be a setuptools thing, and lets define some key lifecycle hooks during the installation of a package and some mechanism in the metadata to let other tools subscribe to those hooks.
I'd like to document the existing mechanism as previously suggested. Not least because I've already written the PR ;-).

I don't think this needs to be controversial. They are a de-facto packaging standard, whether or not that's theoretically necessary. There's more than one tool that can create them (setuptools, flit), and more than one that can consume them (pkg_resources, entrypoints). Lots of packages use them, and they're not going anywhere soon. Describing the format properly seems like a clear win.

For caching, I'm happy enough to work on a more general PEP to define packaging hooks, so long as that isn't going to be as long a discussion as PEP 517.

Daniel:
How long does pkg_resources take to import for you folks?
About 0.5s on my laptop with an SSD, about 5s on a machine with a spinning hard drive. This is simulating a cold start on both; it's much quicker once the OS caches it in memory. Thomas
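For anyone who wants to reproduce the measurement, a rough sketch; the first run after a reboot (or after dropping the OS page cache) approximates the cold start:

    import time

    t0 = time.perf_counter()
    import pkg_resources  # triggers the sys.path metadata scan
    print('import pkg_resources took %.2fs' % (time.perf_counter() - t0))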
On Oct 19, 2017, at 2:54 PM, Thomas Kluyver <thomas@kluyver.me.uk> wrote:
I don't think this needs to be controversial. They are a de-facto packaging standard, whether or not that's theoretically necessary. There's more than one tool that can create them (setuptools, flit), and more than one that can consume them (pkg_resources, entrypoints). Lots of packages use them, and they're not going anywhere soon. Describing the format properly seems like a clear win.
I disagree that they are a packaging standard, and I think it would be crummy to define it as one. I believe it is a setuptools feature; that flit and entrypoints want to integrate with a setuptools feature is fine, but that doesn't make it a packaging standard just because it came from setuptools. I agree that describing the format properly is a clear win, but I believe it belongs in the setuptools documentation.
On Thu, Oct 19, 2017, at 08:01 PM, Donald Stufft wrote:
> I agree that describing the format properly is a clear win, but I believe it belongs in the setuptools documentation.
pip and distlib also independently read this format without going through setuptools. It's a de-facto standard already. Entry points are also the most common way for packages to install command-line scripts, and the most effective way to do so across different platforms. So it's essential that install tools do understand this. Much of our packaging standards were built out of setuptools features anyway - why pretend that this is different?
On Oct 19, 2017, at 3:15 PM, Thomas Kluyver <thomas@kluyver.me.uk> wrote:
> Entry points are also the most common way for packages to install command-line scripts, and the most effective way to do so across different platforms. So it's essential that install tools do understand this.
It’s only essential in that we support a very limited subset specifically for console scripts, which long term we should be extracting from entry points and using something dedicated to that. Generating script wrappers is a packaging concern, and if this proposal was about documenting the console_scripts key in an entry_points.txt file to trigger a console script being generated, then that’s fine with me.
Much of our packaging standards were built out of setuptools features anyway - why pretend that this is different?
Because it is? A generic plugin mechanism is not a packaging feature any more than an HTTP client is a packaging feature, but setuptools contains one of those too. Since setuptools was in large part a packaging library, it will of course contain many packaging features that we're going to standardize on, but something being in setuptools does not in fact make it a packaging feature in and of itself.

As an example of another setuptools feature that isn't a packaging feature, I would also be against adding the resource APIs in a packaging standard, because they're not a packaging feature either; they're a Python import-system feature (which is why Brett Cannon and Barry are adding them to importlib instead of trying to make a packaging PEP for them).
On Thu, Oct 19, 2017, at 08:29 PM, Donald Stufft wrote:
> Something being in setuptools does not in fact make it a packaging feature in and of itself.

My argument is not that it's in setuptools, it's that:

1. It's already processed by multiple packaging tools.
2. Any tool producing wheels which include command-line tools basically has to use entry points (or include a bunch of redundant complexity to make command-line wrappers). It's a de-facto part of the wheel spec, at least until a replacement is devised - and since it works, replacing it for semantic cleanliness is not a priority.

You're quite right that a plugin system doesn't need to be a packaging standard. But that ship has sailed. It's already a standard format for packaging; the only question is whether it's documented. Practicality beats purity.

Thomas
On Oct 19, 2017, at 3:55 PM, Thomas Kluyver <thomas@kluyver.me.uk> wrote:
> You're quite right that a plugin system doesn't need to be a packaging standard. But that ship has sailed. It's already a standard format for packaging; the only question is whether it's documented.
Like I said, I'm perfectly fine documenting that if you add an entry_points.txt to the .dist-info directory - an INI file containing a section named "console_scripts", with a defined syntax for what is valid inside that section - then script wrappers will be generated. But we should leave any other section in this entry_points.txt file as undefined in packaging terms, and point people towards setuptools for more information about it if they want to know anything more than what we need for packaging. I am against fully speccing out or adding more features to entry points as part of a packaging standardization effort.
On Oct 19, 2017, at 4:04 PM, Donald Stufft <donald@stufft.io> wrote:
> Like I said, I'm perfectly fine documenting that if you add an entry_points.txt to the .dist-info directory...
To be more specific here, the hypothetical thing we would be documenting/standardizing here is console entry points and script wrappers, not a generic plugin system. So console scripts would be the focus of the documentation.
On Thu, Oct 19, 2017, at 09:04 PM, Donald Stufft wrote:
> But we should leave any other section in this entry_points.txt file as undefined in packaging terms, and point people towards setuptools for more information.
I don't see any advantage in describing the file format but then pretending that there's only one section in it. We're not prescribing any particular meaning or use for other sections, but it seems bizarre not to describe the possibilities. console_scripts is just one use case.

Also, entry points in general kind of are a packaging thing. You specify them in packaging metadata, both for setuptools and flit, and the packaging tools write entry_points.txt. It's not the only way to create a plugin system, but it's the way this one was created.

I honestly don't get the resistance to documenting this as a whole. I'm not proposing something that will add a new maintenance burden; it's a description of something that's already there. Can't we save the energy for discussing a real change or new thing?

Thomas
On Oct 19, 2017, at 4:36 PM, Thomas Kluyver <thomas@kluyver.me.uk> wrote:
On Thu, Oct 19, 2017, at 09:04 PM, Donald Stufft wrote:
Like I said, I’m perfectly fine with documenting that if you add an entry_points.txt to the .dist-info directory, it is an INI file that contains a section named “console_scripts”, and with defining what is valid inside the console_scripts section so that script wrappers get generated. But we should leave any other section in this entry_points.txt file as undefined in packaging terms, and point people towards setuptools for more information about it if they want to know anything more than what we need for packaging.
I don't see any advantage in describing the file format but then pretending that there's only one section in it. We're not prescribing any particular meaning or use for other sections, but it seems bizarre to not describe the possibilities. console_scripts is just one use case.
Because the feature is unrelated to packaging other than the fact we currently utilize it for console_scripts. A spec to standardize console_scripts is a good thing, a spec to standardize an almost entirely unrelated feature for packaging is a bad thing.
Also, entry points in general kind of are a packaging thing. You specify them in packaging metadata, both for setuptools and flit, and the packaging tools write entry_points.txt. It's not the only way to create a plugin system, but it's the way this one was created.
You can describe lots of things in the packaging metadata, because one of the features of the packaging metadata is that you can add arbitrary files to the dist-info directory. entry_points.txt is one such file that some projects add to that directory, but there are other examples, and just because it involves adding files there does not mean it belongs to “packaging”.
I honestly don't get the resistance to documenting this as a whole. I'm not proposing something that will add a new maintenance burden; it's a description of something that's already there. Can't we save the energy for discussing a real change or new thing?
I don’t get the resistance to documenting this where it belongs. It’s not any more difficult to document things in the setuptools repository than it is to document them in the packaging specs repository.
On 10/19/2017 04:57 PM, Donald Stufft wrote:
Because the feature is unrelated to packaging other than the fact we currently utilize it for console_scripts.

That seems like an odd perspective. Console scripts may be the only bit of entry points which is used *by the packaging system* at installation time, but a system composed of separately-installable packages providing shared services needs some way of querying those services at runtime, which is what all the *other* uses of entry points represent. Having the packaging system register those services at installation time (even if it doesn't care otherwise about them) seems pretty reasonable to me.
Tres. -- =================================================================== Tres Seaver +1 540-429-0999 tseaver@palladion.com Palladion Software "Excellence by Design" http://palladion.com
On Oct 19, 2017, at 5:26 PM, Tres Seaver <tseaver@palladion.com> wrote:
Having the packaging system register those services at installation time (even if it doesn't care otherwise about them) seems pretty reasonable to me.
It does register them at installation time, using an entirely generic feature of “you can add any file you want to a dist-info directory and we will preserve it”. It doesn’t need to know anything else about them other than that it’s a file that needs to be preserved.
On Thursday, October 19, 2017, Donald Stufft <donald@stufft.io> wrote:
On Oct 19, 2017, at 5:26 PM, Tres Seaver <tseaver@palladion.com> wrote:
Having the packaging system register those services at installation time (even if it doesn't care otherwise about them) seems pretty reasonable to me.
It does register them at installation time, using an entirely generic feature of “you can add any file you want to a dist-info directory and we will preserve it”. It doesn’t need to know anything else about them other than that it’s a file that needs to be preserved.
When I think of 'register at installation time', I think of adding them to a single { locked JSON || SQLite DB || ...}; because that's the only way there'd be a performance advantage? Why would we write a .txt, transform it to {JSON || SQL INSERTS}, and then write it to a central registrar? (BTW, pipsi does console script entry points with isolated virtualenvs whose scripts are linked into ~/.local/bin (which is generally user-writable).)
On 20 October 2017 at 02:14, Thomas Kluyver <thomas@kluyver.me.uk> wrote:
On Thu, Oct 19, 2017, at 04:10 PM, Donald Stufft wrote:
I’m in favor, although one question I guess is whether it should be a PEP or an ad hoc spec. Given (2) it should *probably* be a PEP (since without (2), it’s just another file in the .dist-info directory, and that doesn’t actually need to be standardized at all). I don’t think that this will be a very controversial PEP though, and it should be pretty easy.
I have opened a PR to document what is already there, without adding any new features. I think this is worth doing even if we don't change anything, since it's a de-facto standard used for different tools to interact.
https://github.com/pypa/python-packaging-user-guide/pull/390
We can still write a PEP for caching if necessary.
+1 for that approach (PR for the status quo, PEP for a shared metadata caching design) from me.

Making the status quo more discoverable is valuable in its own right, and the only decisions we'll need to make for that are terminology clarification ones, not interoperability ones (this isn't like PEP 440 or 508 where we actually thought some of the default setuptools behaviour was slightly incorrect and wanted to change it).

Figuring out a robust cross-platform network-file-system-tolerant metadata caching design on the other hand is going to be hard, and as Donald suggests, the right ecosystem level solution might be to define install-time hooks for package installation operations.
I’m also in favor of this. Although I would suggest SQLite rather than a JSON file, for the primary reason that a JSON file isn’t multiprocess-safe without being careful (and possibly introducing locking), whereas SQLite has already solved that problem.
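As a rough sketch of what a SQLite-backed per-directory cache along these lines might look like (the file name and schema are purely illustrative, not a proposal):

    import sqlite3

    conn = sqlite3.connect('entry-points-cache.sqlite')  # illustrative name
    with conn:  # one transaction; SQLite serializes concurrent writers itself
        conn.execute("""
            CREATE TABLE IF NOT EXISTS entry_points (
                grp TEXT, name TEXT, objref TEXT, dist TEXT,
                PRIMARY KEY (grp, name, dist)
            )
        """)
        conn.execute(
            "INSERT OR REPLACE INTO entry_points VALUES (?, ?, ?, ?)",
            ('console_scripts', 'mytool', 'mypkg.cli:main', 'mypkg'))
    # Readers can then query one file instead of stat()ing every package:
    rows = conn.execute(
        "SELECT name, objref FROM entry_points WHERE grp = ?",
        ('console_scripts',)).fetchall()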
SQLite was actually my first thought, but from experience in Jupyter & IPython I'm wary of it - its built-in locking does not work well over NFS, and it's easy to corrupt the database. I think careful use of atomic writing can be more reliable (though that has given us some problems too).
That may be easier if there's one cache per user, though - we can perhaps try to store it somewhere that's not NFS.
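A minimal sketch of the atomic-write approach Thomas describes, assuming a JSON cache file (the helper name and layout are illustrative):

    import json
    import os
    import tempfile

    def write_cache(path, data):
        # Write to a temporary file in the same directory, then rename it
        # over the real file. The rename is atomic, so readers never see
        # a partially written cache.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or '.')
        try:
            with os.fdopen(fd, 'w') as f:
                json.dump(data, f)
            os.replace(tmp, path)
        except BaseException:
            os.unlink(tmp)
            raise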
I'm wondering if rather than jumping straight to a PEP, it may make sense to instead initially pursue this idea as a *non-*standard, implementation dependent thing specific to the "entrypoints" project. There are a *lot* of challenges to be taken into account for a truly universal metadata caching design, and it would be easy to fall into the trap of coming up with a design so complex that nobody can realistically implement it.

Specifically, I'm thinking of a usage model along the lines of the updatedb/locate pair on *nix systems: `locate` gives you access to very fast searches of your filesystem, but it *doesn't* try to automagically keep its indexes up to date. Instead, refreshing the indexes is handled by `updatedb`, and you can either rely on that being run automatically in a cron job, or else force an update with `sudo updatedb` when you want to use `locate`.

For a project like entrypoints, what that might look like is that at *runtime*, you may implement a reasonably fast "cache freshness check", where you scanned the mtime of all the sys.path entries, and compared those to the mtime of the cache. If the cache looks up to date, then cool, otherwise emit a warning about the stale metadata cache, and then bypass it.

The entrypoints project itself could then expose a `refresh-entrypoints-cache` command that could start out only supporting virtual environments, and then extend to per-user caching, and then finally (maybe) consider whether or not it wanted to support installation-wide caches (with the extra permissions management and cross-process and cross-system coordination that may imply).

Such an approach would also tie in nicely with Donald's suggestion of reframing the ecosystem level question as "How should the entrypoints project request that 'refresh-entrypoints-cache' be run after every package installation or removal operation?", which in turn would integrate nicely with things like RPM file triggers (where the system `pip` package could set a file trigger that arranged for any properly registered Python package installation plugins to be run for every modification to site-packages while still appropriately managing the risk of running arbitrary code with elevated privileges).

Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
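A sketch of the fast freshness check described above, assuming a single cache file per environment (illustrative only; a real implementation would also need to handle zip imports and other non-directory sys.path entries):

    import os
    import sys

    def cache_is_fresh(cache_path):
        # The cache counts as fresh if no sys.path directory has been
        # modified since the cache file was last written.
        try:
            cache_mtime = os.stat(cache_path).st_mtime
        except OSError:
            return False
        return all(os.stat(p).st_mtime <= cache_mtime
                   for p in sys.path if os.path.isdir(p))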
On 20 October 2017 at 06:34, Donald Stufft <donald@stufft.io> wrote:
On Oct 19, 2017, at 4:04 PM, Donald Stufft <donald@stufft.io> wrote:
Like I said, I’m perfectly fine with documenting that if you add an entry_points.txt to the .dist-info directory, it is an INI file that contains a section named “console_scripts”, and with defining what is valid inside the console_scripts section so that script wrappers get generated. But we should leave any other section in this entry_points.txt file as undefined in packaging terms, and point people towards setuptools for more information about it if they want to know anything more than what we need for packaging.
To be more specific here, the hypothetical thing we would be documenting/standardizing here is console entry points and script wrappers, not a generic plugin system. So console scripts would be the focus of the documentation.
We've already effectively blessed console_scripts as a standard approach: https://packaging.python.org/tutorials/distributing-packages/#entry-points

The specific problem that blessing creates is that we currently only define:

- a way for publishers to specify console_scripts via setuptools
- a way for installers to find console_scripts using pkg_resources

That's *very* similar to the problem we had with dependency declarations: only setuptools knew how to write them, and only easy_install knew how to read them.

Beyond the specific example of console_scripts, there are also multiple subecosystems where both publishers and subscribers remain locked into the setuptools/pkg_resources combination because they use entry points for their plugin management. This means that if you want to write a pytest plugin, for example, the only officially supported way to do so is to use setuptools in order to publish the relevant entry point definitions: https://docs.pytest.org/en/latest/writing_plugins.html#setuptools-entry-poin...

If we want to enable pytest plugin authors to use other build systems like flit, then those build systems need a defined interoperability format that's compatible with what pytest is expecting to see (i.e. entry point definitions that pkg_resources knows how to read).

We ended up solving the previous tight publisher/installer coupling problem for dependency management *not* by coming up with completely new metadata formats, but rather by better specifying the ones that setuptools already knew how to emit, such that most publishers didn't need to change anything, and even when there were slight differences between the way setuptools worked and the agreed interoperability standards, other tools could readily translate setuptools output into the standardised form (e.g. egg_info -> PEP 376 dist-info directories and wheel metadata).

The difference in this case is that:

1. entry_points.txt is already transported reliably through the whole packaging toolchain
2. It is the existing interoperability format for `console_scripts` definitions
3. Unlike setup.cfg & pyproject.toml, actual humans never touch it - it's written and read solely by software

This means that the interoperability problems we actually care about solving (allowing non-setuptools based publishing tools to specify console_scripts and other pkg_resources entry points, and allowing non-pkg_resources based consumers to read pkg_resources entry point metadata, including console_scripts) can both be solved *just* by properly specifying the existing de facto format.

So standardising on entry_points.txt isn't a matter of "because setuptools does it", it's because formalising it is the least-effort solution to what we actually want to enable: making setuptools optional on the publisher side (even if you need to publish entry point metadata), and making pkg_resources optional on the consumer side (even if you need to read entry point metadata).

I do agree that the metadata caching problem is best tackled as a specific motivating example for supporting packaging installation and uninstallation hooks, but standardising the entry points format still helps us with that: it means we can just define "python.install_hooks" as a new entry point category, and spend our energy on defining the semantics and APIs of the hooks themselves, rather than having to worry about defining a new format for how publishers will declare how to run the hooks, or how installers will find out which hooks have been installed locally.

Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 20 October 2017 at 07:33, Donald Stufft <donald@stufft.io> wrote:
On Oct 19, 2017, at 5:26 PM, Tres Seaver <tseaver@palladion.com> wrote:
Having the packaging system register those services at installation time (even if it doesn't care otherwise about them) seems pretty reasonable to me.
It does register them at installation time, using an entirely generic feature of “you can add any file you want to a dist-info directory and we will preserve it”. It doesn’t need to know anything else about them other than that it’s a file that needs to be preserved.
That's all the *installer* needs to know. Publishing tools like flit need to know the internal format in order to replicate the effect of https://packaging.python.org/tutorials/distributing-packages/#console-script... and to interoperate with any other pkg_resources based plugin ecosystem.

I personally find it useful to think of entry points as a pub/sub communications channel between package authors and other runtime components. When you use the entry points syntax to declare a pytest plugin as a publisher, your intended subscriber is pytest, and pytest defines the possible messages. Ditto for any other entry points based plugin management system.

Installers are mostly just a relay link in that pub/sub channel - they take the entry point announcement messages in the sdist or wheel archive, and pass them along to the installation database. The one exception to the "installers as passive relay" behaviour is that when you specify "console_scripts", your intended subscribers *are* package installation tools, and your message is "I'd like an executable wrapper for these entry points, please".

Right now, the only documented publishing API for that pub/sub channel is setuptools.setup(), and the only documented subscription API is pkg_resources. Documenting the file format explicitly changes that dynamic, such that any publisher that produces a compliant `entry_points.txt` file will be supported by pkg_resources, and any consumer that can read a compliant `entry_points.txt` file will be supported by setuptools.setup().

Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
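For readers unfamiliar with the subscription side, the pkg_resources consumer API mentioned here looks roughly like this ('myapp.plugins' is a hypothetical group name):

    import pkg_resources

    # Iterate over every installed distribution advertising this group.
    for ep in pkg_resources.iter_entry_points('myapp.plugins'):
        plugin = ep.load()  # imports the module, returns the named object
        print(ep.name, plugin)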
On Fri, Oct 20, 2017, at 05:42 AM, Nick Coghlan wrote:
I'm wondering if rather than jumping straight to a PEP, it may make sense to instead initially pursue this idea as a *non-*standard, implementation dependent thing specific to the "entrypoints" project. There are a *lot* of challenges to be taken into account for a truly universal metadata caching design, and it would be easy to fall into the trap of coming up with a design so complex that nobody can realistically implement it.

I'd be happy to tackle it like that. Donald's proposed hooks for package installation and uninstallation would provide all the necessary interoperation between different tools. As and when it's working, the cache format can be documented for other consumers to use.

Right now, the only documented publishing API for that pub/sub channel is setuptools.setup(), and the only documented subscription API is pkg_resources. Documenting the file format explicitly changes that dynamic, such that any publisher that produces a compliant `entry_points.txt` file will be supported by pkg_resources, and any consumer that can read a compliant `entry_points.txt` file will be supported by setuptools.setup().

Yup, this is very much what I'd like :-)
Thanks, Thomas
I would also be happy to add a section to the document describing the specific use of entry points for defining scripts to install.
On Oct 20, 2017, at 1:32 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
If we want to enable pytest plugin authors to use other build systems like flit, then those build systems need a defined interoperability format that's compatible with what pytest is expecting to see (i.e. entry point definitions that pkg_resources knows how to read).
This is thinking about it wrong IMO. We could just as easily say that if we want tools like flit to be able to package Twisted plugins, then those build systems need a defined interoperability format that is compatible with what Twisted and that ecosystem are expecting. The *ONLY* reason we should care at all about defining entry points as a packaging feature is console scripts, so we should limit our standardization to that. PBR has a runtime feature too, where it inserts metadata into the .dist-info directory at build time and then provides a runtime API that reads it... should we standardize that too? I’m *not* saying that flit doesn’t need to know how to generate entry points if an entry-points-using project wants to use flit, but what I am saying is that entry points isn’t a packaging specification; it’s a setuptools feature that should live within setuptools.
On 20 October 2017 at 16:43, Thomas Kluyver <thomas@kluyver.me.uk> wrote:
I would also be happy to add a section to the document describing the specific use of entry points for defining scripts to install.
Yeah, it would make sense to include that, as well as reserving the "console_scripts" name on PyPI so we abide by our own "Only rely on a category name if you or one of your dependencies controls it on PyPI" guideline. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 20 October 2017 at 20:48, Donald Stufft <donald@stufft.io> wrote:
On Oct 20, 2017, at 1:32 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
If we want to enable pytest plugin authors to use other build systems like flit, then those build systems need a defined interoperability format that's compatible with what pytest is expecting to see (i.e. entry point definitions that pkg_resources knows how to read).
This is thinking about it wrong IMO.
We could just as easily say that if we want tools like flit to be able to package Twisted plugins, then those build systems need a defined interoperability format that is compatible with what Twisted and that ecosystem are expecting.
Twisted already defines plugin discovery in an inherently packaging-friendly way, since it's based on import names rather than packaging metadata. Other plugin management systems like straight.plugins are similar: they use Python's import system as their pub/sub channel to advertise plugin availability, and accept the limitation that this means all plugin APIs will be module level ones rather than being individual classes or callables.
The *ONLY* reason we should care at all about defining entry points as a packaging feature is console scripts, so we should limit our standardization to that. PBR has a runtime feature too, where it inserts metadata into the .dist-info directory at build time and then provides a runtime API that reads it... should we standardize that too?
No, because PBR isn't the de facto default build system that pip injects into setup.py execution by default. That's the one point where the "de facto standard" status of setuptools is relevant to the question of whether the entry_points.txt format is a PyPA interoperability standard: it is, because providing a functionally equivalent capability is required for publishers to be able to transparently switch from setuptools to something else without their end users noticing the difference. Sure we *could* say "We don't want to standardise on that one, we want to define a different one", but I think entry points are good enough for our purposes, so inventing something different wouldn't be a good use of anyone's time (see also: the perpetually deferred status of PEP 426). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Oct 20, 2017, at 7:02 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
That's the one point where the "de facto standard" status of setuptools is relevant to the question of whether the entry_points.txt format is a PyPA interoperability standard: it is, because providing a functionally equivalent capability is required for publishers to be able to transparently switch from setuptools to something else without their end users noticing the difference.
Nope. Because this isn’t a packaging feature. It’s a runtime feature of setuptools, and we do everyone a disservice by trying to move this into the purview of distutils-sig just because setuptools included a feature once. Just because setuptools included a feature does *NOT* make it a packaging related feature. Tell you what, I’ll drop everything today and write up a PEP that adds metadata for console scripts to the packaging metadata where it belongs, so we can move the console_scripts entry point to a legacy footnote as far as packaging systems go. Then we can discuss whether an arbitrary plugin system is actually a packaging related spec (it’s not) on its own merits.
On Oct 20, 2017, at 1:32 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
3. Unlike setup.cfg & pyproject.toml, actual humans never touch it - it's written and read solely by software
This is wrong BTW; humans can and do effectively write entry_points.txt, since it's a supported feature of setuptools to do:

    setuptools.setup(
        entry_points="""
        [my_cool_entrypoint]
        athing = the.thing:bar
        """,
    )

This is documented and I have run into a number of projects that do this.
On Fri, Oct 20, 2017, at 12:15 PM, Donald Stufft wrote:
Tell you what, I’ll drop everything today and write up a PEP...
Donald, why are you so determined that this spec should not be created? Your time is enormously valuable, so why would you drop everything to write a PEP which implies changes to tooling, simply so that we don't document the status quo? Even if we do make that change, there are thousands of existing packages using the existing de-facto standard, so it would still be valuable to document it.

If it makes things easier, I'll host the spec on my own site and add a 'see also' from the specs page of the packaging user guide (because I think people would expect it to be there, even if it's not the 'right' place). But I don't think anyone else has expressed any objection to putting the spec there.

Thomas
On Oct 20, 2017, at 7:31 AM, Thomas Kluyver <thomas@kluyver.me.uk> wrote:
On Fri, Oct 20, 2017, at 12:15 PM, Donald Stufft wrote:
Tell you what, I’ll drop everything today and write up a PEP...
Donald, why are you so determined that this spec should not be created? Your time is enormously valuable, so why would you drop everything to write a PEP which implies changes to tooling, simply so that we don't document the status quo? Even if we do make that change, there are thousands of existing packages using the existing de-facto standard, so it would still be valuable to document it.
If it makes things easier, I'll host the spec on my own site and add a 'see also' from the specs page of the packaging user guide (because I think people would expect it to be there, even if it's not the 'right' place). But I don't think anyone else has expressed any objection to putting the spec there.
Thomas
I mean, it’s a PEP I was already planning on writing at some point, because I’ve *never* liked the fact that our console script support was reliant on a setuptools feature, so all I’d be doing is re-prioritizing work I was already planning on doing.

I’m also completely happy with documenting the status quo, which from a packaging standpoint means documenting console_scripts - it doesn’t mean pulling in an entire setuptools feature. I’m not even against documenting the entire feature, *if* it’s done inside of setuptools where it belongs.

What I am against is moving the entire entry points feature from a setuptools feature to a packaging standard. It is, at best, tangential to packaging: outside of console_scripts its only real relation is that it uses features of the packaging ecosystem and happened to come from setuptools (but it could have just as easily been written externally to setuptools). Making it a packaging standard comes with several implications:

* Since it is a packaging standard, then it is expected that all packaging tools will be updated to work with it.
* We’re explicitly saying that this is the one true way of solving this problem in the Python ecosystem.
* We stifle innovation (hell, just including it in setuptools at all does this, but we can’t unopen that can of worms).
* We make it actively harder to improve the feature (since once it’s part of the purview of packaging standards, all of distutils-sig gets to weigh in on improvements).

I don’t get why anyone would want to saddle all of the extra implications and work that comes with being a packaging standard on a feature that isn’t one and doesn’t need to be one. We are at our best when our efforts are on generalized mechanisms that allow features such as entry points to be implemented on top of us, rather than trying to pull in every tangential feature under the sun into our domain.
On Fri, Oct 20, 2017, at 12:50 PM, Donald Stufft wrote:
* Since it is a packaging standard, then it is expected that all packaging tools will be updated to work with it.
Where packaging tools need to know about it, they already have to. Where they don't, writing a standard doesn't imply that every tool has to implement it. Documenting it doesn't change either case, it just makes life easier for tools that do need to use it.
* We’re explicitly saying that this is the one true way of solving this problem in the Python ecosystem.
I don't buy that at all. We're saying that it exists, and this is what it is.
* We stifle innovation (hell, just including it in setuptools at all does this, but we can’t unopen that can of worms).
I don't think that's true to any significant extent. Having a standard does not stop people coming up with something better.
* We make it actively harder to improve the feature (since once it’s part of the purview of packaging standards, all of distutils-sig gets to weigh in on improvements).
It hasn't changed in years, as far as I know, and it's so widely used that any change is likely to break a load of stuff anyway. As we've already discussed for caching, we can improve by building *on top* of it relatively easily. And ultimately I think that bringing it out into daylight leads to a healthier future than leaving it under the stone marked 'setuptools'.
On 20 October 2017 at 21:15, Donald Stufft <donald@stufft.io> wrote:
Tell you what, I’ll drop everything today and write up a PEP that adds metadata for console scripts to the packaging metadata where it belongs,
Donald, you're making the same mistake I did with PEP 426: interoperability specifications are useless without a commitment from tooling developers to actually provide implementations that know how to read and write them. And since any new format you come up with won't be supported by existing pip and pkg_resources installations, there won't be any incentive for publishers to start using it, which means there's no incentive for runtime libraries to learn how to read it, etc, etc.

In this case, we already have a perfectly serviceable format (entry_points.txt), a reference publisher (setuptools.setup) and a reference consumer (pkg_resources). The fact that the reference consumer is pkg_resources rather than pip doesn't suddenly take this outside the domain of responsibility of distutils-sig as a whole - it only takes it outside the domain of responsibility of PyPI.

So if you want to say it is neither pip's nor PyPI's responsibility to say anything one way or the other about the entry points format (beyond whether or not they're used to declare console scripts in a way that pip understands), then I agree with you entirely. This spec isn't something you personally need to worry about, since it doesn't impact any of the tools you work on (aside from giving pip's existing console_scripts implementation a firmer foundation from an interoperability perspective).

So the core of our disagreement is whether or not interfaces involving pip and PyPI represent the limits of distutils-sig's responsibility. They don't, and that's reflected in the fact we have a split standing delegation from Guido (one initially to Richard Jones and later to you for changes that affect PyPI, and one to me for packaging ecosystem interoperability specifications in general).
so we can move the console_scripts entry point to a legacy footnote as far as packaging systems go. Then we can discuss whether an arbitrary plugin system is actually a packaging related spec (it’s not) on its own merits.
Instructing publishing system developers on how to publish pkg_resources compatible entry points is indeed a Python packaging ecosystem level concern. Whether that capability survives into a hypothetical future metadata specification (whether that's PEP 426+459 or something else entirely) would then be a different question, but it isn't one we need to worry about right now (since it purely affects internal interoperability file formats that only automated scripts and folks maintaining those scripts need to care about, and we'd expect entry_points.txt and PKG-INFO to coexist alongside any new format for a *long* time). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Oct 20, 2017, at 7:57 AM, Thomas Kluyver <thomas@kluyver.me.uk> wrote:
On Fri, Oct 20, 2017, at 12:50 PM, Donald Stufft wrote:
* Since it is a packaging standard, then it is expected that all packaging tools will be updated to work with it.
Where packaging tools need to know about it, they already have to. Where they don't, writing a standard doesn't imply that every tool has to implement it. Documenting it doesn't change either case, it just makes life easier for tools that do need to use it.
Packaging tools shouldn’t be expected to know anything about it other than the console_scripts feature (which shouldn’t exist as an entry point, but currently does for historical reasons). Publishing tools should have a way for additional files that the publishing tool wasn’t even aware might exist someday to get added to the metadata directory, and installation tools should preserve those files when installing them. With those two generic features, entry points (and other things!) can be written on top of the ecosystem *without* needing to standardize on one solution for one particular non-packaging problem. If a publishing tool doesn’t want to provide that mechanism, then that is fine, but that limits their audience (in the same way that not building C extensions limits their audience: people who need that capability won’t be able to use them).
* We’re explicitly saying that this is the one true way of solving this problem in the Python ecosystem.
I don't buy that at all. We're saying that it exists, and this is what it is.
It’s literally the way all of our packaging standards are written. Don’t use eggs, wheels are the one true way; don’t use YOLO versions, PEP 440 is the one true way; don’t add arbitrary extensions to the simple repo format, the PEP 503 API is the one true way; etc etc etc.
* We stifle innovation (hell, just including it in setuptools at all does this, but we can’t unopen that can of worms).
I don't think that's true to any significant extent. Having a standard does not stop people coming up with something better.
It doesn’t actively prevent someone from coming up with something better, no, but what it does do is add a pretty huge barrier to entry for someone who wanted to come up with something better. It’s the same way that something being added to the stdlib stifles competition. When something is “the standard”, it discourages people from even trying to make something better - or, if they do, it discourages other people from trying it, unless “the standard” is really bad.
* We make it actively harder to improve the feature (since once it’s part of the purview of packaging standards, all of distutils-sig gets to weigh in on improvements).
It hasn't changed in years, as far as I know, and it's so widely used that any change is likely to break a load of stuff anyway. As we've already discussed for caching, we can improve by building *on top* of it relatively easily. And ultimately I think that bringing it out into daylight leads to a healthier future than leaving it under the stone marked 'setuptools'.
If I could guess, I’d say it hasn’t changed in years because setuptools has had bigger things to work on and not enough time to do it in.
On 20 October 2017 at 21:57, Thomas Kluyver <thomas@kluyver.me.uk> wrote:
On Fri, Oct 20, 2017, at 12:50 PM, Donald Stufft wrote:
* We stifle innovation (hell, just including it in setuptools at all does this, but we can’t unopen that can of worms).
I don't think that's true to any significant extent. Having a standard does not stop people coming up with something better.
entry_points.txt will be hard to change for similar reasons to why PKG-INFO is hard to change, but that challenge exists regardless of whether we consider it a setuptools/pkg_resources feature or an ecosystem level standard, since it relates to coupling between metadata publishers and consumers of that metadata. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Oct 20, 2017, at 8:06 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 20 October 2017 at 21:15, Donald Stufft <donald@stufft.io> wrote: Tell you what, I’ll drop everything today and write up a PEP that adds metadata for console scripts to the packaging metadata where it belongs,
Donald, you're making the same mistake I did with PEP 426: interoperability specifications are useless without a commitment from tooling developers to actually provide implementations that know how to read and write them. And since any new format you come up with won't be supported by existing pip and pkg_resources installations, there won't be any incentive for publishers to start using it, which means there's no incentive for runtime libraries to learn how to read it, etc, etc.
Not particularly, no. I can promise you 100% that pip will support it in the next version once I write it. I can also promise you that setuptools will have a PR to support it (not pkg_resources, because console scripts are an install-time feature, not a runtime feature), and I assume Jason would be happy to merge it. So there’s commitment from at least one tool.

The “existing installations” argument is horse shit, because existing implementations won’t support *any* new feature of anything, so it can literally be used as a justification for doing nothing about anything except standardizing what already exists. I guess we shouldn’t have done PEP 517 or PEP 518 because, by your logic here, since it won’t be supported by existing tooling, there won’t be any incentive for people to use it ever.
In this case, we already have a perfectly serviceable format (entry_points.txt), a reference publisher (setuptools.setup) and a reference consumer (pkg_resources). The fact that the reference consumer is pkg_resources rather than pip doesn't suddenly take this outside the domain of responsibility of distutils-sig as a whole - it only takes it outside the domain of responsibility of PyPI.
So if you want to say it is neither pip's nor PyPI's responsibility to say anything one way or the other about the entry points format (beyond whether or not they're used to declare console scripts in a way that pip understands), then I agree with you entirely. This spec isn't something you personally need to worry about, since it doesn't impact any of the tools you work on (aside from giving pip's existing console_scripts implementation a firmer foundation from an interoperability perspective).
My objection has absolutely nothing to do with whether pip is the consumer or not. My objection is entirely based on the fact that a plugin system is not a packaging related feature, and it doesn’t become one because a packaging tool once added a plugin system.
So the core of our disagreement is whether or not interfaces involving pip and PyPI represent the limits of distutils-sig's responsibility. They don't, and that's reflected in the fact we have a split standing delegation from Guido (one initially to Richard Jones and later to you for changes that affect PyPI, and one to me for packaging ecosystem interoperability specifications in general).
No that’s not the core of our disagreement. The core of our disagreement is whether random runtime features suddenly become a packaging concern because they were implemented by one packaging tool once.
so we can move the console_scripts entry point to a legacy footnote as far as packaging systems go. Then we can discuss whether an arbitrary plugin system is actually a packaging related spec (it’s not) on its own merits.
Instructing publishing system developers on how to publish pkg_resources compatible entry points is indeed a Python packaging ecosystem level concern.
No it’s really not.
Whether that capability survives into a hypothetical future metadata specification (whether that's PEP 426+459 or something else entirely) would then be a different question, but it isn't one we need to worry about right now (since it purely affects internal interoperability file formats that only automated scripts and folks maintaining those scripts need to care about, and we'd expect entry_points.txt and PKG-INFO to coexist alongside any new format for a *long* time).
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 20 October 2017 at 22:10, Donald Stufft <donald@stufft.io> wrote:
If I could guess, I’d say it hasn’t changed in years because setuptools has had bigger things to work on and not enough time to do it in.
Then you'd be wrong - it hasn't changed in years because it's a sensible, simple solution to the problem of declaring integration points between independently distributed pieces of software that allows the installed integration points to be listed *without* importing the software providing them (unlike most import based plugin systems).

And yes, I know you're attempting to claim that "declaring integration points between independently distributed pieces of software" isn't a packaging-ecosystem-level concern. It is an ecosystem-level concern, but we haven't had to worry about it previously, because entry points haven't had problems to be fixed the way that other aspects of setuptools have (lack of uninstall support in easy_install, lack of filesystem layout metadata in eggs, ordering quirks in the versioning scheme). For entry points, by contrast, the only missing piece is explicit documentation of the file format used in distribution archives and the installation database.

Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Fri, Oct 20, 2017, at 01:18 PM, Donald Stufft wrote:

I guess we shouldn’t have done PEP 517 or PEP 518 because, by your logic here, since it won’t be supported by existing tooling, there won’t be any incentive for people to use it ever.

I see this as having a similar purpose to those PEPs: reducing dependence on setuptools. The difference is that for building packages, pip explicitly uses setuptools, so the practical way forward was to define an alternative to achieve the same ends. For this, the existing mechanism does not directly rely on setuptools, so it's sufficient to document it so that other tools can confidently produce and consume it.

I also get annoyed at times by arguments that it's not worth improving something because it will be a long time before the change is useful. But I don't think that's what Nick is saying here.

Thomas
On 20 October 2017 at 22:18, Donald Stufft <donald@stufft.io> wrote:
The “existing installations” argument is horse shit, because existing implementations won’t support *any* new feature of anything, so it can literally be used as a justification for doing nothing about anything except standardizing what already exists. I guess we shouldn’t have done PEP 517 or PEP 518 because, by your logic here, since it won’t be supported by existing tooling, there won’t be any incentive for people to use it ever.
No, because PEP 517 and 518 actually change the UX for *publishers* away from setup.py to pyproject.toml + whatever build system they choose, while allowing the definition of a *common* setup.py shim for compatibility with older clients. By contrast, it's relatively rare for people to edit entry_points.txt by hand - it's typically a generated file, just like PKG-INFO.

For any *new* console_scripts replacement, you're also going to have to define how to translate it back to entry_points.txt for compatibility with older pip installations, and that means you're also going to have to define how to do that without conflicting with any other pkg_resources entry points already declared by a package.

Those two characteristics mean that entry_points.txt has a lot more in common with PKG-INFO than it does with setup.py, and that similarity is further enhanced by the fact that it's a pretty easy format to document.
So if you want to say it is neither pip's nor PyPI's responsibility to say anything one way or the other about the entry points format (beyond whether or not they're used to declare console scripts in a way that pip understands), then I agree with you entirely. This spec isn't something you personally need to worry about, since it doesn't impact any of the tools you work on (aside from giving pip's existing console_scripts implementation a firmer foundation from an interoperability perspective).
My objection has absolutely nothing to do with whether pip is the consumer or not. My objection is entirely based on the fact that a plugin system is not a packaging related feature, and it doesn’t become one because a packaging tool once added a plugin system.
You're acting like you believe you have veto power over this topic. You don't - it's not a PyPI related concern, and it doesn't require any changes to pip or warehouse. I'd certainly be *happier* if you were only -0 rather than -1, but your disapproval won't prevent me from accepting Thomas's PR either way. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Oct 20, 2017, at 8:23 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 20 October 2017 at 22:10, Donald Stufft <donald@stufft.io> wrote: If I could guess, I’d say it hasn’t changed in years because setuptools has had bigger things to work on and not enough time to do it in.
Then you'd be wrong - it hasn't changed in years because it's a sensible, simple solution to the problem of declaring integration points between independently distributed pieces of software that allows the installed integration points to be listed *without* importing the software providing them (unlike most import based plugin systems).
I mean, no, I’m not. Entry points have a lot of problems and I know of multiple systems that have either moved away from them, had to hack around how bad they are, have refused to implement them because of previous pain felt by them, are looking for ways to eliminate them, or which just regret ever supporting them.
On Oct 20, 2017, at 8:34 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
You're acting like you believe you have veto power over this topic. You don't - it's not a PyPI related concern, and it doesn't require any changes to pip or warehouse.
I'd certainly be *happier* if you were only -0 rather than -1, but your disapproval won't prevent me from accepting Thomas's PR either way.
I’m acting like I have an opinion. You’re obviously free to accept something that I think is a bad idea, that doesn’t mean I should just shut up and not voice my concerns or objections and I’d appreciate it if you didn’t imply that I should.
On Fri, Oct 20, 2017, at 01:36 PM, Donald Stufft wrote:

Entry points have a lot of problems and I know of multiple systems that have either moved away from them, had to hack around how bad they are, have refused to implement them because of previous pain felt by them, are looking for ways to eliminate them, or which just regret ever supporting them.

The fate of the PR notwithstanding, I'd be interested in hearing more about what problems projects have experienced with entry points, if you have time to describe some examples. We're looking at using them in more places than we already do, so it would be useful to hear about drawbacks we might not have thought about, and about what other options projects have moved to.

Thomas
On Friday, October 20, 2017, Thomas Kluyver <thomas@kluyver.me.uk> wrote:
On Fri, Oct 20, 2017, at 01:36 PM, Donald Stufft wrote:
Entry points have a lot of problems and I know of multiple systems that have either moved away from them, had to hack around how bad they are, have refused to implement them because of previous pain felt by them, are looking for ways to eliminate them, or which just regret ever supporting them.
The fate of the PR notwithstanding, I'd be interested in hearing more about what problems projects have experienced with entry points, if you have time to describe some examples. We're looking at using them in more places than we already do, so it would be useful to hear about drawbacks we might not have thought about, and about what other options projects have moved to.
Thomas
What were the issues with setuptools entry points here (in 2014, when you two were opposed to adding them to sensibly list ipython extensions)? https://github.com/ipython/ipython/pull/4673 https://github.com/ipython/ipython/compare/master...westurner:setuptools_ent...
On Fri, Oct 20, 2017, at 01:58 PM, Wes Turner wrote:

What were the issues with setuptools entry points here (in 2014, when you two were opposed to adding them to sensibly list ipython extensions)?

I'm impressed by your memory! The main issue then was that it implied that extension authors would have to use setuptools. Setuptools has got much better since then, we have better tools and norms for dealing with its rough edges, and there are usable alternative tools that can be used to distribute entrypoints. But the description I've written up is still basically trying to solve the same problem: an application should be able to use entry points without forcing all plugins to use setuptools.

Thomas
On Oct 20, 2017, at 8:41 AM, Thomas Kluyver <thomas@kluyver.me.uk> wrote:
On Fri, Oct 20, 2017, at 01:36 PM, Donald Stufft wrote:
Entry points have a lot of problems and I know of multiple systems that have either moved away from them, had to hack around how bad they are, have refused to implement them because of previous pain felt by them, are looking for ways to eliminate them, or which just regret ever supporting them.
The fate of the PR notwithstanding, I'd be interested in hearing more about what problems projects have experienced with entry points, if you have time to describe some examples. We're looking at using them in more places than we already do, so it would be useful to hear about drawbacks we might not have thought about, and about what other options projects have moved to.
One that I was helping someone debug just the other day is that they’re super non-debuggable, and the behavior when you have two things providing the same entry point is basically ???? (If I remember correctly, the behavior is that the first thing found is the one that “wins”, which means the ordering of sys.path and the names of the projects supplying it are what determine it). This got exposed to the end user in that they installed something they thought was going to add support for something, but which silently did nothing, because two different projects happened to pick the same name for their entry point (not the group, it was two things providing plugins for the same system).

Of course there is the perennial complaint that entry points are super slow, which is partially the fault of pkg_resources doing a bunch of import-time logic, but also because scanning sys.path for all installed stuff is just slow.

They’re also somewhat fragile, since they rely on the packaging metadata system at runtime, and a number of tools exclude that information (often things that deploy stuff as a tarball/zipfile), which causes regular issues to be opened up for these projects when they get used in those environments.

Those are the ones I remember because they come up regularly (and people regularly come to me with issues with any project related to packaging in any way, even for non-packaging-related features in those projects). I’m pretty sure there were more of them that I’ve encountered and seen projects encounter, but I can’t remember them to be sure.

I’m more familiar with why the console_scripts entry point is not great and why we should stop using it, since I regularly try to re-read all of pip’s issues and a lot of its issues are documented there.
On 20 October 2017 at 23:19, Donald Stufft <donald@stufft.io> wrote:
One that I was helping someone debug just the other day is that they’re super non-debuggable, and the behavior when you have two things providing the same entry point is basically ???? (If I remember correctly, the behavior is that the first thing found is the one that “wins”, which means the ordering of sys.path and the names of the projects supplying it are what determine it). This got exposed to the end user in that they installed something they thought was going to add support for something, but which silently did nothing, because two different projects happened to pick the same name for their entry point (not the group, it was two things providing plugins for the same system).
While I agree with this, I think that's a combination of pkg_resources itself being hard to debug in general, and the fact that pkg_resources doesn't clearly define the semantics of how it resolves name conflicts within an entry point group - as far as I know, it's largely an accident of implementation.

The interoperability spec is going to state that conflict resolution when the same name within a group is declared by multiple packages is the responsibility of the group consumer, so documenting the format should actually improve this situation, since it allows for the development of competing conflict resolution strategies in different runtime libraries.
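One possible consumer-side conflict policy of the kind such a spec would permit - failing loudly on duplicate names rather than silently letting the first sys.path hit win - might look like this sketch against the existing pkg_resources API (the function and group names are hypothetical):

    import pkg_resources

    def load_group_strictly(group):
        seen = {}
        for ep in pkg_resources.iter_entry_points(group):
            if ep.name in seen:
                # Refuse to guess which provider should win.
                raise RuntimeError(
                    'entry point %r in group %r is provided by both %s and %s'
                    % (ep.name, group, seen[ep.name].dist, ep.dist))
            seen[ep.name] = ep
        return {name: ep.load() for name, ep in seen.items()}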
Of course there is the perennial complaint that entry points are super slow, which is partially the fault of pkg_resources doing a bunch of import-time logic, but also because scanning sys.path for all installed stuff is just slow.
Similar to the above, one of the goals of documenting the entry point file format is to permit libraries to compete in the development of effective entrypoint metadata caching strategies without needing to bless any particular one a priori, and without trying to manage experimental cache designs across the huge pkg_resources install base.
They’re also somewhat fragile, since they rely on the packaging metadata system at runtime, and a number of tools exclude that information (often things that deploy stuff as a tarball/zipfile), which causes regular issues to be opened up for these projects when they get used in those environments.
This is true, and one of the main pragmatic benefits of adopting one of the purely import based plugin management systems. However, this problem will impact all packaging metadata based plugin management solutions, regardless of whether they use an existing file format or a new one.
Those are the ones I remember because they come up regularly (and people regularly come to me with issues with any project related to packaging in any way, even for non-packaging-related features in those projects). I’m pretty sure there were more of them that I’ve encountered and seen projects encounter, but I can’t remember them to be sure.
I’m more familiar with why the console_scripts entry point is not great and why we should stop using it, since I regularly try to re-read all of pip’s issues and a lot of its issues are documented there.
I'm sympathetic to that, but I think even in that case, clearly documenting the format as an interoperability specification will help tease out which of those are due to the file format itself, and which are due to setuptools.setup specifically. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Oct 20, 2017, at 9:35 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 20 October 2017 at 23:19, Donald Stufft <donald@stufft.io> wrote: One that I was helping someone debug just the other day is that they’re super non-debuggable, and the behavior when you have two things providing the same entry point is basically ???? (If I remember correctly, the behavior is that the first thing found is the one that “wins”, which means the ordering of sys.path and the names of the projects supplying it are what determine it). This got exposed to the end user in that they installed something they thought was going to add support for something, but which silently did nothing, because two different projects happened to pick the same name for their entry point (not the group, it was two things providing plugins for the same system).
While I agree with this, I think that's a combination of pkg_resources itself being hard to debug in general, and the fact that pkg_resources doesn't clearly define the semantics of how it resolves name conflicts within an entry point group - as far as I know, it's largely an accident of implementation.
The interoperability spec is going to state that conflict resolution when the same name within a group is declared by multiple packages is the responsibility of the group consumer, so documenting the format should actually improve this situation, since it allows for the development of competing conflict resolution strategies in different runtime libraries.
I think it makes it *worse*, because now the behavior isn’t just an entry points weirdness; now it changes based on which runtime library you use (which isn’t something that end users are likely to have much insight into), and it represents a footgun that package authors are unlikely to be aware of. If mycoolentrypointslib comes out that is faster, but changes some subtle behavior like this, it’ll break people, but that is unlikely to be an effect that people expect just because they switched between two things both implementing the same standard. So effectively this means that not only is the fact you’re using entry points part of your API, but now which entry point library you’re using at runtime is also part of your API.
Of course there is the perennial complaint that entry points are super slow, which is partially the fault of pkg_resources doing a bunch of import-time logic, but also because scanning sys.path for all installed stuff is just slow.
Similar to the above, one of the goals of documenting the entry point file format is to permit libraries to compete in the development of effective entrypoint metadata caching strategies without needing to bless any particular one a priori, and without trying to manage experimental cache designs across the huge pkg_resources install base.
That goal can be achieved if it’s documented in setuptools.
They’re also somewhat fragile, since they rely on the packaging metadata system at runtime, and a number of tools exclude that information (often things that deploy stuff as a tarball/zipfile), which causes regular issues to be opened up for these projects when they get used in those environments.
This is true, and one of the main pragmatic benefits of adopting one of the purely import based plugin management systems. However, this problem will impact all packaging metadata based plugin management solutions, regardless of whether they use an existing file format or a new one.
Those are the ones I remember because they come up regularly (and people regularly come to me with issues for any project related to packaging in any way, even for non-packaging-related features in those projects). I’m pretty sure there were more of them that I’ve encountered and seen projects encounter, but I can’t remember them well enough to be sure.
I’m more familiar with why the console_scripts entry point is not great and why we should stop using it, since I regularly try to re-read all of pip’s issues, and a lot of its issues are documented there.
I'm sympathetic to that, but I think even in that case, clearly documenting the format as an interoperability specification will help tease out which of those are due to the file format itself, and which are due to setuptools.setup specifically.
All of the ones I’m aware of are due to the file format itself, because they exist even without setuptools being involved at all.
On Friday, October 20, 2017, Donald Stufft <donald@stufft.io> wrote:
[...]
I think it makes it *worse*, because now the behavior isn’t just an entrypoints weirdness, but changes based on which runtime library you use (which isn’t something that end users are likely to have much insight into), and it represents a footgun that package authors are unlikely to be aware of. If mycoolentrypointslib comes out that is faster but changes some subtle behavior like this, it’ll break people, and that is unlikely to be an effect people expect just because they switched between two things both implementing the same standard.
So effectively this means that not only is the fact you’re using entrypoints part of your API, but which entry point library you’re using at runtime is now also part of your API.
When should the check for duplicate entry points occur?
- At on_install() time (+1)
- At runtime
Is a sys.path-like OrderedDict preemptive strategy preferable or just as dangerous as importlib?
On Wed, 18 Oct 2017 at 17:54 Nick Coghlan <ncoghlan@gmail.com> wrote:
On 19 October 2017 at 04:18, Alex Grönholm <alex.gronholm@nextday.fi> wrote:
Daniel Holth kirjoitti 18.10.2017 klo 21:06:
http://setuptools.readthedocs.io/en/latest/formats.html?highlight=entry_poin...
http://setuptools.readthedocs.io/en/latest/pkg_resources.html?highlight=pkg_...
It is not very complicated. It looks like the characters are mostly 'python identifier' rules with a little bit of 'package name' rules.
I am also concerned about the amount of parsing on startup. A hard problem for certain, since no one likes outdated cache problems either. It is also unpleasant to have too much code with a runtime dependency on 'packaging'.
Wasn't someone working on implementing pkg_resources in the standard library at some point?
The idea has been raised, but we've been hesitant for the same reason we're inclined to take distutils out: packaging APIs need to be free to evolve in line with packaging interoperability standards, rather than with the Python language definition.
Barry Warsaw & Brett Cannon recently mentioned something to me about working on a potential runtime alternative to pkg_resources that could be installed without also installing setuptools, but I don't know any of the specifics (and I'm not sure either of them follows distutils-sig).
I've been following distutils-sig for a couple of years now. :) And what Barry and I are working on is only a subset of pkg_resources, specifically the reading of data files included in a package. We aren't touching any other aspect of pkg_resources. Heck, until this discussion, "entry points" == "console scripts" for me, so I don't really know what y'all are talking about standardizing when it comes to plug-in systems and metadata.
Having said that, I do understand why Donald doesn't want to just go ahead and standardize something by giving it the level of a spec on packaging.python.org simply because it's out there. But since entry points seem to be used widely enough, having them written down appropriately also seems reasonable.
As a compromise, could entry points be documented as Thomas is suggesting, but with a note at the top saying something along the lines of "entry points are considered a setuptools-specific feature, but their widespread use warrants a clear understanding of how they function, so that other packaging tools can choose on their own to also support them"? Basically, acknowledge that there are ad-hoc, folk standards in the community that a decent chunk of people rely on, and thus docs would be helpful, but that they don't need to be promoted to full-on, everyone-implements standards.
Excerpts from Wes Turner's message of 2017-10-20 10:41:02 -0400:
[...]
When should the check for duplicate entry points occur?
- At on_install() time (+1)
- At runtime
Is a sys.path-like OrderedDict preemptive strategy preferable or just as dangerous as importlib?
Having "duplicate" entry points is not necessarily an error. It's a different usage pattern. The semantics of dropping a named plugin into a namespace are defined by the application and plugin-point. Please do not build assumptions about uniqueness into the underlying implementation. The stevedore library wraps up pkg_resources with several such patterns. For example, it supports "give me all of the plugins in a namespace" (find all the extensions to your app), "give me all of the plugins named $name in a namespace" (find the hooks for a specific event defined by the app), and "give me *the* plugin named $name in a namespace" (load a driver for talking to a backend). https://docs.openstack.org/stevedore/latest/reference/index.html Doug
Excerpts from Nick Coghlan's message of 2017-10-20 14:42:09 +1000:
On 20 October 2017 at 02:14, Thomas Kluyver <thomas@kluyver.me.uk> wrote:
On Thu, Oct 19, 2017, at 04:10 PM, Donald Stufft wrote:
I’m in favor, although one question I guess is whether it should be a PEP or an ad hoc spec. Given (2) it should *probably* be a PEP (since without (2), it’s just another file in the .dist-info directory, and that doesn’t actually need to be standardized at all). I don’t think that this will be a very controversial PEP though, and it should be pretty easy.
I have opened a PR to document what is already there, without adding any new features. I think this is worth doing even if we don't change anything, since it's a de-facto standard used for different tools to interact.
https://github.com/pypa/python-packaging-user-guide/pull/390
We can still write a PEP for caching if necessary.
+1 for that approach (PR for the status quo, PEP for a shared metadata caching design) from me
Making the status quo more discoverable is valuable in its own right, and the only decisions we'll need to make for that are terminology clarification ones, not interoperability ones (this isn't like PEP 440 or 508 where we actually thought some of the default setuptools behaviour was slightly incorrect and wanted to change it).
Figuring out a robust cross-platform network-file-system-tolerant metadata caching design on the other hand is going to be hard, and as Donald suggests, the right ecosystem level solution might be to define install-time hooks for package installation operations.
I’m also in favor of this. Although I would suggest SQLite rather than a JSON file, the primary reason being that a JSON file isn’t multiprocess-safe without being careful (and possibly introducing locking), whereas SQLite has already solved that problem.
SQLite was actually my first thought, but from experience in Jupyter & IPython I'm wary of it - its built-in locking does not work well over NFS, and it's easy to corrupt the database. I think careful use of atomic writing can be more reliable (though that has given us some problems too).
That may be easier if there's one cache per user, though - we can perhaps try to store it somewhere that's not NFS.
I'm wondering if rather than jumping straight to a PEP, it may make sense to instead initially pursue this idea as a *non-*standard, implementation dependent thing specific to the "entrypoints" project. There are a *lot* of challenges to be taken into account for a truly universal metadata caching design, and it would be easy to fall into the trap of coming up with a design so complex that nobody can realistically implement it.
Specifically, I'm thinking of a usage model along the lines of the updatedb/locate pair on *nix systems: `locate` gives you access to very fast searches of your filesystem, but it *doesn't* try to automagically keep its indexes up to date. Instead, refreshing the indexes is handled by `updatedb`, and you can either rely on that being run automatically in a cron job, or else force an update with `sudo updatedb` when you want to use `locate`.
For a project like entrypoints, what that might look like is that at *runtime*, you may implement a reasonably fast "cache freshness check", where you scanned the mtime of all the sys.path entries, and compared those to the mtime of the cache. If the cache looks up to date, then cool, otherwise emit a warning about the stale metadata cache, and then bypass it.
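A minimal sketch of that freshness check, assuming a single per-user cache file whose name and JSON layout are placeholders rather than anything specified:

    import json
    import os
    import sys
    import warnings

    CACHE_FILE = os.path.expanduser('~/.cache/entrypoints.json')  # hypothetical

    def load_cache_if_fresh():
        # Return cached entry point data, or None if the cache is missing/stale.
        try:
            with open(CACHE_FILE) as f:
                cache = json.load(f)
        except (OSError, ValueError):
            return None
        recorded = cache.get('mtimes', {})
        for path in sys.path:
            try:
                mtime = os.stat(path).st_mtime
            except OSError:
                continue  # nonexistent sys.path entries can't hold plugins
            if mtime != recorded.get(path):
                warnings.warn('stale entry points metadata cache; rescanning')
                return None
        return cache.get('entry_points')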
The entrypoints project itself could then expose a `refresh-entrypoints-cache` command that could start out only supporting virtual environments, and then extend to per-user caching, and then finally (maybe) consider whether or not it wanted to support installation-wide caches (with the extra permissions management and cross-process and cross-system coordination that may imply).
Such an approach would also tie in nicely with Donald's suggestion of reframing the ecosystem level question as "How should the entrypoints project request that 'refresh-entrypoints-cache' be run after every package installation or removal operation?", which in turn would integrate nicely with things like RPM file triggers (where the system `pip` package could set a file trigger that arranged for any properly registered Python package installation plugins to be run for every modification to site-packages while still appropriately managing the risk of running arbitrary code with elevated privileges)
Cheers, Nick.
I have been trying to find time to do something like that within stevedore for a while, to solve some client-side startup performance issues with the OpenStack client. I would be happy to help add it to entrypoints instead and use it from there. Thomas, please let me know how I can help. Doug
On Fri, Oct 20, 2017 at 08:10:06AM -0400, Donald Stufft wrote:
Packaging tools shouldn’t be expected to know anything about it other than the console_scripts feature
Please do not forget about gui_scripts entry points! Marius Gedminas -- What can I do with Python that I can't do with C#? You can go home on time at the end of the day. -- Daniel Klein
On Fri, Oct 20, 2017, at 07:24 PM, Doug Hellmann wrote:
I have been trying to find time to do something like that within stevedore for a while to solve some client-side startup performance issues with the OpenStack client. I would be happy to help add it to entrypoints instead and use it from there.
Thomas, please let me know how I can help.
Thanks Doug! For starters, I'd be interested to hear any plans you have for how to tackle caching, or any thoughts you have on the rough plan I described before. If you're happy with the concepts, I'll have a go at implementing it. I'll probably consider it experimental until there's a hooks mechanism to trigger rebuilding the cache when packages are installed or uninstalled. Thomas
On Fri, Oct 20, 2017, at 07:31 PM, Marius Gedminas wrote:
Please do not forget about gui_scripts entry points!
I haven't forgotten about them in the draft spec: https://github.com/pypa/python-packaging-user-guide/pull/390/files#diff-089b...
Excerpts from Thomas Kluyver's message of 2017-10-20 19:37:45 +0100:
On Fri, Oct 20, 2017, at 07:24 PM, Doug Hellmann wrote:
I have been trying to find time to do something like that within stevedore for a while to solve some client-side startup performance issues with the OpenStack client. I would be happy to help add it to entrypoints instead and use it from there.
Thomas, please let me know how I can help.
Thanks Doug! For starters, I'd be interested to hear any plans you have for how to tackle caching, or any thoughts you have on the rough plan I described before. If you're happy with the concepts, I'll have a go at implementing it. I'll probably consider it experimental until there's a hooks mechanism to trigger rebuilding the cache when packages are installed or uninstalled.
Thomas
I assumed that the user loading the plugins might not be able to write to any of the directories on sys.path (aside from "." and we don't want to put a cache file there), so my plan was to build the cache the first time entry points were scanned, and use appdirs [1] to pick a cache location specific to the user. I thought I would use the value of sys.path as a string (joining the paths together with a separator of some sort) to create a hash for the cache file ID. Some of that may be obviated if we assume a setuptools hook that lets us update the cache(s) when a package is installed.
I also thought I'd provide a command line tool to generate the cache, just in case it became corrupted or if someone wanted to update it by hand for some other reason, similar to Nick's locate/updatedb parallel UX example (and re-reading your email, I see you mention this, too).
I hadn't gone as far as deciding on a file format, but sqlite, JSON, and INI (definitely something built-in) were all on my mind. I planned to see if we would actually gain enough of a boost just by placing a separate file for each dist in a single cache directory, rather than trying to merge everything into one file. In addition to eliminating the concurrency issue, that approach might have the additional benefit of simplifying operating system packages, because they could just add a new file to the package instead of having to run a command to update the cache when a package was installed (if the file is the same format as entry_points.txt but with a different name, that's even simpler, since it's just a copy of a file that will already be available during packaging).
Your idea of having a cache file per directory on sys.path is also interesting, though I have to admit I'm not familiar enough with the import machinery to know if it's easy to determine the containing directory for a dist to find the right cache to update. I am interested in hearing more details about what you planned there.
I would also like to compare the performance of a few approaches (1 file per sys.path hash using INI, JSON, and sqlite; one file per entry on sys.path using the same formats) using a significant number of plugins (~100?) before we decide.
I agree with your statement in the original email that applications should be able to disable the cache. I'm not sure it makes sense to have a mode that only reads from a cache, but I may just not see the use case for that.
What's our next step?
Doug
[1] https://pypi.python.org/pypi/appdirs/1.4.3
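To make the appdirs-plus-sys.path-hash part of that concrete, roughly (the 'entrypoints' app name and the .json suffix here are placeholders, not a proposal for the final format):

    import hashlib
    import os
    import sys

    import appdirs  # https://pypi.python.org/pypi/appdirs/1.4.3

    def cache_file_for_current_sys_path():
        # Each distinct sys.path gets its own cache file under a
        # per-user cache directory.
        joined = os.pathsep.join(sys.path)
        digest = hashlib.sha256(joined.encode('utf-8')).hexdigest()
        cache_dir = appdirs.user_cache_dir('entrypoints')  # hypothetical name
        return os.path.join(cache_dir, digest + '.json')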
On Friday, October 20, 2017, Doug Hellmann <doug@doughellmann.com> wrote:
[...]
Having "duplicate" entry points is not necessarily an error. It's a different usage pattern. The semantics of dropping a named plugin into a namespace are defined by the application and plugin-point. Please do not build assumptions about uniqueness into the underlying implementation.
I think that, at least with console_scripts, we already assume uniqueness: if there's another package which provides a 'pip' console_script, for example, there's not yet an error message?
Would it be helpful to at least spec that iterated entrypoints are in sys.path order? And then what about entrypoints coming from the same path in sys.path: alphabetical? Whatever hash randomization does with it?
Whenever I feel unsure about my data model, I tend to sometimes read the OWL spec: here, the OWL spec has owl:cardinality OR owl:minCardinality OR owl:maxCardinality. Some entrypoints may have 0, only one, or n "instances"? We should throw an error if a given console_script entrypoint has more than one "instance" (exceeds maxCardinality xsd:string = 1).
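As a sketch of what consumer-side cardinality enforcement could look like (using pkg_resources only because it's the current de-facto API; the group name is whatever the consumer defines):

    import collections
    import pkg_resources

    def load_unique(group):
        # Enforce maxCardinality = 1 per name within a group, erroring
        # on duplicates instead of silently letting sys.path order win.
        by_name = collections.defaultdict(list)
        for ep in pkg_resources.iter_entry_points(group):
            by_name[ep.name].append(ep)
        dupes = sorted(name for name, eps in by_name.items() if len(eps) > 1)
        if dupes:
            raise RuntimeError('conflicting entry points: %r' % dupes)
        return {name: eps[0].load() for name, eps in by_name.items()}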
The stevedore library wraps up pkg_resources with several such patterns. For example, it supports "give me all of the plugins in a namespace" (find all the extensions to your app), "give me all of the plugins named $name in a namespace" (find the hooks for a specific event defined by the app), and "give me *the* plugin named $name in a namespace" (load a driver for talking to a backend).
https://docs.openstack.org/stevedore/latest/reference/index.html
https://github.com/openstack/stevedore/blob/master/stevedore/extension.py https://github.com/openstack/stevedore/blob/master/stevedore/tests/test_exte... These tests mention saving discovered entry points in a cache?
On Oct 19, 2017 11:10, "Donald Stufft" <donald@stufft.io> wrote:
EXCEPT, for the fact that with the desire to cache things, it would be beneficial to “hook” into the lifecycle of a package install. However I know that there are other plugin systems out there that would like to also be able to do that (Twisted Plugins come to mind), and I think that outside of plugin systems, such a mechanism is likely to be useful in general for other cases.
So here’s a different idea that is a bit more ambitious, but that I think is a better overall idea. Let entrypoints be a setuptools thing, and let’s define some key lifecycle hooks during the installation of a package and some mechanism in the metadata to let other tools subscribe to those hooks. Then a caching layer could be written for setuptools entrypoints to make that faster without requiring standardization, but also a whole new, better plugin system could too, Twisted plugins could benefit, etc [1].
In this hypothetical system, how do installers like pip find the list of hooks to call? By looking up an entrypoint? (Sorry if this was discussed downthread; I didn't see it but I admit I only skimmed.)
-n
I like the idea of lifecycle hooks, but I worry about the malware problem; would there need to be a blacklist / whitelist / disable system? (ignore-scripts=true is now a recommended part of anyone's npm configuration.) That is why we have avoided any kind of (package-specific) hooks in wheel. However, hooks would be a very elegant way to avoid worrying about core pip functionality, since it wouldn't be core functionality. On Fri, Oct 20, 2017 at 4:41 PM Nathaniel Smith <njs@pobox.com> wrote:
[...]
Excerpts from Nathaniel Smith's message of 2017-10-20 13:41:03 -0700:
[...]
Having post-install and pre-uninstall hooks should be sufficient for updating a cache, assuming the hook could be given enough information about the thing being manipulated to probe for whatever data it needs.
In this hypothetical system, how do installers like pip find the list of hooks to call? By looking up an entrypoint? (Sorry if this was discussed downthread; I didn't see it but I admit I only skimmed.)
That's how I would expect it to work. Using setuptools most likely? That would mean that other plugin systems would have to provide one setuptools plugin to hook into the installer to build a lookup cache, but the actual plugins wouldn't have to use setuptools for anything. Doug
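A sketch of what the installer side of that might look like - the hook group name and the hook signature here are entirely invented, since nothing has been specified yet:

    import pkg_resources

    # Hypothetical group name; a real spec would have to pick one.
    HOOK_GROUP = 'python.install_hooks.post_install'

    def run_post_install_hooks(installed_dist_name):
        # The installer looks up registered hooks via an entry point
        # group and calls each one with the name of the distribution
        # that was just installed (assumed signature).
        for ep in pkg_resources.iter_entry_points(HOOK_GROUP):
            hook = ep.load()
            hook(installed_dist_name)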
On 21 October 2017 at 06:50, Daniel Holth <dholth@gmail.com> wrote:
I like the idea of lifecycle hooks but I worry about the malware problem; would there need to be a blacklist / whitelist / disable system? (ignore-scripts=true is now a recommended part of anyone's npm configuration) That is why we have avoided any kind of (package specific) hooks to wheel. However hooks would be a very elegant way to avoid worrying about core pip functionality since it wouldn't be core functionality.
Yeah, here's the gist of what I had in mind regarding the malware problem (i.e. aiming to ensure we don't get all of setup.py's problems back again):
- a package's own install hooks do *not* get called for that package
- hooks only run by default inside a virtualenv as a regular user
- outside a virtualenv, the default is "hooks don't get run at all"
- when running with elevated privileges, the default is "hooks don't get run at all"
There are still some open questions with it (like what to do with hooks defined in packages that get implicitly coinstalled as a dependency), and having the default behaviour depend on both "venv or not" and "superuser or not" may prove confusing, but it would avoid a number of the things we dislike about install-time setup.py invocation. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 20 October 2017 at 23:42, Donald Stufft <donald@stufft.io> wrote:
On Oct 20, 2017, at 9:35 AM, Nick Coghlan <ncoghlan@gmail.com> wrote: The interoperability spec is going to state that conflict resolution when the same name within a group is declared by multiple packages is the responsibility of the group consumer, so documenting the format should actually improve this situation, since it allows for the development of competing conflict resolution strategies in different runtime libraries.
I think it makes it *worse*, because now the behavior isn’t just an entrypoints weirdness, but changes based on which runtime library you use (which isn’t something that end users are likely to have much insight into), and it represents a footgun that package authors are unlikely to be aware of. If mycoolentrypointslib comes out that is faster but changes some subtle behavior like this, it’ll break people, and that is unlikely to be an effect people expect just because they switched between two things both implementing the same standard.
So effectively this means that not only is the fact you’re using entrypoints part of your API, but which entry point library you’re using at runtime is now also part of your API.
The semantics of conflict resolution across different projects is a concern that mainly affects app developers with a large established plugin base, and even with pkg_resources, the question of whether or not multiple projects re-using the same entrypoint name is a problem depends on how the application uses that information. With console_scripts and gui_scripts, name conflicts can definitely be a problem, since different projects will end up fighting over the same filename for their executable script wrapper. For other use cases (like some of the ones Doug described for stevedore), it's less of a concern, because the names never get collapsed into a single flat namespace the way script wrappers do. Cheers, Nick. P.S. Thanks for your comments on the PR - they're helping to make sure we accurately capture the status quo. I'm also going to file an issue on the setuptools issue tracker to make sure Jason is aware of what we're doing, and get his explicit OK with the idea of making the format a PyPA interoperability specification (if he isn't, we'll demote Thomas's document to being a guide for tool developers aiming for pkg_resources interoperability). -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 21 October 2017 at 05:26, Doug Hellmann <doug@doughellmann.com> wrote:
I would also like to compare the performance of a few approaches (1 file per sys.path hash using INI, JSON, and sqlite; one file per entry on sys.path using the same formats) using a significant number of plugins (~100?) before we decide.
If you can manage it, you'll want to run at least some of those tests with the plugins and their metadata mounted via a network drive. When the import system switched from multiple stat calls to cached os.listdir() lookups, SSD and spinning disk imports received a minor speedup, but NFS imports improved *dramatically* (folks reported order of magnitude improvements, along the lines of startup times dropping from 2-3 seconds to 200-300 ms). I'd expect to see a similar pattern here - inefficient file access patterns can be tolerable with an SSD, and even spinning disks, but the higher latency involved in accessing network drives will make you pay for it. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Saturday, October 21, 2017, Nick Coghlan <ncoghlan@gmail.com> wrote:
[...]
P.S. Thanks for your comments on the PR - they're helping to make sure we accurately capture the status quo. I'm also going to file an issue on the setuptools issue tracker to make sure Jason is aware of what we're doing, and get his explicit OK with the idea of making the format a PyPA interoperability specification (if he isn't, we'll demote Thomas's document to being a guide for tool developers aiming for pkg_resources interoperability).
What are the URIs for this PR and issue?
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 21 October 2017 at 18:04, Wes Turner <wes.turner@gmail.com> wrote:
On Saturday, October 21, 2017, Nick Coghlan <ncoghlan@gmail.com> wrote:
I'm also going to file an issue on the setuptools issue tracker to make sure Jason is aware of what we're doing, and get his explicit OK with the idea of making the format a PyPA interoperability specification (if he isn't, we'll demote Thomas's document to being a guide for tool developers aiming for pkg_resources interoperability).
What are the URIs for this PR and issue?
New setuptools issue: https://github.com/pypa/setuptools/issues/1179 (I hadn't filed it yet when I wrote the previous comment) Thomas's PR: https://github.com/pypa/python-packaging-user-guide/pull/390 Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 21 October 2017 at 18:21, Nick Coghlan <ncoghlan@gmail.com> wrote:
[...]
With Jason's +1 on the setuptools issue, I've gone ahead and hit the merge button on Thomas's PR: https://github.com/pypa/python-packaging-user-guide/commit/34c37f0e66821127a... The spec is now available here https://packaging.python.org/specifications/entry-points/, and clarifications and corrections can be submitted as follow-up PRs (as for other PyPA specifications). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Fri, Oct 20, 2017 at 11:59 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Yeah, here's the gist of what I had in mind regarding the malware problem (i.e. aiming to ensure we don't get all of setup.py's problems back again):
- a package's own install hooks do *not* get called for that package
Doesn't that break the entry point caching use case that started this whole discussion? When you first install the caching package, it has to immediately build the cache for the first time.
I don't really have the time or interest to dig into this (I know there are legitimate use cases for entry points, but I'm very wary of any feature where package A starts doing something different because package B was installed). But I just wanted to throw out that I see at least two reasons we might want to "bake in" the caching as part of our PEPified metadata:
- if we do want to add "install hooks", then we need some way for a package to declare it has an install hook and for pip-or-whoever to find it. The natural way would be to use an entry point, which means entry points are in some sense "more fundamental" than install hooks.
- in general, the only thing that can update an entry-point cache is the package that's doing the install, at the time it runs. In particular, consider an environment with some packages installed in /usr, some in /usr/local, some in ~/.local/. Really you want one cache in each location, and then to have dpkg/rpm responsible for updating the /usr cache (this is something they're familiar with; it's isomorphic to stuff like /etc/ld.so.cache), 'sudo pip' responsible for updating the /usr/local cache, and 'pip --user' responsible for updating the ~/.local/ cache. If we go the install hook route instead, then when I do 'pip install --user entry_point_cacher' there's no way that it'll ever have the permissions to write to /usr, and maybe not to /usr/local either, depending on how you want to handle the interaction between 'sudo pip' and ~/.local/ install hooks, so it just... won't actually work as a caching tool. Similarly, it's probably easier to convince conda to regenerate a single standard entry point cache after installing a conda package than it would be to convince them to run generic wheel install hooks when not even installing wheels.
-n
--
Nathaniel J. Smith -- https://vorpus.org
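For comparison, runtime lookup against per-location caches along the lines Thomas originally sketched could be as simple as this (the 'entry-points.json' filename and its layout are illustrative only, not part of any spec):

    import json
    import os
    import sys

    def iter_group(group):
        # Yield (name, object reference) pairs for a group, in sys.path
        # order, from a hypothetical per-directory cache file.
        for path in sys.path:
            cache_path = os.path.join(path, 'entry-points.json')
            try:
                with open(cache_path) as f:
                    data = json.load(f)
            except (OSError, ValueError):
                continue  # no cache here; a real tool might fall back to a scan
            for name, objref in data.get(group, {}).items():
                yield name, objref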
On Sat, Oct 21, 2017, at 07:59 AM, Nick Coghlan wrote:
Yeah, here's the gist of what I had in mind regarding the malware problem (i.e. aiming to ensure we don't get all of setup.py's problems back again):
- a package's own install hooks do *not* get called for that package
- hooks only run by default inside a virtualenv as a regular user
- outside a virtualenv, the default is "hooks don't get run at all"
This one would make caching much less useful for me, because I install a lot of stuff with 'pip install --user'.
I'm not really sure how useful this protection is. A malicious package can shadow common module names and command names, so once it's installed, it has an excellent chance of getting to run code, even without hooks. And virtualenvs are not a security boundary - malware installed in a virtualenv is just as bad as malware installed with --user. Moving away from running 'setup.py' to install stuff protects us against packages doing silly things like running pip in a subprocess, but it provides very little protection against deliberately malicious packages. If we're going to do package install hooks, let's not cripple them by trying to introduce security that doesn't really achieve much.
Nathaniel raises the point that it may be easier to convince other package managers to regenerate an entry points cache than to call arbitrary Python hooks on install. I guess the key question here is: how many other use cases can we see for package install/uninstall hooks, and how would those work with other packaging systems?
Thomas
On 26 October 2017 at 18:33, Thomas Kluyver <thomas@kluyver.me.uk> wrote:
Nathaniel raises the point that it may be easier to convince other package managers to regenerate an entry points cache than to call arbitrary Python hooks on install.
At least for RPM, we have file triggers now, whereby system packages can register a hook to say "Any time another package touches a file under <path of interest> I want to know about it". That means the exact semantics of any RPM integration would likely end up just living in a file trigger, so it wouldn't matter too much whether that trigger was "refresh these predefined caches" or "run any installed hooks based on the defined Python level metadata". However, I expect it would be much easier to define an "optionally export data for caching in a more efficient key value store" API than it would be to define an API for arbitrary pre-/post- [un]install hooks. In particular, a caching API is much easier to *repair*, since the "source of truth" remains the installation DB itself - the cache is just there to speed up runtime lookups. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
I agree. The "malware" problem is really a "how do I understand which hooks run in each environment" problem. The hooks could slow down or confuse, frustrate people in ways that were unrelated to any malicious intent. The caching could just be a more efficient, lossless representation of the *.dist/egg-info data model. Would something as simple as a file per sys.path with the 'last modified by installer' date be helpful? You could check those to determine whether your cache was out of date. Another option would be to try to investigate whether the per-sys-path operations that 'import x' has to do anyway can be cached and shared with pkg_resources? On Thu, Oct 26, 2017 at 8:21 AM Nick Coghlan <ncoghlan@gmail.com> wrote:
[...]
On Thu, Oct 26, 2017, at 03:57 PM, Daniel Holth wrote:
Would something as simple as a file per sys.path entry with the 'last modified by installer' date be helpful? You could check those to determine whether your cache was out of date.
I wonder if we could use the directory mtime for this? It's only really useful if we can be confident that all installer tools will update it.
On 27 October 2017 at 01:45, Thomas Kluyver <thomas@kluyver.me.uk> wrote:
On Thu, Oct 26, 2017, at 03:57 PM, Daniel Holth wrote:
Would something as simple as a file per sys.path with the 'last modified by installer' date be helpful? You could check those to determine whether your cache was out of date.
I wonder if we could use the directory mtime for this? It's only really useful if we can be confident that all installer tools will update it.
There are lots of options for this, and one thing worth keeping in mind is compatibility with the "monolithic system package" model, where the entire preconfigured virtual environment gets archived, and then dropped into place on the target system. In such cases, filesystem level mtimes may change *without* the entry point cache actually being out of date.
In that context, it's worth keeping in mind what the actual goals of the cache will be:
1. The entry point cache should ideally reflect the state of installed components in a given execution environment at the time of access. If this is not true, installing a component may require explicit cache invalidation/rebuilding to get things back to a consistent state (similar to the way a call to importlib.invalidate_caches() is needed to reliably see filesystem changes)
2. Checking for available entry points in a given group should be consistently cheap (ideally O(1)), rather than scaling with the number of packages installed or the number of sys.path entries
Given those goals, there are a number of different points in time where the cache can be generated, each with different trade-offs between how reliably fresh the cache is, and how frequently you have to rebuild the cache.
Option 1: in-memory cache
* Pro: consistent with the way importlib caches work
* Pro: automatically adjusts to sys.path changes
* Pro: will likely be needed regardless to handle per-path-entry caches with other methods
* Con: every process incurs at least 1 linear DB read
* Con: zero pay-off if you only query one entry point group
* Con: requires explicit invalidation to pick up filesystem changes (but can hook into importlib.invalidate_caches())
Option 2: temporary (or persistent) per-user-session cache
* Pro: only the first query per path entry per user session incurs a linear DB read
* Pro: given persistent cache dirs (e.g. XDG_CACHE_HOME, ~/.cache) even that overhead can be avoided
* Pro: sys.path directory mtimes are sufficient for cache invalidation (subject to filesystem timestamp granularity)
* Pro: zero elevated privileges needed (cache would be stored in a per-user directory tree)
* Con: interprocess locking likely needed to avoid the "thundering herd" cache update problem [1]
* Con: if a non-persistent storage location is used, zero benefit over an in-memory cache for throwaway environments (e.g. container startup)
* Con: cost of the cache freshness check will still scale linearly with the number of sys.path entries
Option 3: persistent per-path-entry cache
* Pro: assuming cache freshness means zero runtime queries incur a linear DB read (cache creation becomes an install time cost)
* Con: if you don't assume cache freshness, you need option 1 or 2 anyway, and the install time cache just speeds up that first linear read
* Con: filesystem access control requires either explicit cache refresh or implicit metadata caching support in installers
* Con: sys.path directory mtimes are no longer sufficient for cache invalidation (due to potential for directory relocation)
* Con: interprocess locking arguably still needed to avoid the "thundering herd" cache update problem (just between installers rather than runtime processes)
Given those trade-offs, I think it would probably make the most sense to start out by exploring a combination of options 1 & 2, and then only explore option 3 based on demonstrated performance problems with a per-user-session caching model.
My rationale for that is that even in an image based "immutable infrastructure" deployment model, it's often entirely feasible to preseed runtime caches as part of the build process, and in cases where that *isn't* possible, you're likely also going to have trouble generating per-path-entry caches. Cheers, Nick. [1] https://en.wikipedia.org/wiki/Thundering_herd_problem -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
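To illustrate the "can hook into importlib.invalidate_caches()" point for Option 1: invalidate_caches() calls invalidate_caches() on every sys.meta_path entry that defines it, so an in-memory entry point cache can piggyback on that with a do-nothing finder (a sketch only; _scan() is a stand-in for the actual metadata read):

    import sys

    class EntryPointCache:
        # Not a real import finder: find_spec() always declines, but
        # importlib.invalidate_caches() will still call invalidate_caches().
        def __init__(self):
            self._data = None

        def find_spec(self, name, path=None, target=None):
            return None  # never handles any imports

        def invalidate_caches(self):
            self._data = None

        def lookup(self, group):
            if self._data is None:
                self._data = self._scan()  # the one linear read per process
            return self._data.get(group, {})

        def _scan(self):
            return {}  # placeholder for scanning installed metadata

    _cache = EntryPointCache()
    sys.meta_path.append(_cache)
    # importlib.invalidate_caches() now resets this cache too.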
On Thu, Oct 26, 2017 at 9:02 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Option 2: temporary (or persistent) per-user-session cache
* Pro: only the first query per path entry per user session incurs a linear DB read
* Pro: given persistent cache dirs (e.g. XDG_CACHE_HOME, ~/.cache) even that overhead can be avoided
* Pro: sys.path directory mtimes are sufficient for cache invalidation (subject to filesystem timestamp granularity)
Timestamp granularity is a solvable problem. You just have to be careful not to write out the cache unless the directory mtime is sufficiently far in the past, like 10 seconds old, say. (This is an old trick that VCSes use to make commands like 'git status' fast-and-reliable.) This does mean you can get in a weird state where if the directory mtime somehow gets set to the future, then start time starts sucking because caching goes away. Note also that you'll want to explicitly write the observed directory mtime to the cache file, rather than comparing it to the cache file's mtime, to avoid the race condition where the directory gets modified just after we scan it but before we write out the cache.
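A sketch of both tricks together - the 10 second window is arbitrary, and the scan() callable stands in for however the directory actually gets read:

    import json
    import os
    import time

    FRESHNESS_WINDOW = 10  # seconds; arbitrary threshold, per the VCS trick

    def scan_and_maybe_cache(directory, cache_path, scan):
        # Record the mtime *before* scanning: if the directory changes
        # between the scan and the write, the recorded mtime no longer
        # matches, and the cache gets treated as stale next time.
        observed_mtime = os.stat(directory).st_mtime
        entries = scan(directory)
        if time.time() - observed_mtime >= FRESHNESS_WINDOW:
            with open(cache_path, 'w') as f:
                json.dump({'dir_mtime': observed_mtime, 'entries': entries}, f)
        return entries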
* Pro: zero elevated privileges needed (cache would be stored in a per-user directory tree)
* Con: interprocess locking likely needed to avoid the "thundering herd" cache update problem [1]
Interprocess filesystem locking is going to be far more painful than any problem it might solve. Seriously. At least on Unix, the right approach is to go ahead and regenerate the cache, and then atomically write it to the given place, and if someone else overwrites it a few milliseconds later then oh well. I guess on Windows locking might be OK, given that it has no atomic writes and less gratuitously broken filesystem locking. But you'd still want to make sure you never block when acquiring the lock; if the lock is already taken because someone else is in the middle of updating the cache, then you need to fall back on doing a linear scan. This is explicitly *not* avoiding the thundering herd problem, because it's more important to avoid the "one process got stuck and now everyone else freezes on startup waiting for it" problem.
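That regenerate-and-atomically-replace approach might look like this (a sketch; os.replace is atomic when source and destination are on the same filesystem, which writing the temp file into the cache's own directory guarantees):

    import json
    import os
    import tempfile

    def write_cache_atomically(cache_path, data):
        # No locks: last writer wins, and readers never see a partial file.
        fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(cache_path),
                                        suffix='.tmp')
        try:
            with os.fdopen(fd, 'w') as f:
                json.dump(data, f)
            os.replace(tmp_path, cache_path)
        except BaseException:
            os.unlink(tmp_path)
            raise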
* Con: if a non-persistent storage location is used, zero benefit over an in-memory cache for throwaway environments (e.g. container startup)
You also have to be careful about whether you have a writeable storage location at all, and if so whether you have the right permissions. (It might be bad if 'sudo somescript.py' leaves me with root-owned cache files in /home/njs/.cache/.) Filesystems are just a barrel of fun.
* Con: cost of the cache freshness check will still scale linearly with the number of sys.path entries
Option 3: persistent per-path-entry cache
* Pro: assuming cache freshness means zero runtime queries incur a linear DB read (cache creation becomes an install time cost)
* Con: if you don't assume cache freshness, you need option 1 or 2 anyway, and the install time cache just speeds up that first linear read
* Con: filesystem access control requires either explicit cache refresh or implicit metadata caching support in installers
* Con: sys.path directory mtimes are no longer sufficient for cache invalidation (due to potential for directory relocation)
Not sure what problem you're thinking of here? In this model we wouldn't be using mtimes for cache invalidation anyway, because it'd be the responsibility of those modifying the directory to update the cache. And if you rename a whole directory, that doesn't affect its mtime anyway?
* Con: interprocess locking arguably still needed to avoid the "thundering herd" cache update problem (just between installers rather than runtime processes)
If two installers are trying to rearrange the same directory at the same time then they can conflict in lots of ways. For the most part people get away with it because doing multiple 'pip install' runs in parallel is generally considered a Bad Idea and unlikely to happen by accident; and if it is a problem then we should add locking anyway (like dpkg and rpm already do), regardless of the cache update part. -n -- Nathaniel J. Smith -- https://vorpus.org
On 27 October 2017 at 18:10, Nathaniel Smith <njs@pobox.com> wrote:
On Thu, Oct 26, 2017 at 9:02 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Option 2: temporary (or persistent) per-user-session cache
* Pro: only the first query per path entry per user session incurs a linear DB read
* Pro: given persistent cache dirs (e.g. XDG_CACHE_HOME, ~/.cache) even that overhead can be avoided
* Pro: sys.path directory mtimes are sufficient for cache invalidation (subject to filesystem timestamp granularity)
Timestamp granularity is a solvable problem. You just have to be careful not to write out the cache unless the directory mtime is sufficiently far in the past, like 10 seconds old, say. (This is an old trick that VCSes use to make commands like 'git status' fast-and-reliable.)
Yeah, we just recently fixed a bug related to that in pyc file caching (if you managed to modify and reload a source file multiple times in the same second, we could end up missing the later edits; the fix was to check that the source timestamp didn't match the current timestamp before actually updating the cached copy on the filesystem).
This does mean you can get in a weird state where if the directory mtime somehow gets set to the future, then start time starts sucking because caching goes away.
For pyc files, we're able to avoid that by looking for cache *inconsistency*, without making any assumptions about which direction time moves - as long as the source timestamp recorded in the pyc file doesn't match the source file's mtime, we'll refresh the cache. This is necessary to cope with things like version controlled directories, where directory mtimes can easily go backwards because you switched branches or reverted to an earlier version.
Note also that you'll want to explicitly write the observed directory mtime to the cache file, rather than comparing it to the cache file's mtime, to avoid the race condition where the directory gets modified just after we scan it but before we write out the cache.
* Pro: zero elevated privileges needed (cache would be stored in a per-user directory tree)
* Con: interprocess locking likely needed to avoid the "thundering herd" cache update problem [1]
Interprocess filesystem locking is going to be far more painful than any problem it might solve. Seriously. At least on Unix, the right approach is to go ahead and regenerate the cache, and then atomically write it to the given place, and if someone else overwrites it a few milliseconds later then oh well.
Aye, limiting the handling for this to the use of atomic writes is likely an entirely reasonable approach to take.
I guess on Windows locking might be OK, given that it has no atomic writes and less gratuitously broken filesystem locking.
The os module has atomic write support on Windows in 3.x now: https://docs.python.org/3/library/os.html#os.replace So the only problematic case is 2.7 on Windows, and for that Christian Heimes backported pyosreplace here: https://pypi.org/project/pyosreplace/ (The "may be non-atomic" case is the same situation where it will fail outright on POSIX systems: when you're attempting to do the rename across filesystems. If you stay within the same directory, which you want to do anyway for permissions inheritance and automatic file labeling, it's atomic).
But you'd still want to make sure you never block when acquiring the lock; if the lock is already taken because someone else is in the middle of updating the cache, then you need to fall back on doing a linear scan. This is explicitly *not* avoiding the thundering herd problem, because it's more important to avoid the "one process got stuck and now everyone else freezes on startup waiting for it" problem.
Fair point.
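A rough sketch of that non-blocking behaviour on Unix (fcntl is POSIX-only, and scan_entry_points is a hypothetical fallback that reads every package's metadata directly):

    import fcntl
    import os

    def refresh_cache_or_scan(path_entry):
        """Never wait for the cache lock: if another process is already
        rebuilding, do our own linear scan instead of freezing at startup."""
        lock = open(os.path.join(path_entry, ".entry-points.lock"), "w")
        try:
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            lock.close()
            return scan_entry_points(path_entry)  # hypothetical slow path
        try:
            data = scan_entry_points(path_entry)
            write_cache_atomically(path_entry, data)  # sketch from above
            return data
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
            lock.close()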
* Con: if a non-persistent storage location is used, zero benefit over an in-memory cache for throwaway environments (e.g. container startup)
You also have to be careful about whether you have a writeable storage location at all, and if so whether you have the right permissions. (It might be bad if 'sudo somescript.py' leaves me with root-owned cache files in /home/njs/.cache/.)
Filesystems are just a barrel of fun.
C'mon, who doesn't enjoy debugging SELinux file labeling problems arising from mounting symlinked host directories into Docker containers running as root internally? :)
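One hedged way to avoid the root-owned-cache-files problem is to refuse to cache at all when the storage location is suspect; the SUDO_USER heuristic in this sketch is just one illustrative check, not a complete answer:

    import os

    def user_cache_dir():
        """Pick a per-user cache directory, or None to disable caching."""
        # Running under sudo: writing to $HOME would leave root-owned
        # files in the invoking user's cache tree.
        if os.name == "posix" and "SUDO_USER" in os.environ:
            return None
        base = os.environ.get("XDG_CACHE_HOME",
                              os.path.expanduser("~/.cache"))
        path = os.path.join(base, "entry-points")
        try:
            os.makedirs(path, exist_ok=True)
        except OSError:
            return None
        return path if os.access(path, os.W_OK) else None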
* Con: cost of the cache freshness check will still scale linearly with the number of sys.path entries
Option 3: persistent per-path-entry cache
* Pro: assuming cache freshness means zero runtime queries incur a linear DB read (cache creation becomes an install time cost)
* Con: if you don't assume cache freshness, you need option 1 or 2 anyway, and the install time cache just speeds up that first linear read
* Con: filesystem access control requires either explicit cache refresh or implicit metadata caching support in installers
* Con: sys.path directory mtimes are no longer sufficient for cache invalidation (due to potential for directory relocation)
Not sure what problem you're thinking of here? In this model we wouldn't be using mtimes for cache invalidation anyway, because it'd be the responsibility of those modifying the directory to update the cache. And if you rename a whole directory, that doesn't affect its mtime anyway?
Your second sentence is what I meant - whether the cache is still valid or not is less about the mtime, and more about what other actions have been performed. (It's much closer to the locate/updatedb model, where the runtime part just assumes the cache is valid, and it's somebody else's problem to ensure that assumption is reasonably valid)
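In code, the locate/updatedb-style runtime side could be as simple as this sketch (scan_entry_points again being a hypothetical slow path):

    import json
    import os

    def entry_points_for(path_entry):
        """Trust the installer-maintained cache if it exists; the runtime
        does no freshness checking of its own in this model."""
        try:
            with open(os.path.join(path_entry, "entry-points.json")) as f:
                return json.load(f)
        except OSError:
            return scan_entry_points(path_entry)  # no cache: scan directly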
* Con: interprocess locking arguably still needed to avoid the "thundering herd" cache update problem (just between installers rather than runtime processes)
If two installers are trying to rearrange the same directory at the same time then they can conflict in lots of ways. For the most part people get away with it because doing multiple 'pip install' runs in parallel is generally considered a Bad Idea and unlikely to happen by accident; and if it is a problem then we should add locking anyway (like dpkg and rpm already do), regardless of the cache update part.
Another fair point. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Fri, Oct 27, 2017 at 5:34 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 27 October 2017 at 18:10, Nathaniel Smith <njs@pobox.com> wrote:
On Thu, Oct 26, 2017 at 9:02 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Option 2: temporary (or persistent) per-user-session cache
* Pro: only the first query per path entry per user session incurs a linear DB read
* Pro: given persistent cache dirs (e.g. XDG_CACHE_HOME, ~/.cache) even that overhead can be avoided
* Pro: sys.path directory mtimes are sufficient for cache invalidation (subject to filesystem timestamp granularity)
Timestamp granularity is a solvable problem. You just have to be careful not to write out the cache unless the directory mtime is sufficiently far in the past, like 10 seconds old, say. (This is an old trick that VCSes use to make commands like 'git status' fast-and-reliable.)
Yeah, we just recently fixed a bug related to that in pyc file caching. (If you managed to modify and reload a source file multiple times in the same second, we could end up missing the later edits. The fix was to check that the source timestamp didn't match the current timestamp before actually updating the cached copy on the filesystem.)
This does mean you can get in a weird state where if the directory mtime somehow gets set to the future, then start time starts sucking because caching goes away.
For pyc files, we're able to avoid that by looking for cache *inconsistency* without making any assumptions about which direction time moves - as long as the source timestamp recorded in the pyc file doesn't match the source file's mtime, we'll refresh the cache.
This is necessary to cope with things like version controlled directories, where directory mtimes can easily go backwards because you switched branches or reverted to an earlier version.
Yeah, this is a good idea, but it doesn't address the reason why some systems refuse to update their caches when they see mtimes in the future. The motivation there is that if the mtime is in the future, then it's possible that at some point in the future, the mtime will match the current time, and then if the directory is modified at that moment, the cache will become silently invalid. It's not clear how important this really is; you have to get somewhat unlucky, and if you're seeing timestamps from the future then timekeeping has obviously broken down somehow and nothing based on mtimes can be reliable without reliable timekeeping. (For example, even if the mtime seems to be in the past, the clock could get set backwards and now the same mtime is in the future after all.) But that's the reasoning I've seen.
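The conservative guard those systems apply amounts to a one-line check; this sketch just names the reasoning above:

    import os
    import time

    def mtime_is_trustworthy(path_entry):
        """A directory mtime in the future is suspect: the clock may later
        catch up to it, and a modification at that exact moment would be
        invisible to mtime-based invalidation."""
        return os.stat(path_entry).st_mtime <= time.time()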
The os module has atomic write support on Windows in 3.x now: https://docs.python.org/3/library/os.html#os.replace
So the only problematic case is 2.7 on Windows, and for that Christian Heimes backported pyosreplace here: https://pypi.org/project/pyosreplace/
(The "may be non-atomic" case is the same situation where it will fail outright on POSIX systems: when you're attempting to do the rename across filesystems. If you stay within the same directory, which you want to do anyway for permissions inheritance and automatic file labeling, it's atomic).
I've never been able to tell whether this is trustworthy or not; MS documents the rename-across-filesystems case as an *example* of a case where it's non-atomic, and doesn't document any atomicity guarantees either way. Is it really atomic on FAT filesystems? On network filesystems? (Do all versions of CIFS even give a way to express file replacement as a single operation?) But there's folklore saying it's OK... I guess in this case atomicity wouldn't be that crucial anyway though.
Option 3: persistent per-path-entry cache
* Pro: assuming cache freshness means zero runtime queries incur a linear DB read (cache creation becomes an install time cost)
* Con: if you don't assume cache freshness, you need option 1 or 2 anyway, and the install time cache just speeds up that first linear read
* Con: filesystem access control requires either explicit cache refresh or implicit metadata caching support in installers
* Con: sys.path directory mtimes are no longer sufficient for cache invalidation (due to potential for directory relocation)
Not sure what problem you're thinking of here? In this model we wouldn't be using mtimes for cache invalidation anyway, because it'd be the responsibility of those modifying the directory to update the cache. And if you rename a whole directory, that doesn't affect its mtime anyway?
Your second sentence is what I meant - whether the cache is still valid or not is less about the mtime, and more about what other actions have been performed. (It's much closer to the locate/updatedb model, where the runtime part just assumes the cache is valid, and it's somebody else's problem to ensure that assumption is reasonably valid)
Yeah. Which is probably the big issue with your third approach: it'll probably work great if all installers are updated to properly manage the cache. Explicit cache invalidation is fast and reliable and avoids all these mtime shenanigans... if everyone implements it properly. But currently there's lots of software that feels free to randomly dump stuff into sys.path and doesn't know about the cache invalidation thing (e.g. old versions of pip and setuptools), and that's a disaster in a pure explicit invalidation model. I guess in practice the solution for the transition period would be to also store the mtime in the cache, so you can at least detect with high probability when someone has used a legacy installer, and yell at them to stop doing that. Though this might then cause problems if people do stuff like zip up their site-packages and then unzip it somewhere else, updating the mtimes in the process. -n -- Nathaniel J. Smith -- https://vorpus.org
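The transition-period safety net described above might look like this sketch, reusing the illustrative "dir_mtime" field from earlier and the hypothetical scan_entry_points fallback:

    import os
    import warnings

    def check_for_legacy_installers(path_entry, cache):
        """If the directory mtime no longer matches the one recorded when
        the cache was built, something modified the directory without
        refreshing the cache: warn, then rescan to stay correct."""
        if cache.get("dir_mtime") != os.stat(path_entry).st_mtime:
            warnings.warn(
                "%s changed without an entry-points.json update; was it "
                "modified by a cache-unaware installer?" % path_entry)
            return scan_entry_points(path_entry)  # hypothetical slow path
        return cache["entry_points"]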
participants (12)
- Alex Grönholm
- Brett Cannon
- Daniel Holth
- Donald Stufft
- Doug Hellmann
- Marius Gedminas
- Nathaniel Smith
- Nick Coghlan
- Paul Moore
- Thomas Kluyver
- Tres Seaver
- Wes Turner