Hi all,

Again trying to split out some more focused discussion from the big thread about sdists...

One big theme there has been the problem of "sources of truth": e.g. in current sdists, there is a PKG-INFO file that has lots of static metadata in it, but because the "real" version of that metadata is in setup.py, everyone ignores PKG-INFO.

A clear desideratum for a new sdist format is that we avoid this problem, by having static metadata that is actually trustworthy. I see two fundamentally different strategies that we might use to accomplish this. In time-honored mailing list tradition, these are of course the one that I hear other people advocating and the one that I like ;-).

The first strategy is: sdists and the wheels they generate logically share the same metadata; so, we need some mechanism to enforce that whatever static metadata is in the sdist will match the metadata in the resulting wheel. (The wheel might potentially have additional metadata beyond what is in the sdist, but anything that overlaps has to match.) An open question is what this mechanism will look like -- if everyone used distutils/setuptools, then we could write the code in distutils/setuptools so that when it generated wheel metadata, it always copied it directly out of the sdist metadata (when present). But not everyone will use distutils/setuptools, because distutils delenda est. So we need some mechanism to statically analyze an arbitrary build system and prove things about the data it outputs. Which sounds... undecidable. Or we could have some kind of after-the-fact enforcement mechanism, where tools like pip are required -- as the last step when building a wheel from an sdist -- to double-check that all the metadata matches, and if it doesn't then they produce a hard error and refuse to continue. But even this wouldn't necessarily guarantee that PyPI can trust the metadata, since PyPI is not going to run this enforcement mechanism...

The second strategy is: put static metadata in both sdists and wheels, but treat them as logically distinct things: the static metadata in sdists is the source of truth for information *about that sdist* (sdist name, sdist version, sdist description, sdist authors, etc.), and the static metadata in wheels is the source of truth for information about that wheel, but we think of these as distinct things and don't pretend that we can statically guarantee that they will match. I mean, in practice, they basically always will match. But IMO making this distinction in our minds leads to clearer thinking. When PyPI needs to know the name/version/description for an sdist, it can still do that; and since we've lowered our ambitions to only finding the sdist name instead of the wheel name, it can actually do it reliably in a totally static way, without having to run arbitrary code to validate this. OTOH pip will always have to be prepared to handle the possibility of mismatch between what it was expecting based on the sdist metadata and what it actually got after building it, so we might as well acknowledge that in our mental model.

One potential advantage of this approach is that we might be able to talk ourselves into trusting the existing PKG-INFO as providing static metadata about the sdist, and thus PyPI at least could start trusting it for things like the "description" field, and if we define a new sdist format then it would be possible to generate its static metadata from current setup.py files (e.g. by modifying setuptools's sdist command). Contrast this with the other approach, where getting any kind of static source-of-truth would require rewriting almost all existing setup.py files.

The challenge, of course, is that there are a few places where pip actually does need to know something about wheels based on examining an sdist -- in particular name and version and (controversially) dependencies. But this can/should be addressed explicitly, e.g. by writing down a special rule about the name and version fields.

-n

--
Nathaniel J. Smith -- http://vorpus.org
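To make the "after-the-fact enforcement" idea from the first strategy concrete, here is a minimal sketch of the check a tool like pip might run once a wheel has been built. It assumes both metadata files use the RFC 822-style header layout that PKG-INFO uses today. The function names, the set of multi-valued fields, and the choice to ignore Metadata-Version are illustrative assumptions, not anything pip actually implements.

```python
from email.parser import Parser

# Assumption: these fields may repeat, so they are compared as sorted lists.
MULTI_VALUE_FIELDS = {"classifier", "requires-dist", "platform"}
# Assumption: the format version describes the file, not the project, so skip it.
IGNORED_FIELDS = {"metadata-version"}


def parse_metadata(text):
    """Parse RFC 822-style metadata text into a dict keyed by lowercase field name."""
    msg = Parser().parsestr(text)
    fields = {}
    for key in set(k.lower() for k in msg.keys()):
        if key in IGNORED_FIELDS:
            continue
        values = msg.get_all(key)
        fields[key] = sorted(values) if key in MULTI_VALUE_FIELDS else values[0]
    return fields


def find_mismatches(sdist_text, wheel_text):
    """Return {field: (sdist_value, wheel_value)} for shared fields that differ."""
    sdist = parse_metadata(sdist_text)
    wheel = parse_metadata(wheel_text)
    return {
        key: (sdist[key], wheel[key])
        for key in sdist.keys() & wheel.keys()
        if sdist[key] != wheel[key]
    }


def enforce_match(sdist_text, wheel_text):
    """Hard error if any field present in both files disagrees."""
    mismatches = find_mismatches(sdist_text, wheel_text)
    if mismatches:
        raise ValueError("wheel metadata contradicts sdist: %r" % sorted(mismatches))
```

Only fields present in both files are compared, so the wheel remains free to carry extra metadata. Note that the check can only run after the build, which is exactly why it does nothing for PyPI.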
On 12 October 2015 at 18:36, Nathaniel Smith <njs@pobox.com> wrote:
Hi all,
Again trying to split out some more focused discussion from the big thread about sdists...
One big theme there has been the problem of "sources of truth": e.g. in current sdists, there is a PKG-INFO file that has lots of static metadata in it, but because the "real" version of that metadata is in setup.py, everyone ignores PKG-INFO.
A clear desideratum for a new sdist format is that we avoid this problem, by having static metadata that is actually trustworthy. I see two fundamentally different strategies that we might use to accomplish this. In time honored mailing list tradition, these are of course the one that I hear other people advocating and the one that I like ;-).
The first strategy is: sdists and the wheels they generate logically share the same metadata; so, we need some mechanism to enforce that
This is false: they don't share the same metadata. Some portions are the same, but deps, supported platforms, those will differ (and perhaps more than that). In particular, an sdist doesn't have a dependency on an ABI, and a wheel doesn't have a dependency on an API. Some APIs are ABIs (approximately true for all pure Python packages, for instance), but some are not (numpy).
The second strategy is: put static metadata in both sdists and wheels, but treat them as logically distinct things: the static metadata in sdists is the source of truth for information *about that sdist* (sdist name, sdist version, sdist description, sdist authors, etc.), and the static metadata in wheels is the source of truth for information about that wheel, but we think of these as distinct things and don't pretend that we can statically guarantee that they will match. I mean, in practice, they basically always will match.
The analogous current data won't match for packages that use pbr when we fix https://bugs.launchpad.net/pbr/+bug/1502692 (older versions of pip don't support PEP 426 environment markers, but don't error when they are used either, leading to silent failure to install dependencies). Now, you might say 'hey, but the new shiny will support markers from day one'. Well, the problem is backwards compat: we're going to have future things that change, and the more we split things out, the more those changes are likely to need skewed results like this to deal with them. ...
the sdist name instead of the wheel name, it can actually do it
but the sdist and the wheel have to have the same name - or do you mean the filename on disk vs. the distribution name?
reliably in a totally static way, without having to run arbitrary code to validate this. OTOH pip will always have to be prepared to handle the possibility of mismatch between what it was expecting based on the sdist metadata and what it actually got after building it, so we might as well acknowledge that in our mental model.
One potential advantage of this approach is that we might be able to talk ourselves into trusting the existing PKG-INFO as providing static metadata about the sdist, and thus PyPI at least could start trusting it for things like the "description" field, and if we define a new
The challenge is the 40K broken packages up there on PyPI. Basically pip has a bugfix for any of:
- sdists built using distutils
- sdists built using random build systems that don't understand what an sdist is (e.g. automake)
- sdists built using versions of setuptools that had a bug in this area

There is no corrective mechanism for broken packages other than route-around-it-while-you-ask-the-author-to-upload-a-fix.

So I think to tackle the 'please trust the metadata in the sdist' problem, one needs to have a graceful ramp-up of that trust with robust backoff mechanisms that don't involve 50% of PyPI users hating on that one old project in the corner everyone has a dep on but that is actually moribund and not doing uploads. I can imagine several such routes, including a crowdsourced blacklist - but it's going to be (as we are already seeing with the automatic wheel cache) years of bug reports until things age out.
sdist format then it would be possible to generate its static metadata from current setup.py files (e.g. by modifying setuptools's sdist command). Contrast this with the other approach, where getting any kind of static source-of-truth would require rewriting almost all existing setup.py files.
We already generate static metadata from current setup.py files: setup.py egg_info does precisely that. There, bug fixed ;).
The challenge, of course, is that there are a few places where pip actually does need to know something about wheels based on examining an sdist -- in particular name and version and (controversially) dependencies. But this can/should be addressed explicitly, e.g. by writing down a special rule about the name and version fields.
I'm sorry, I don't follow.

-Rob

--
Robert Collins <rbtcollins@hp.com>
Distinguished Technologist
HP Converged Cloud
On Sun, Oct 11, 2015 at 11:00 PM, Robert Collins <robertc@robertcollins.net> wrote:
On 12 October 2015 at 18:36, Nathaniel Smith <njs@pobox.com> wrote: [...]
the sdist name instead of the wheel name, it can actually do it
but the sdist and the wheel have to have the same name - or do you mean the filename on disk vs. the distribution name?
I mean the distribution name - there's no way to guarantee that building foo-1.0.zip won't spit out bar-7.4.whl, where by "no way" I mean "it's literally undecidable". I mean, if someone actually did this it would be super weird and we would all shun them, but our code and specs still need to be prepared for the possibility.

IIUC this is why PyPI can't trust PKG-INFO: 99.9% of the time the metadata in PKG-INFO matches what you will get when you run setup.py, but right now PyPI wants to know what setup.py will do, and there's no way to know if it will be the same as what PKG-INFO says, so it just doesn't trust PKG-INFO. OTOH if we redefine PyPI's goal as being to figure out what's in PKG-INFO (or whatever replaces it), and declare that it's okay (for PyPI's purposes) if that doesn't match what the build system will eventually do, then that's a viable way forward.
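As a rough illustration of "figure out what's in PKG-INFO" without running anything, here is a small sketch that reads the file straight out of a .tar.gz sdist. The helper names are made up for this example, and it assumes PKG-INFO sits directly inside the archive's single top-level directory, which is where setuptools-built sdists put it.

```python
import tarfile
from email.parser import Parser


def read_sdist_metadata(fileobj):
    """Return the parsed PKG-INFO from a .tar.gz sdist without executing its code."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        for member in archive.getmembers():
            # Assumption: the file lives at <top-level-dir>/PKG-INFO.
            parts = member.name.split("/")
            if len(parts) == 2 and parts[1] == "PKG-INFO":
                raw = archive.extractfile(member).read().decode("utf-8")
                return Parser().parsestr(raw)
    raise LookupError("no top-level PKG-INFO found in sdist")


def describe(metadata):
    """Pick out the fields an index might want to display."""
    return {
        "name": metadata["Name"],
        "version": metadata["Version"],
        "summary": metadata["Summary"],
        "classifiers": metadata.get_all("Classifier") or [],
    }
```

Nothing here says anything about what a build of the sdist would produce; it only reports what the sdist claims about itself, which is the whole point of the second strategy.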
reliably in a totally static way, without having to run arbitrary code to validate this. OTOH pip will always have to be prepared to handle the possibility of mismatch between what it was expecting based on the sdist metadata and what it actually got after building it, so we might as well acknowledge that in our mental model.
One potential advantage of this approach is that we might be able to talk ourselves into trusting the existing PKG-INFO as providing static metadata about the sdist, and thus PyPI at least could start trusting it for things like the "description" field, and if we define a new
The challenge is the 40K broken packages up there on PyPI. Basically pip has a bugfix for any of:
- sdists built using distutils
- sdists built using random build systems that don't understand what an sdist is (e.g. automake)
- sdists built using versions of setuptools that had a bug in this area
There is no corrective mechanism for broken packages other than route-around-it-while-you-ask-the-author-to-upload-a-fix.
IIUC what PyPI wants to do with PKG-INFO is read out stuff like the description and trove classifiers fields. Are there really 40K sdists on PyPI that have PKG-INFO files and where those files contain incorrect descriptions and so forth? I mean, obviously someone would have to check :-) But it seems unlikely, since almost everyone uploads by running 'sdist upload' or twine or something similarly automated.
So I think to tackle the 'please trust the metadata in the sdist' problem, one needs to have a graceful ramp-up of that trust with robust backoff mechanisms that don't involve 50% of PyPI users hating on that one old project in the corner everyone has a dep on but that is actually moribund and not doing uploads. I can imagine several such routes, including a crowdsourced blacklist - but it's going to be (as we are already seeing with the automatic wheel cache) years of bug reports until things age out.
sdist format then it would be possible to generate its static metadata from current setup.py files (e.g. by modifying setuptools's sdist command). Contrast this with the other approach, where getting any kind of static source-of-truth would require rewriting almost all existing setup.py files.
We already generate static metadata from current setup.py files: setup.py egg_info does precisely that. There, bug fixed ;).
I'm pretty sure that merely making it so 'setup.py sdist' created a file that contained the output from egg_info would not solve the current problem. That's pretty much exactly what the existing PKG-INFO *is*, isn't it? Yet apparently no-one trusts it.
The challenge, of course, is that there are a few places where pip actually does need to know something about wheels based on examining an sdist -- in particular name and version and (controversially) dependencies. But this can/should be addressed explicitly, e.g. by writing down a special rule about the name and version fields.
I'm sorry, I don't follow.
E.g., we can document that if you have a sdist foo-1.0, then pip and similar tools will expect this to generate a foo-1.0 wheel (but be prepared to do something sensible if this doesn't happen, like give an error message or whatever). That's really all pip needs, right?

-n

--
Nathaniel J. Smith -- http://vorpus.org
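The "special rule about name and version" could be as small as the following sketch. It leans on the wheel filename layout from PEP 427 (name-version[-build]-python-abi-platform.whl) and on the usual loose name comparison where case is ignored and runs of "-", "_" and "." are equivalent. The function names are hypothetical, and comparing versions as plain strings is a simplifying assumption; a real tool would normalize them first.

```python
import re


def normalize(name):
    """Compare names loosely: case-insensitive, with runs of -, _ and . equivalent."""
    return re.sub(r"[-_.]+", "-", name).lower()


def wheel_name_version(wheel_filename):
    """Split the name and version off a PEP 427 style wheel filename."""
    if not wheel_filename.endswith(".whl"):
        raise ValueError("not a wheel: %s" % wheel_filename)
    parts = wheel_filename[: -len(".whl")].split("-")
    # name-version[-build]-python-abi-platform gives 5 or 6 components.
    if len(parts) not in (5, 6):
        raise ValueError("unexpected wheel filename: %s" % wheel_filename)
    return parts[0], parts[1]


def check_built_wheel(sdist_name, sdist_version, wheel_filename):
    """Raise if building sdist foo-1.0 produced something other than a foo-1.0 wheel."""
    name, version = wheel_name_version(wheel_filename)
    # Assumption: versions are compared as plain strings, with no normalization.
    if normalize(name) != normalize(sdist_name) or version != sdist_version:
        raise RuntimeError(
            "expected a %s-%s wheel, but the build produced %s"
            % (sdist_name, sdist_version, wheel_filename)
        )
```

With this in place, `check_built_wheel("foo", "1.0", "bar-7.4-py3-none-any.whl")` fails with a readable error instead of silently installing the wrong thing.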
On 12 October 2015 at 08:23, Nathaniel Smith <njs@pobox.com> wrote:
I mean the distribution name - there's no way to guarantee that building foo-1.0.zip won't spit out bar-7.4.whl, where by "no way" I mean "it's literally undecidable". I mean, if someone actually did this it would be super weird and we would all shun them, but our code and specs still need to be prepared for the possibility.
But the whole point of a spec is to *make* that guarantee. Pip doesn't expect to ever work with random code that can do literally anything, that's silly. Pip works with code that conforms to a spec.

At the moment, the spec is "it does what setuptools/distutils does", which is a lousy spec because (a) nobody else can implement the spec without reading the huge mess that is distutils/setuptools, (b) it's too broad (we don't *really* need every corner case that distutils covers) and (c) distutils/setuptools don't satisfy the needs of a number of communities, notably the scientific community.

So we want to define a more usable spec for the future. Pip will always have to deal with backward compatibility, and all of the hacks that implies, but we can at least say "if a project declares that it follows the new spec, we can rely on certain things". One of those things we can rely on is that building foo-1.0.zip won't spit out bar-7.4.whl - that would be part of the contract of that spec.

The debate is over what, on the one hand, we want to be able to rely on when writing packaging tools, vs what flexibility we want to have when writing build tools. But it's not about ignoring the other use cases, it's about agreeing a *contract* that satisfies the needs of *both* packaging tools and build tools.

At the moment, the dial is 100% over to the packaging side - builders have zero flexibility, the rule is "use distutils and setuptools". That doesn't feel like zero flexibility, because distutils/setuptools let you do a lot of things. But when you do finally hit the limits you're stopped cold. It also is a bad choice for packaging tools, because distutils and setuptools are a dreadful API for automation. So nobody wins.

Two suggestions that have been around for a long while that give more flexibility to build tools while at the same time giving packaging tools a better interface to work with are:

1. Declarative setup - you have to put your metadata in file X in format Y, and it's static. But when you do, your build tool can do whatever it wants as long as it spits out binaries that conform to your declared metadata. The downside to this is that it's totally static, and developers don't like that ("what if I want to generate my version number from my VCS tags?")

2. A fixed API - we document a set of commands that packaging tools can use and build tools have to provide. This is where we talk about documenting the distutils commands that pip relies on ("setup.py egg_info", "setup.py install", "setup.py develop"...) This one usually falls down because nobody likes the idea of writing a "dummy" setup.py that translates the interface, and because no-one has done anything more with this idea than suggest documenting the setuptools commands that pip *currently* uses, even though these are probably a lousy API design. It also does very little for packaging tools *other* than pip (for example, it does nothing for PyPI, which cannot accept an API that requires running user-supplied code).

The key drivers here have been about defining something that packaging tools can use effectively, and build tools can cope with being required to provide, while not constraining *either* type of tool beyond the minimum needed for the contract.

Paul
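To show the shape of the "fixed API" suggestion, here is a deliberately tiny and entirely hypothetical sketch. `BuildTool`, `get_metadata`, `build_wheel` and `build_via_contract` are invented names for illustration, not an existing or proposed interface. The idea is only that the packaging tool relies on the contract alone and checks that the build tool honoured it.

```python
import abc


class BuildTool(abc.ABC):
    """Hypothetical contract a build tool would promise to packaging tools."""

    @abc.abstractmethod
    def get_metadata(self):
        """Return a dict with at least 'name' and 'version'."""

    @abc.abstractmethod
    def build_wheel(self, output_dir):
        """Build a wheel into output_dir and return its (name, version)."""


def build_via_contract(tool, output_dir):
    """Packaging-tool side: use only the contract, and verify it was honoured."""
    declared = tool.get_metadata()
    expected = (declared["name"], declared["version"])
    built = tool.build_wheel(output_dir)
    if built != expected:
        raise RuntimeError(
            "build tool broke the contract: declared %r, built %r" % (expected, built)
        )
    return built
```

How the build tool produces the wheel is its own business, which is the flexibility side of the dial. The drawback noted above still applies: calling `get_metadata()` means running project-supplied code, so an interface like this helps pip but not PyPI.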
participants (3)
- Nathaniel Smith
- Paul Moore
- Robert Collins