[Distutils] Sources of truth

12 Oct 2015

Hi all,

Again trying to split out some more focused discussion from the big
thread about sdists...

One big theme there has been the problem of "sources of truth": e.g.
in current sdists, there is a PKG-INFO file that has lots of static
metadata in it, but because the "real" version of that metadata is in
setup.py, everyone ignores PKG-INFO.

A clear desideratum for a new sdist format is that we avoid this
problem, by having static metadata that is actually trustworthy. I see
two fundamentally different strategies that we might use to accomplish
this. In time-honored mailing list tradition, these are of course the
one that I hear other people advocating and the one that I like ;-).

The first strategy is: sdists and the wheels they generate logically
share the same metadata; so, we need some mechanism to enforce that
whatever static metadata is in the sdist will match the metadata in
the resulting wheel. (The wheel might potentially have additional
metadata beyond what is in the sdist, but anything that overlaps has
to match.) An open question is what this mechanism will look like --
if everyone used distutils/setuptools, then we could write the code in
distutils/setuptools so that when it generated wheel metadata, it
always copied it directly out of the sdist metadata (when present).
But not everyone will use distutils/setuptools, because distutils
delenda est. So we need some mechanism to statically analyze an
arbitrary build system and prove things about the data it outputs.
Which sounds... undecidable. Or we could have some kind of
after-the-fact enforcement mechanism, where tools like pip are
required -- as the last step when building a wheel from an sdist -- to
double-check that all the metadata matches, and if it doesn't then
they produce a hard error and refuse to continue. But even this
wouldn't necessarily guarantee that PyPI can trust the metadata, since
PyPI is not going to run this enforcement mechanism...
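
To make the after-the-fact check concrete, here is a rough sketch of
what such a comparison could look like. It relies only on the fact that
PKG-INFO and wheel METADATA files are RFC 822-style headers; the
function names, the exception class, and the choice to skip
Metadata-Version are my own assumptions for illustration, not anything
pip actually implements.

```python
from email.parser import Parser

# Assumption: this field describes the metadata file format itself, so it
# is allowed to differ between the sdist and the wheel.
IGNORED_FIELDS = {"metadata-version"}


class MetadataMismatch(Exception):
    """Hypothetical hard error raised when sdist and wheel metadata disagree."""


def parse_metadata(text):
    """Parse 'Key: value' metadata into a dict of lowercase key -> list of values."""
    msg = Parser().parsestr(text)
    fields = {}
    for key, value in msg.items():
        fields.setdefault(key.lower(), []).append(value)
    return fields


def check_wheel_matches_sdist(sdist_text, wheel_text):
    """Raise MetadataMismatch if any field present in both files differs."""
    sdist = parse_metadata(sdist_text)
    wheel = parse_metadata(wheel_text)
    # The wheel may carry extra fields; only the overlap has to match.
    overlap = (sdist.keys() & wheel.keys()) - IGNORED_FIELDS
    mismatched = sorted(k for k in overlap if sorted(sdist[k]) != sorted(wheel[k]))
    if mismatched:
        raise MetadataMismatch("fields differ: " + ", ".join(mismatched))
```

Note that a check like this only runs wherever the wheel gets built,
which is exactly why it doesn't help PyPI.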

The second strategy is: put static metadata in both sdists and wheels,
but treat them as logically distinct things. The static metadata in an
sdist is the source of truth for information *about that sdist*
(sdist name, sdist version, sdist description, sdist authors, etc.),
and the static metadata in a wheel is the source of truth for
information about that wheel, and we don't pretend that we can
statically guarantee that the two will match. I mean, in practice,
they basically always will match. But IMO
making this distinction in our minds leads to clearer thinking. When
PyPI needs to know the name/version/description for an sdist, it can
still do that; and since we've lowered our ambitions to only finding
the sdist name instead of the wheel name, it can actually do it
reliably in a totally static way, without having to run arbitrary code
to validate this. OTOH pip will always have to be prepared to handle
the possibility of mismatch between what it was expecting based on the
sdist metadata and what it actually got after building it, so we might
as well acknowledge that in our mental model.
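
As an illustration of "totally static", here is a minimal sketch of how
an index could read the sdist name and version without executing
anything from the package. It assumes the sdist is a tarball with a
single top-level directory containing PKG-INFO, as current setuptools
sdists have; the function name is made up for this example.

```python
import tarfile
from email.parser import Parser


def read_sdist_identity(fileobj):
    """Return (name, version) from the top-level PKG-INFO of an sdist tarball.

    Nothing from the package is imported or executed; the file is only
    unpacked in memory and parsed as RFC 822-style headers.
    """
    with tarfile.open(fileobj=fileobj, mode="r:*") as tar:
        for member in tar.getmembers():
            parts = member.name.split("/")
            # Assumption: PKG-INFO lives directly under the one top-level directory.
            if member.isfile() and len(parts) == 2 and parts[1] == "PKG-INFO":
                raw = tar.extractfile(member).read().decode("utf-8")
                msg = Parser().parsestr(raw)
                return msg["Name"], msg["Version"]
    raise ValueError("no top-level PKG-INFO found in sdist")
```

What this returns is a statement about the sdist only; it says nothing
guaranteed about any wheel built from it.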

One potential advantage of this approach is that we might be able to
talk ourselves into trusting the existing PKG-INFO as providing static
metadata about the sdist, and thus PyPI at least could start trusting
it for things like the "description" field, and if we define a new
sdist format then it would be possible to generate its static metadata
from current setup.py files (e.g. by modifying setuptools's sdist
command). Contrast this with the other approach, where getting any
kind of static source-of-truth would require rewriting almost all
existing setup.py files.

The challenge, of course, is that there are a few places where pip
actually does need to know something about wheels based on examining
an sdist -- in particular name and version and (controversially)
dependencies. But this can/should be addressed explicitly, e.g. by
writing down a special rule about the name and version fields.
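
One possible shape for such a special rule, purely as a sketch: treat
name and version as the only fields that must agree between the sdist
and the wheel built from it, and merely report differences in anything
else. The function names and the dict-based inputs are hypothetical,
and comparing names after PEP 503-style normalization is my assumption
about what "match" would mean.

```python
import re

# The two fields the text singles out; everything else is advisory.
STRICT_FIELDS = ("name", "version")


def normalize_name(name):
    """PEP 503-style normalization: runs of '-', '_', '.' become '-', lowercased."""
    return re.sub(r"[-_.]+", "-", name).lower()


def reconcile(sdist_meta, wheel_meta):
    """Compare two dicts of lowercase field name -> string value.

    Raises ValueError if name or version disagree. Returns a sorted list
    of other shared fields whose values differ, for the caller to log.
    """
    if normalize_name(sdist_meta["name"]) != normalize_name(wheel_meta["name"]):
        raise ValueError("wheel name does not match sdist name")
    # Assumption: versions are compared as plain strings in this sketch.
    if sdist_meta["version"] != wheel_meta["version"]:
        raise ValueError("wheel version does not match sdist version")
    shared = sdist_meta.keys() & wheel_meta.keys()
    return sorted(
        k for k in shared
        if k not in STRICT_FIELDS and sdist_meta[k] != wheel_meta[k]
    )
```

Dependencies are deliberately left out of the strict set here, since
whether they belong there is the controversial part.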

-n

-- 
Nathaniel J. Smith -- http://vorpus.org
