[Distutils] Towards a simple and standard sdist format that isn't intertwined with distutils

Paul Moore p.f.moore at gmail.com
Sun Oct 4 22:02:44 CEST 2015


Let me see if I can help clarify, so it's not just Donald who says so :-)

It does feel as if we're trying to explain a lot of things that
"everybody knows". Clearly not everybody knows, as you don't, but what
we're trying to clarify here is the de facto realities of how sdists
work, and how people expect them to work. Unfortunately, there's an
awful lot of things in the packaging ecosystem that are defined by
existing practice, and traditionally haven't been formally documented.
I'm sure it feels as if we're just repeatedly saying "it has to be
like that" - but in truth, it's more that what we're saying is the
definition of a sdist, as established by existing practice. I wish we
could point you to a formal definition of the requirements, but
unfortunately they've never been written down. With luck, one of the
outcomes here will be that someone will record what a sdist is - but
we need to reflect current reality, and not end up reusing the term
"sdist" to mean something different from what people currently use it
for.

On 4 October 2015 at 19:22, Nathaniel Smith <njs at pobox.com> wrote:
> "a sdist is fundamentally different from a VCS checkout",

Specifically, a sdist is built by the packaging tools - at the moment,
by "setup.py sdist", but in future by whatever tool(s) may replace
distutils/setuptools. So a sdist has a defined format, and we can
mandate certain things about it. In particular, we can require files
to be present which are in tool-friendly formats, because the tools
will build them. On the other hand, a VCS checkout is fundamentally
built by a human, for use by humans. File formats need to be
human-editable, and we have to be prepared to work with constraints
imposed by workflows and processes *other* than Python packaging
tools. So we have much less ability to dictate the format.

Your proposal mandates a single directory "owned" by the packaging
ecosystem, which follows the git/hg/subversion model, so it's
lightweight and low-risk. But you still can't realistically ask the
user to maintain package data in (for example) a JSON file in that
directory.

> "there must be a 1-1 mapping between sdists and wheels",

The fundamental reason is one I know I've mentioned here before - pip
implements "pip install <sdist>" by first building a wheel and then
installing it. If a sdist generates two wheels, how will pip know
which one to install? Also, users expect "pip wheel <sdist>" to
produce the wheel corresponding to the sdist. You're proposing to
change that expectation - the onus is on you to justify that change.
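
To make the ambiguity concrete, here is a minimal sketch of the
"build a wheel, then install it" flow. It is an illustration only, not
pip's actual code; the function name and the wheel filenames are
hypothetical.

```python
def wheel_to_install(built_wheels):
    """Pick the wheel to install after building a sdist.

    Hypothetical helper: with a one-to-one mapping the choice is trivial;
    with several wheels there is no obvious rule for choosing.
    """
    if len(built_wheels) != 1:
        raise ValueError(
            "expected exactly one wheel from the sdist, got %d" % len(built_wheels)
        )
    return built_wheels[0]


# One sdist, one wheel: the installer knows what to do.
chosen = wheel_to_install(["example_project-1.0-py3-none-any.whl"])
```

If the list held two wheels, the tool would need some extra rule to
pick between them, and that rule is what a proposal would have to
specify.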

You need to consider backward compatibility in the wider sense here
too - right now, there *is* a one-to-one mapping between a sdist and a
wheel. If you want to change that, you need to justify it; it's not
enough just to claim that no-one has come up with a persuasive
argument to keep things as they are. Change is not a bad thing, and
"because we've always done it that way" is not a good argument, but
change needs to be justified even so.

> "pip needs sdists that have full wheel metadata in static form"

I assume here you're now OK with the distinction between a sdist and a
VCS checkout? If you still think we're saying that pip needs static
metadata in *VCS checkouts* then please review the comments already
made about the difference between a sdist and a VCS checkout.

But basically, a sdist is a tool-generated archive that captures the
state of the project and allows for *reproducible* builds of that
project. If your understanding of what a sdist is differs from this,
we need to stop and agree on terminology before going any further. I
will concede that https://packaging.python.org/en/latest/glossary/
doesn't mention the point that a sdist needs to provide reproducible
builds. But that's certainly how sdists are used at present, and how
people expect them to work. Certainly, if I lost the wheel I'd built
from a sdist, I'd expect to just rebuild it from the sdist and get the
same wheel.

Pip needs metadata to do dependency resolution. This includes project
name, version, and dependency information. We could debate about
whether *full* metadata is needed, but I'm not sure what the point is.
Once you are recording the stuff that pip needs, why *not* record
everything? There are other tools (and ad-hoc scripts) that would
benefit from having the full metadata, so why would you make it harder
for them to work? You claim that you want to keep your options open -
but to me, it's more important to leave the *user's* options open. If
we don't provide certain values, a user who needs that data has to
propose a change to the format, wait for it to be implemented, and
even then they can't rely on it until all projects move to the new
format. Better to just require everything from the start, then users
can get at whatever they need.

As far as why the metadata should be static, the current sdist format
does actually include static metadata, in the PKG-INFO file. So again
we have a case where it's up to you to justify the backward
compatibility break. But it's a little less clear-cut here, because
you are proposing a new sdist format, so you've already argued for a
break with the old format. Also the old format is not typically
introspected, it's just used to unpack and run setup.py. So you can
reasonably argue that the current state of affairs is irrelevant.
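
For reference, PKG-INFO uses an email-header style "Key: value" layout,
so a tool can read it without running any project code. A minimal
sketch, assuming made-up file contents (the project name and values
below are hypothetical):

```python
from email.parser import Parser

# Hypothetical PKG-INFO contents, for illustration only.
PKG_INFO = """\
Metadata-Version: 1.1
Name: example-project
Version: 1.0
Summary: An example project
"""


def read_static_metadata(text):
    """Parse header-style metadata without executing any project code."""
    msg = Parser().parsestr(text)
    return {"name": msg["Name"], "version": msg["Version"]}


metadata = read_static_metadata(PKG_INFO)
```

The tool only parses text here; nothing from the project is imported or
executed.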

However, we're talking here about whether the metadata should be
statically available, or dynamically generated. The key point here is
that dynamic metadata requires the tool (pip, my one-off script,
whatever) to *run arbitrary code* in order to get the metadata. OK,
with signing we can ensure that it's *trusted* code, but it still
could do anything the project author wanted, and we can make no
assumptions about what it does. That makes a tool's job much harder. A
common bug report for pip is users finding that their installs fail,
because setup.py requires numpy to be installed in order to run, and
yet pip is running setup.py egg-info precisely to find out what the
requirements are. We tell the user that the setup.py is written
incorrectly, and they should install numpy and retry the install, but
it's not a good user experience. And from a selfish point of view,
users blame *pip* for the consequences of a project whose code to
generate the metadata is buggy. Those bug reports are a drain on the
time of the pip developers, as well as a frustrating experience for
the users.

If you want to argue that a VCS checkout, or development directory,
needs to generate metadata dynamically, I won't argue. That's fine.
But the sdist is a tool-generated snapshot of a *specific* release of
a project (maybe "the release I made at 1:15 today for my test build",
but still a specific build) and it should be perfectly possible to
capture the dynamically generated metadata from the VCS checkout and
store it in the sdist when it is built.
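
As a rough sketch of what I mean by capturing the metadata at sdist
build time: the function names, field names and values below are
assumptions for illustration, not an existing tool's API.

```python
def compute_metadata():
    """Stand-in for whatever dynamic logic a checkout uses (hypothetical)."""
    return {
        "Name": "example-project",
        "Version": "1.0.dev1",
        "Requires-Dist": ["numpy"],
    }


def freeze_metadata(metadata):
    """Serialise metadata as static 'Key: value' lines to store in the sdist."""
    lines = []
    for key, value in metadata.items():
        # A list-valued field is written as one line per entry.
        values = value if isinstance(value, list) else [value]
        for item in values:
            lines.append("%s: %s" % (key, item))
    return "\n".join(lines) + "\n"


# Run once by the build tool when the sdist is created; consumers of the
# sdist then read the frozen text instead of re-running the dynamic code.
frozen = freeze_metadata(compute_metadata())
```

The dynamic step runs on the author's machine when the sdist is made,
so anyone consuming the sdist only ever sees the static result.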

If you feel that there is metadata that cannot be stored statically in
the sdist, could you please give a specific example? But do remember
that a sdist is intended as a *snapshot* of a VCS checkout that can be
used to reproducibly build the project - so "the version number needs
to include the time of the build" isn't a valid example.

Paul

