[Distutils] Towards a simple and standard sdist format that isn't intertwined with distutils

Donald Stufft donald at stufft.io
Sun Oct 4 22:45:05 CEST 2015

On October 4, 2015 at 2:22:51 PM, Nathaniel Smith (njs at pobox.com) wrote:
> I guess to make progress in this conversation I need some more 
> detailed explanations. I totally get that there's a long history 
> of thought and conversations behind the various assertions 
> here like "a sdist is fundamentally different from a VCS checkout", 
> "there must be a 1-1 mapping between sdists and wheels", "pip 
> needs sdists that have full wheel metadata in static form", and 
> I'm barging in from the outside with no context, but I literally 
> have no idea why the specific design features you're asking for 
> are desirable or even viable. Right now if I were to try and write 
> the PEP you're asking for, then the rationale section would just 
> be "because Donald said so" over and over :-). I couldn't write 
> the motivation section, because I don't know any problems that 
> the PEP you're describing would fix for me as a package author 
> (which doesn't mean they don't exist, but!).

I don't mind going into more details! I'll do the things you specifically
mentioned and then if there are other things, feel free to bring them up too. I
should also mention that these are my opinions from my experiences with the
toolchain and ecosystem, others may agree or disagree with me. I have strong
opinions, but that doesn't make them immutable laws of the universe, although
"because Donald said so" sounds like a pretty good answer to me ;)

"a sdist is fundamentally different from a VCS checkout"

This one I have a hard time trying to explain. They are focused on different
things. With an sdist you need to have a project name, a version, a list of
files, things like that. The use cases and needs for each "phase" are
different. For instance, in a VCS checkout you can derive the list of files or
the version by asking the VCS but a sdist doesn't have a VCS so it has to have
that baked into it. A more C-centric example is that you often have
something like autogen.sh in a C project's VCS, but you don't have the output
of that checked into the VCS, however when you prepare a tarball for
distribution you run autogen.sh and then include the output there.

There are other differences too. In a VCS we don't really need the ability to
statically read any metadata except for build dependencies and how to invoke
the build tool. Most everything else can be dynamically configured because
you're not distributing that. However in a sdist, we need as much of the
metadata to be static as possible. Something like PyPI needs to be able to
inspect any of the files uploaded to it (sdist, wheels, etc) for certain
information and anything that can't be statically and safely read from it might
as well not even exist as far as PyPI is concerned.
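
As a rough illustration of what "statically and safely read" can mean, here is
a minimal sketch that parses key/value metadata without executing any project
code. The PKG_INFO text and the read_static_metadata helper are hypothetical
and made up for this example; this is not how PyPI is actually implemented.

```python
from email.parser import Parser

# Hypothetical PKG-INFO style contents, as might ship inside an sdist.
PKG_INFO = """\
Metadata-Version: 1.1
Name: foo
Version: 1.0
Summary: An example project
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
"""


def read_static_metadata(text):
    """Parse RFC 822 style metadata; no setup.py is run to get it."""
    msg = Parser().parsestr(text)
    return {
        "name": msg["Name"],
        "version": msg["Version"],
        "summary": msg["Summary"],
        # A field may repeat, so collect every occurrence.
        "classifiers": msg.get_all("Classifier", []),
    }


metadata = read_static_metadata(PKG_INFO)
print(metadata["name"], metadata["version"])
```

Anything that only exists as the result of running code at build time is
invisible to a reader like this one.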

We currently have the situation where we have a single file that is used for
all phases of the process, dev (``setup.py develop`` & ``setup.py sdist``),
building of a wheel (``setup.py bdist_wheel``) and even installation sometimes
(``setup.py install``). Throughout this there are a lot of common problems
where some author tried to optimize their ``setup.py`` for their development
use cases and broke it for the other cases. An example of this is version
handling, where it's not unusual for someone's first foray into deduplicating
the version to involve importing their package (which works fine on their
machine) and passing its version into the setup kwargs. This simple thing would
generally work just fine if the output of ``setup.py sdist`` produced static
metadata and ``setup.py`` was no longer used past that point.
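
To make the version example concrete, here is a small sketch of one common
workaround today: pulling the version out of the source text instead of
importing the package. INIT_SOURCE and find_version are hypothetical names for
this illustration; a real ``setup.py`` would read the file from disk.

```python
import re

# Hypothetical contents of foo/__init__.py. Importing this module would fail
# on any machine where the third-party dependency is not installed yet.
INIT_SOURCE = '''
import requests  # not guaranteed to be present at build time

__version__ = "1.0"
'''


def find_version(source):
    """Extract __version__ by pattern matching, without importing anything."""
    match = re.search(
        r'^__version__\s*=\s*["\']([^"\']+)["\']', source, re.MULTILINE
    )
    if match is None:
        raise RuntimeError("Unable to find a __version__ string")
    return match.group(1)


version = find_version(INIT_SOURCE)
print(version)
```

With static metadata in the sdist, even the naive import approach would only
ever need to work on the author's machine.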

This also becomes expressed in what interfaces you give to the toolchain at
each "phase". It's important for something inside of a VCS checkout to be able
to be written by human beings. This leads to wanting to use formats like INI
(which is ugly) or something like TOML or YAML or some other nice, human
friendly format. These formats are great for humans to write and for humans to
read but are not particularly great as data interchange formats. Formats like
JSON, msgpack, etc. are far better for data interchange, for computers to talk
to other computers, but are not great for humans to write, edit, or even really
read in many cases. If we go back to distutils2, you can see this effect
happening there: they had two similar keys in their setup.cfg, description and
description-file. These both did the same thing, but just pulled from
different sources (inline or via a file), forcing every tool in the chain to
support both of these options even though it could have easily made an sdist
that was distinct from the VCS code and simplified code there.
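
A minimal sketch of that idea, assuming a ``[metadata]`` section with the two
keys mentioned above: the human-edited file is resolved once, when the sdist is
built, into a single static description that downstream tools can read as
JSON. The function name, section layout, and output fields are assumptions for
illustration and do not describe any real tool.

```python
import configparser
import json
import os
import tempfile


def build_static_metadata(project_dir):
    """Resolve a human-edited setup.cfg into one static metadata dict."""
    config = configparser.ConfigParser()
    config.read(os.path.join(project_dir, "setup.cfg"))
    section = config["metadata"]
    if "description-file" in section:
        # Indirect form: the description lives in a separate file.
        path = os.path.join(project_dir, section["description-file"])
        with open(path) as f:
            description = f.read()
    else:
        # Inline form: the description is written directly in setup.cfg.
        description = section.get("description", fallback="")
    return {
        "name": section["name"],
        "version": section["version"],
        "description": description,
    }


# Demo against a throwaway project directory.
with tempfile.TemporaryDirectory() as project_dir:
    with open(os.path.join(project_dir, "README.txt"), "w") as f:
        f.write("A long description kept in its own file.\n")
    with open(os.path.join(project_dir, "setup.cfg"), "w") as f:
        f.write(
            "[metadata]\nname = foo\nversion = 1.0\n"
            "description-file = README.txt\n"
        )
    static = build_static_metadata(project_dir)

# Only this resolved form would need to be understood by later tools.
static_json = json.dumps(static, sort_keys=True)
print(static_json)
```

Only the tool that builds the sdist has to know about both spellings; everything
after it sees one field.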

I see the blurring of lines between the various phases of a package as one of the
fundamental flaws of distutils and setuptools.

"there must be a 1-1 mapping between sdists and wheels"

This has technical and social reasons.

On the technical side, the 1-1 mapping between sdists and wheels (and all other
bdists) is an assumption baked into all of the tools. From PyPI's enforcement
mechanisms, to pip's caching, to things like devpi and the like, breaking this
assumption will break a lot of code. This is all code and code is not immutable,
so we could of course change that; however, we wouldn't be able to rely on the
fact that we've fixed that assumption for many years (probably at least 5 at
the earliest, 10+ is more likely).

The social side is a bit more interesting though. In Debian, end users almost
*never* actually interact with source packages; nearly 100% of the time they
are interacting solely with built packages (in fact, unlike Python, you
have to manually build a deb before you can even attempt to install something).
There really aren't "source packages" in Debian, just sources that happen to
produce a Debian package. In Python land, a source package is still a package
and people have expectations around that. I think people would be very confused
if a sdist "foo-1.0.tar.gz" could produce a wheel "bar-3.0.whl".

In addition, systems like Debian don't really try to protect against a
malicious DD at all. Things like "prevent foo from claiming to be bar" are
enforced via societal conventions and the fact that it is not an open repo and
there are gatekeepers keeping everything in place. On the flip side, we let
anyone upload to PyPI and rely on things like ACLs to secure things. This means
that we need to know ahead of time what names a package is going to produce.
The simplest mechanism for this is to enforce a 1:1 mapping between sdist and
wheel because that is an immutable property and easy to understand. I could
possibly envision something that relaxed this, but it would require a project
to explicitly declare up front what names it will produce, and require
registering those names with PyPI before you could upload a sdist that could
produce those named wheels.
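
To show how cheap the 1:1 check is, here is a minimal sketch that compares the
name and version encoded in an sdist filename with those in a wheel filename.
The helper names are hypothetical, the sdist is assumed to be named
"<name>-<version>.tar.gz", and this is not the actual check that PyPI or pip
performs.

```python
import re


def normalize(name):
    """Lowercase a project name and collapse runs of '-', '_' and '.'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def sdist_name_version(filename):
    # Assumes "<name>-<version>.tar.gz"; the version follows the last hyphen.
    stem = filename[: -len(".tar.gz")]
    name, _, version = stem.rpartition("-")
    return normalize(name), version


def wheel_name_version(filename):
    # Wheel filenames start with "<name>-<version>-", followed by tags.
    parts = filename[: -len(".whl")].split("-")
    return normalize(parts[0]), parts[1]


def wheel_matches_sdist(sdist, wheel):
    """True when the wheel claims the same (name, version) as the sdist."""
    return sdist_name_version(sdist) == wheel_name_version(wheel)


print(wheel_matches_sdist("foo-1.0.tar.gz", "foo-1.0-py3-none-any.whl"))
print(wheel_matches_sdist("foo-1.0.tar.gz", "bar-3.0-py3-none-any.whl"))
```

A scheme with declared extra names would replace the equality test with a
lookup against whatever names the project had registered up front.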

Ultimately, I don't think the very minor benefits are worth the additional
complexity and pain of trying to adapt all of the tooling and human
expectations to this.

"pip needs sdists that have full wheel metadata in static form"

I think I could come around to the idea that some metadata doesn't make sense
for a sdist, and that it really needs to be a part of wheels but not a part
of sdist. I think that the argument needs to be made in the other direction
though: we should assume that all metadata will be included as part of the
sdist and then make an argument for why each particular piece of metadata
is wheel-specific rather than specific to a particular version of a project.

Things like name, version, description, classifiers, etc. are easily
classified as specific to a particular (name, version) tuple. Other things
like "Python ABI" are easily classified as specific to a particular wheel.
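
One way to picture that default is the sketch below, where every field is
treated as release-level unless it has been explicitly argued into the
wheel-only set. The field names and the split_metadata helper are hypothetical
and only mirror the examples given above.

```python
# Fields argued to be specific to one built wheel (illustrative, not a spec).
WHEEL_ONLY_FIELDS = {"python_abi"}


def split_metadata(metadata):
    """Split metadata into (release-level, wheel-level) dicts.

    Anything not explicitly listed as wheel-only stays with the release,
    meaning it belongs in the sdist.
    """
    release = {k: v for k, v in metadata.items() if k not in WHEEL_ONLY_FIELDS}
    wheel = {k: v for k, v in metadata.items() if k in WHEEL_ONLY_FIELDS}
    return release, wheel


release, wheel = split_metadata(
    {
        "name": "foo",
        "version": "1.0",
        "description": "An example project",
        "classifiers": ["Programming Language :: Python :: 3"],
        "python_abi": "cp35m",
    }
)
print(sorted(release), sorted(wheel))
```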

Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
