[Distutils] Towards a simple and standard sdist format that isn't intertwined with distutils

Nathaniel Smith njs at pobox.com
Fri Oct 2 22:19:56 CEST 2015


On Fri, Oct 2, 2015 at 4:58 AM, Donald Stufft <donald at stufft.io> wrote:
> On October 2, 2015 at 12:54:03 AM, Nathaniel Smith (njs at pobox.com) wrote:
>> We realized that actually, as far as we could tell, it wouldn't be that
>> hard at this point to clean up how sdists work so that it would be
>> possible to migrate away from distutils. So we wrote up a little draft
>> proposal.
>>
>> The main question is, does this approach seem sound?
>
> I've just read over your proposal, but I've also just woken up so I might be
> a little slow still! After reading what you have, I don't think that this
> proposal is the right way to go about improving sdists.
>
> The first thing that immediately stood out to me is that it recommends
> that downstream redistributors like Debian, Fedora, etc. utilize Wheels instead
> of the sdist to build their packages from. However, that is not really going to
> fly with most (all?) of the downstream redistributors. Debian, for instance, has
> policy that requires all of its packages to be built from source, not
> from anything else, and Wheels are not a source package. While it can
> theoretically work for pure Python packages, it quickly devolves into a mess
> when you factor in packages that have any C code whatsoever.

I think this was addressed downthread -- the idea would be that Debian
would build from sdist, with a two step process: convert sdist to
wheels, repack wheels into binary .deb.

> Overall, this feels more like a sidegrade than an upgrade. One major theme
> throughout the PEP is that we're going to push to rely heavily on wheels as
> the primary format of installation. While that works well for things like
> Debian, I don't think it's going to work as well for us. If we were only
> distributing pure Python packages, then yes, absolutely; however, given that we
> are not, we have to worry about ABI issues. Given that there are so many
> different environments that a particular package might be installed into, all
> with different ABIs, we have to assume that installing from source is still
> going to be a primary path for end users to install and that we are never going
> to have a world where we can assume a Wheel in a repository.
>
> One of the problems with the current system, is that we have no mechanism by
> which to determine dependencies of a source distribution without downloading
> the file and executing some potentially untrusted code. This makes dependency
> resolution harder and much much slower than if we could read that information
> statically from a source distribution. This PEP doesn't offer anything in the
> way of solving this problem.

What are the "dependencies of a source distribution"? Do you mean the
runtime dependencies of the wheels that will be built from a source
distribution?

If you need that metadata to be statically in the sdist, then you
might as well give up now because it's simply impossible.

As the very simplest example, every package that uses the numpy C API
gets a runtime dependency on "numpy >= [whatever version happened to
be installed on the *build* machine]". There are plenty of more
complex examples too (e.g. ones that involve build/configure-time
decisions about whether to rely on particular system libraries, or
build/configure-time decisions about whether particular packages
should even be built).
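
To make the numpy case concrete, here's a hypothetical build-backend
fragment (not numpy's actual code; the function name is mine) showing
why this requirement can only be computed on the build machine:

```python
# Hypothetical sketch: a wheel compiled against numpy's C API must
# require at least the numpy version installed on the *build* machine,
# so the runtime requirement cannot be written statically into the
# sdist ahead of time.
def runtime_requirements(built_against_version):
    # built_against_version would come from the numpy importable at
    # build time (e.g. numpy.__version__); two builds of the same
    # sdist can legitimately yield wheels with different requirements.
    return ["numpy >= %s" % built_against_version]
```

So a build on a machine with numpy 1.9.2 would emit "numpy >= 1.9.2",
while the same sdist built elsewhere could emit something different.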

For comparison, here's the Debian source package metadata:
    https://www.debian.org/doc/debian-policy/ch-controlfields.html#s-debiansourcecontrolfiles
Note that the only mandatory fields are format version / package name
/ package version / maintainer / checksums. The closest they come to
making promises about the built packages are the Package-List and
Binary fields, which provide an optional hint about what binary packages
will be built, and are allowed to contain lies (e.g. they explicitly
don't guarantee that all the binary packages named will actually be
produced on every architecture). The only kind of dependencies that a
source package can declare are build-depends.
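
To make the shape of that metadata concrete, here's a hypothetical,
abridged .dsc-style stanza (the field names are real; the package,
maintainer, and checksum values are invented, and Binary is just the
non-binding hint described above):

```
Format: 3.0 (quilt)
Source: example
Binary: example, python3-example
Version: 1.0-1
Maintainer: Jane Doe <jane at example.org>
Build-Depends: debhelper (>= 9), python3-all-dev
Checksums-Sha256:
 0123456789abcdef... 4242 example_1.0-1.debian.tar.xz
```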

> To a similar tune, this PEP also doesn't make it possible to really get at
> any other metadata without executing software. This makes it practically
> impossible to safely inspect an unknown or untrusted package to determine what
> it is and to get information about it. Right now PyPI relies on the uploading
> tool to send that information alongside the file it is uploading, but
> honestly what it should be doing is extracting that information from within the
> file. This is sort of possible right now since distutils and setuptools both
> create a static metadata file within the source distribution, but we don't rely
> on that within PyPI because that information may or may not be accurate and may
> or may not exist. However, the twine uploading tool *does* rely on that, and
> this PEP would break the ability for twine to upload a package without
> executing arbitrary code.

Okay, what metadata do you need? We certainly could put name / version
kind of stuff in there. We left it out because we weren't sure what
was necessary and it's easy to add later, but anything that's needed
by twine fits neatly into the existing text saying that we should
"include extra metadata in source distributions if it helps solve
specific problems that are unique to distribution" -- twine uploads
definitely count.
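
For what it's worth, the name/version case is mechanically extractable
today whenever that static file exists -- here's a sketch (assuming the
conventional <name>-<version>/PKG-INFO layout that distutils and
setuptools write; the function name is mine):

```python
import tarfile
from email.parser import Parser

def read_sdist_metadata(sdist_path):
    # Read the static PKG-INFO (RFC 822 style "Key: value" headers)
    # out of an sdist tarball without executing any packaged code.
    with tarfile.open(sdist_path) as tar:
        for member in tar.getmembers():
            parts = member.name.split("/")
            if len(parts) == 2 and parts[1] == "PKG-INFO":
                raw = tar.extractfile(member).read().decode("utf-8")
                return Parser().parsestr(raw)
    raise ValueError("no PKG-INFO found in %s" % sdist_path)
```

The caveat from the quoted text stands, of course: this only helps when
the file exists and is accurate.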

> Overall, I don't think that this really solves most of the foundational
> problems with the current format. Largely it feels that what it achieves is
> shuffling around some logic (you need to create a hook that you reference from
> within a .cfg file instead of creating a setuptools extension or so) but

numpy.distutils is the biggest distutils/setuptools extension around,
and everyone involved in maintaining it wants to kill it with fire
:-). That's a problem...

> without fixing most of the problems. The largest benefit I see to switching to
> this right now is that it would enable us to have build time dependencies that
> were controlled by pip rather than installed implicitly via the execution of
> the setup.py.

Yes, this problem means that literally every numerical python package
currently has a broken setup.py.

> That doesn't feel like a big enough benefit to me to do a mass
> shakeup of what we recommend and tell people to do. Having people adjust and
> change and do something new requires effort, and we need something to justify
> that effort to other people and I don't think that this PEP has something we
> can really use to justify that effort.

The end-user adjustment is teaching people to switch to always using
pip to install packages -- this seems like something we will certainly
do sooner or later, so we might as well get started.

And it's already actually the right thing to do -- if you use
'setup.py install' then you get a timebomb in your venv where later
upgrades may leave you with a broken package :-(. (This is orthogonal
to the actual PEP.) In the long run, the idea that every package has
to contain code that knows how to implement installation in every
possible configuration (--user? --single-version-externally-managed?)
is clearly broken, and teaching people to use 'pip install' is
obviously the only sensible alternative.

-n

-- 
Nathaniel J. Smith -- http://vorpus.org
