[Distutils] Towards a simple and standard sdist format that isn't intertwined with distutils

Nathaniel Smith njs at pobox.com
Sat Oct 3 00:15:38 CEST 2015

On Fri, Oct 2, 2015 at 1:42 PM, Paul Moore <p.f.moore at gmail.com> wrote:
> On 2 October 2015 at 21:19, Nathaniel Smith <njs at pobox.com> wrote:
>>> One of the problems with the current system, is that we have no mechanism by
>>> which to determine dependencies of a source distribution without downloading
>>> the file and executing some potentially untrusted code. This makes dependency
>>> resolution harder and much much slower than if we could read that information
>>> statically from a source distribution. This PEP doesn't offer anything in the
>>> way of solving this problem.
>> What are the "dependencies of a source distribution"? Do you mean the
>> runtime dependencies of the wheels that will be built from a source
>> distribution?
>> If you need that metadata to be statically in the sdist, then you
>> might as well give up now because it's simply impossible.
>> As the very simplest example, every package that uses the numpy C API
>> gets a runtime dependency on "numpy >= [whatever version happened to
>> be installed on the *build* machine]". There are plenty of more
>> complex examples too (e.g. ones that involve build/configure-time
>> decisions about whether to rely on particular system libraries, or
>> build/configure-time decisions about whether particular packages
>> should even be built).
> I'm really not at all clear what you're saying here. It's quite
> possible that those of us who don't understand the complexities of the
> scientific/numpy world are missing something important, but if so it
> would be useful if you could spell out the problems in detail.
> From my point of view, it's not a source distribution or a binary
> distribution that depends on something (numpy or whatever) - it's the
> *project*. If project foo needs numpy to work, it depends on numpy. If
> it depends on features in numpy 1.9, it depends on numpy>=1.9.
> Optional dependencies are covered by extras, and environment specific
> dependencies are covered by environment markers.[1] That remains true
> for all wheels that are built from that project, for whatever platform
> using whatever tools. It should also be true for the source
> distribution, precisely *because* it's independent of the build
> process.

"Project" is a pretty messy concept. Obviously in simple cases there's
a one-to-one mapping between project <-> wheel <-> importable package,
but this breaks down quickly in edge cases.

Consider a project that builds multiple wheels out of the same source
tree. You obviously can't expect that all of these wheels will have
the same dependencies.

This situation is not common today for Python packages, but the only
reason for that is that distutils makes it really hard to do -- it's
extremely common in other package ecosystems, and the advantages are
obvious. E.g., maybe numpy.distutils should be split into a separately
installable package from numpy -- there's no technical reason that
this should mean we are now forced to move the code for it into its
own VCS repository.
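
To make the numpy.distutils example concrete, here's a sketch of a
single source tree that a build system could turn into two wheels with
different dependencies (the layout, file names, and wheel names are
hypothetical, not how numpy is actually arranged):

```
numpy/                        # one VCS repository, one sdist
├── build-numpy.cfg           # hypothetical per-wheel build config
├── build-numpy-distutils.cfg
├── numpy/                    # -> numpy wheel
│   └── ...
└── numpy_distutils/          # -> numpy-distutils wheel, depends on numpy
    └── ...
```

One sdist, two wheels, two different dependency sets -- so "the
dependencies of the sdist" isn't a well-defined single list.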

> I can understand that a binary wheel may need a certain set of
> libraries installed - but that's about the platform tags that are part
> of the wheel definition, not about dependencies. Platform tags are an
> ongoing discussion, and a good example of a partial solution that
> needs to be extended, certainly, but they aren't really relevant in
> any way that I can see to how the build chain works.

(I assume that by "platform tags" you mean what PEP 426 calls
"environment markers".)

Environment markers are really useful for extending the set of cases
that can be handled by a single architecture-dependent wheel. And
they're a good fit for that environment, given that wheels can't
contain arbitrary code.

But they're certainly never going to be adequate to provide a single
static description of every possible build configuration of every
possible project. And installing an sdist already requires arbitrary
code execution, so it doesn't make sense to try to build some
elaborate system to avoid arbitrary code execution just for the
dependency specification.
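
For contrast, here's roughly what environment markers *can* do: gate a
dependency on facts about the installation environment that are
knowable statically. A minimal stdlib-only sketch -- the real marker
grammar is defined by the metadata PEPs, and this toy evaluator handles
just two variables and two operators:

```python
import sys
import platform

# Facts an installer can determine about the target environment without
# running any project code -- this is what makes markers attractive.
environment = {
    "sys_platform": sys.platform,
    "python_version": platform.python_version(),
}

def marker_applies(variable, op, value, env=environment):
    """Toy evaluation of a single  variable op "value"  marker clause."""
    actual = env[variable]
    if op == "==":
        return actual == value
    if op == "!=":
        return actual != value
    raise ValueError("unsupported operator: %r" % op)

# e.g. a dependency spelled  pyblas; sys_platform == "darwin"
# applies only when installing on OS X:
print(marker_applies("sys_platform", "==", "darwin"))
```

The point is that every variable in that namespace is a property of the
*target* environment; a choice made by whoever ran the build simply
isn't something a marker can see.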

You're right that in a perfect future world numpy C API-related
dependencies would be handled by some separate ABI-tracking mechanism,
similar to how the CPython ABI is tracked, so here are some other
examples of why environment markers are inadequate:

In the future it will almost certainly be possible to build numpy in
two different configurations: one where it expects to find BLAS inside
a wheel distributed for this purpose (e.g. this is necessary to
provide high-quality Windows wheels), and one where it expects to find
BLAS installed on the system. This decision will *not* be tied to the
platform, but be selectable at build time. E.g., on OS X there is a
system-provided BLAS library, but it has some issues. So the default
wheels on PyPI will probably behave like the Windows ones and depend on a
BLAS-package that we control, but there will also be individual users
who prefer to build numpy in the configuration where it uses the
system BLAS, so we definitely need to support both options on OS X.
Now the problem: There will never be a single environment marker that
you can stick into a wheel or sdist that says "we depend on the
'pyblas' package if the system is OS X (ok) and the user set this flag
in this configuration file during the build process (wait wut)".
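
Sketched as wheel metadata, the two builds described above would end up
with *different* Requires-Dist lines even though they come from the
same sdist (the 'pyblas' package name is hypothetical, as in the text):

```
# Built with the bundled-BLAS configuration (the PyPI default):
Name: numpy
Requires-Dist: pyblas

# Built by a user against the system BLAS -- same sdist, same platform:
Name: numpy
# (no pyblas requirement; the system library is found at runtime)
```

No single static stanza in the sdist can stand in for both.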

Similarly, I think someone was saying in a recent discussion that
lxml supports being built either in a mode where it requires libxml
to be available on the system, or in a mode where it is statically
linked. Even if
in the future we start having metadata that lets us describe
dependencies on external system libraries, it's never going to be the
case that we can put the *same* dependency metadata into wheels that
are built using these two configurations.

> You seem to be saying that wheels need a dependency on "the version of
> numpy they were built against". That sounds to me like a binary
> compatibility requirement that platform tags are intended to cover. It
> may well be a requirement that platform tags need significant
> enhancement (maybe even redesign) to cover, but it's not a dependency
> in the sense that pip and the packaging PEPs use the term. And if my
> understanding is correct, I'm against trying to fit that information
> into a dependency simply to work around the current limitations of the
> platform tag mechanism.
> I'm all in favour of new initiatives to make progress in areas that
> are currently stalled (we definitely need people willing to
> contribute) but we really don't have the resources to throw away the
> progress we've already made. Even if some of the packaging PEPs are
> still works in progress, what is there represents an investment we
> need to build on, not bypass.
> Paul
> [1] If extras and environment markers don't cover the needs of
> scientific modules, we need some input into their design from the
> scientific community. But again, let's not throw away the work that's
> already done.

As far as sdists go, you can either cover 90% of the cases by building
increasingly elaborate metadata formats, or you can cover 100% of the
cases by keeping things simple...


Nathaniel J. Smith -- http://vorpus.org
