[Distutils] Towards a simple and standard sdist format that isn't intertwined with distutils

Paul Moore p.f.moore at gmail.com
Sat Oct 3 00:52:40 CEST 2015


On 2 October 2015 at 23:15, Nathaniel Smith <njs at pobox.com> wrote:
> "Project" is a pretty messy concept. Obviously in simple cases there's
> a one-to-one mapping between project <-> wheel <-> importable package,
> but this breaks down quickly in edge cases.

I mistakenly used "project" in an attempt to avoid the confusion
caused by my using "distribution" as a more general term than your
"source distribution" and "binary distribution". Clearly I failed and
only made things more confusing.

I use the term "distribution" in the sense defined here:
https://packaging.python.org/en/latest/glossary/#term-distribution-package.
Note that this is in contrast to the terms "source distribution" and
"binary distribution" (or "built distribution") on the same page.

Sorry for confusing things. I'll stick to the terminology as in the
PUG glossary from now on.

> Consider a project that builds multiple wheels out of the
> same source tree. You obviously can't expect that all of these
> packages will have the same dependencies.

Correct. But a distribution can and should (I believe) have the same
dependencies for all of the source and built distributions derived
from it.
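
To illustrate what I mean with a minimal (and entirely made-up)
sketch: the dependencies are declared once, per distribution, and the
same declarations end up in the metadata of the sdist and of every
wheel built from it:

    # setup.py for a hypothetical distribution "frobnicate"
    from setuptools import setup, find_packages

    setup(
        name="frobnicate",
        version="1.0",
        packages=find_packages(),
        # Declared once; the same requirements apply to the sdist
        # and to every wheel built from this distribution.
        install_requires=[
            "requests >= 2.0",
            "six",
        ],
    )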

> This situation is not common today for Python packages, but the only
> reason for that is that distutils makes it really hard to do -- it's
> extremely common in other package ecosystems, and the advantages are
> obvious. E.g., maybe numpy.distutils should be split into a separately
> installable package from numpy -- there's no technical reason that
> this should mean we are now forced to move the code for it into its
> own VCS repository.

I'm lost here, I'm afraid. Could you rephrase this in terms of the
definitions from the PUG glossary? It sounds to me like the VCS
repository is the project, which contains multiple distributions. I
don't see how that's particularly hard. Each distribution just has its
own subdirectory (and setup.py) in the VCS repository...
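
For concreteness, taking your numpy.distutils example, I'm picturing
something like this (a hypothetical layout, names made up):

    numpy-repo/                 <- the VCS repository (the "project")
        numpy/
            setup.py            <- builds the "numpy" distribution
            numpy/ ...
        numpy-distutils/
            setup.py            <- builds a separate distribution
            numpy/distutils/ ...    from the same repository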

> (I assume that by "platform tags" you mean what PEP 426 calls
> "environment markers".)

Nope, I mean platform tags as defined in PEP 425. The platform tag is
part of the compatibility tag. Maybe I meant the ABI tag; I don't
really follow the distinctions.
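
For reference, the tags I'm talking about are the ones embedded in a
wheel's filename, as defined by PEP 425 - for example (an illustrative
filename):

    numpy-1.10.0-cp35-cp35m-win_amd64.whl
                 |    |     |
                 |    |     +-- platform tag (e.g. win_amd64, linux_x86_64)
                 |    +-- ABI tag (e.g. cp35m)
                 +-- Python tag (e.g. cp35)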

> Environment markers are really useful for extending the set of cases
> that can be handled by a single architecture-dependent wheel. And
> they're a good fit for that environment, given that wheels can't
> contain arbitrary code.
>
> But they're certainly never going to be adequate to provide a single
> static description of every possible build configuration of every
> possible project. And installing an sdist already requires arbitrary
> code execution, so it doesn't make sense to try to build some
> elaborate system to avoid arbitrary code execution just for the
> dependency specification.
>
> You're right that in a perfect future world numpy C API related
> dependencies would be handled by some separate ABI-tracking mechanism
> similar to how the CPython ABI is tracked, so here are some other
> examples of why environment markers are inadequate:
>
> In the future it will almost certainly be possible to build numpy in
> two different configurations: one where it expects to find BLAS inside
> a wheel distributed for this purpose (e.g. this is necessary to
> provide high-quality windows wheels), and one where it expects to find
> BLAS installed on the system. This decision will *not* be tied to the
> platform, but be selectable at build time. E.g., on OS X there is a
> system-provided BLAS library, but it has some issues. So the default
> wheels on PyPI will probably act like windows and depend on a
> BLAS-package that we control, but there will also be individual users
> who prefer to build numpy in the configuration where it uses the
> system BLAS, so we definitely need to support both options on OS X.
> Now the problem: There will never be a single environment marker that
> you can stick into a wheel or sdist that says "we depend on the
> 'pyblas' package if the system is OS X (ok) and the user set this flag
> in this configuration file during the build process (wait wut)".
>
> Similarly, I think someone was saying in a discussion recently that
> lxml supports being built either in a mode where it requires libxml be
> available on the system, or else it can be statically linked. Even if
> in the future we start having metadata that lets us describe
> dependencies on external system libraries, it's never going to be the
> case that we can put the *same* dependency metadata into wheels that
> are built using these two configurations.

This is precisely the very complex issue that's being discussed under
the banner of extending compatibility tags: finding a viable,
practical way of distinguishing binary wheels. You can see that either
as a discussion about "expanding compatibility tags" or about "finding
something better than compatibility tags". I don't have much of a
stake in it, as the current compatibility tags suit my needs fine as a
Windows user. The issues seem to centre on Linux, and on some of the
complexities of binary dependencies for numerical libraries.

But the key point here is that I see the solution as being about
selecting the "right" wheel for the target environment. It's not about
anything that should reach back into sdists. Maybe a solution will
involve a PEP 426 enhancement adding metadata that's only valid in
binary distributions and not in source distributions - that's fine by
me - but it won't replace the existing dependency data, which *is*
valid at the sdist level.

At least as far as I can see - I'm willing to be enlightened. But your
argument seems to be that sdist-level dependency information should be
omitted because more detailed ABI compatibility data *might* be needed
at the wheel level for some packages. I don't agree with that - we
still need the existing metadata, even if more might be required in
specialist cases.
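
To spell out what I mean by "the existing metadata": dependencies
declared statically, with environment markers covering per-environment
variation - roughly like this (hypothetical names; the conditional
syntax needs a reasonably recent setuptools/pip):

    from setuptools import setup

    setup(
        name="example",
        version="1.0",
        install_requires=["requests"],
        # A conditional dependency expressed declaratively via an
        # environment marker, rather than computed by code in setup.py.
        extras_require={
            ':sys_platform == "win32"': ["pywin32"],
        },
    )

which ends up in the built metadata as something like

    Requires-Dist: requests
    Requires-Dist: pywin32; sys_platform == "win32"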


>> [1] If extras and environment markers don't cover the needs of
>> scientific modules, we need some input into their design from the
>> scientific community. But again, let's not throw away the work that's
>> already done.
>
> As far as sdists go, you can either cover 90% of the cases by building
> increasingly elaborate metadata formats, or you can cover 100% of the
> cases by keeping things simple...

But your argument seems to be that having metadata generated by
package build code is "simpler". My strong opinion, based on what I've
seen of the problems caused by having metadata in an "executable
setup.py", is that static metadata is far simpler.

I don't believe that the cost of changing to a new system can be
justified *without* getting the benefits of static metadata.
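
To spell out the contrast (again a deliberately simplified, made-up
example): the "executable setup.py" approach computes the dependency
list at build/install time, so no tool can know it without running the
code:

    import sys
    from setuptools import setup

    deps = ["requests"]
    if sys.platform == "win32":
        # Only discoverable by executing setup.py on each platform.
        deps.append("pywin32")

    setup(name="example", version="1.0", install_requires=deps)

Declaring the same thing with an environment marker, as in the earlier
sketch, means any tool can read the dependencies without executing
anything - which is the kind of simplicity I care about.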

Paul

