[Distutils] metadata in sdists (was Re: Second draft of a plan for a new source tree / sdist format)

Nathaniel Smith njs at pobox.com
Thu Oct 29 19:23:00 EDT 2015

On Wed, Oct 28, 2015 at 5:32 AM, Daniel Holth <dholth at gmail.com> wrote:
> Nathaniel,
> I'm not sure what the software is supposed to do with fine grained dynamic
> metadata that would make very much sense to the end user. I think you could
> probably get away with a single flag Dynamic: true / false. Iff true, pip
> runs the dist-info command after installing bootstrap dependencies. You
> could still complain if the name & version changed. Of course in a VCS
> checkout or during development you probably always want the
> regenerate-metadata behavior.

So, right, the draft I just posted proposes that sdists should contain
static wheel-style metadata where possible, with a fine-grained
mechanism for marking specific parts as being dynamically generated at
build time. The different motivations that I'm trying to balance here are:

- Donald wants the fields that PyPI cares about (name, version,
summary, long_description, author, homepage, trove classifiers, etc.)
to *always* be statically present in sdists, even if there is some
other metadata that can't be made static (i.e., install-requirements)

- Robert wants the (name, version, install-requirements) information
to be available statically as often as possible, because that's the
information that the resolver needs to know when it's considering
trying to install some (name, version) pair, and he wants this
information to be cheap to access because the resolver may have to
backtrack and consider many different versions of the same package.
Installing build-requirements and then running egg-info/dist-info is
already pretty expensive, so it would be good if install-requirements
could be static in the 99% of cases where they are known.

- I'm still holding out some small hope of killing off the
"egg-info"/"dist-info" step entirely, because any time you have two
different operations and it's a bug if they can get out of sync (in
this case: the dist-info step, and then the actual wheel building
step), then maintaining and testing is a hassle and eventually that
bug will happen. Basically it's a violation of DRY -- you have two
sources of the same data and they're both supposed to be
authoritative. And we know from experience with egg_info that there
will be strong temptations for people to cheat, which creates all
kinds of headaches.

So how can we balance these different design goals? Obviously a new
sdist format should provide static metadata for the fields PyPI cares
about -- we basically require this already for sdists, we just encode
it in a weird out-of-band way during the PyPI upload instead of
recording it authoritatively in the sdist itself. Then that leaves a
2x2 space of plausible design options: we can either support dist-info
as a separate operation from building, or not; and we can make a
best-effort to provide static install-requirements when they're
available, or we can not bother and just never provide static
install-requirements.

None of this matters for 'pip install <source directory>'.

None of this matters for installations using wheels.

None of this matters for installations where the requirements are
straightforward to satisfy (e.g. cases where the current
pseudo-resolver works okay).

None of this affects correctness -- it's purely an optimization. But
maybe it's an important optimization in certain specific cases.

The case where this matters is like: suppose you just did 'pip install
scikit-learn', and scikit-learn requires scipy and numpy. And scipy
also requires numpy, so our dependency graph forms a triangle. And
let's say that scikit-learn is happy with any version of scipy, but
scikit-learn and scipy both have versioned requirements on numpy. So
for example, scikit-learn might require numpy (< 1.9). Meanwhile, the
latest version of scipy requires numpy (>= 1.9)... so there's no
version of numpy that satisfies both at once, which means we can't use
this version of scipy. So you have to consider the next-to-latest
version of scipy. But then it turns out that it also requires numpy
(>= 1.9). So then you have to consider the next-to-next-to-latest
version of scipy, which turns out to require numpy (>= 1.8), so phew,
we can use numpy = 1.8 to satisfy both of these at once and we're
done. And, to make this maximally annoying, it turns out that there is
no scipy wheel for your platform, so all this checking of scipy's
install-requires has to be done based on whatever information is
inside the scipy sdist. And finally, the reason we use scipy in this
example is that building scipy turns out to be really slow (~8 minutes
on my laptop), definitely an outlier among python packages.
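
To make the backtracking concrete, here is a minimal sketch of the search described above, in Python. The scipy and numpy version numbers in the tables are invented to mirror the example (they are not real release metadata), and a real resolver would be far more general. The point is only that every scipy candidate costs one look at its install-requires, and that look is where the time goes.

```python
# Hypothetical data: candidate scipy versions, newest first, paired with the
# minimum numpy version each one requires. Invented to mirror the example
# in the text, not taken from real releases.
SCIPY_REQUIRES = [
    ("0.17", (1, 9)),
    ("0.16", (1, 9)),
    ("0.15", (1, 8)),
]

# scikit-learn's constraint from the example: numpy < 1.9.
SKLEARN_NUMPY_UPPER = (1, 9)

# Hypothetical list of numpy versions on offer, newest first.
AVAILABLE_NUMPY = [(1, 10), (1, 9), (1, 8), (1, 7)]


def pick_versions():
    """Return (scipy_version, numpy_version, lookups).

    lookups counts how many scipy candidates had their install-requires
    inspected before a consistent combination was found.
    """
    lookups = 0
    for scipy_version, numpy_lower in SCIPY_REQUIRES:
        lookups += 1  # one metadata lookup per scipy candidate
        for numpy_version in AVAILABLE_NUMPY:
            if numpy_lower <= numpy_version < SKLEARN_NUMPY_UPPER:
                return scipy_version, numpy_version, lookups
    raise RuntimeError("no consistent set of versions found")


scipy_version, numpy_version, lookups = pick_versions()
print(scipy_version, numpy_version, lookups)
```

With this made-up data the search rejects the two newest scipy candidates and settles on the third, so it pays for three metadata lookups in total.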

There are three ways this can play out:

Option 1, "fast": if the scipy sdists have static install-requires
metadata, then this is pretty trivial. You might have to download the
sdists for the first two non-working versions, but once you have the
files you can immediately tell from the metadata that they aren't
going to be usable and move on. (Of course, scipy itself probably
*can't* have static install-requires metadata, but most packages can.)

Option 2, "slow": if the scipy sdists don't have static
install-requires metadata, but do support the dist-info operation,
then you do something like:
- download and unpack scipy-XX.zip
- discover that it build-requires numpy. If there's no scipy wheel for
your platform, there's presumably no numpy wheel either, so we'll have
to build it.
- download and build numpy, and install your newly built numpy into
scipy's build environment (~2 minutes?)
- run scipy's dist-info (a few seconds?)
- go back to the top and repeat for the next candidate

Option 3, "slowest": if you don't have static install-requires *and*
you don't have the dist-info operation, then it's identical to the
"slow" option, except instead of dist-info you have to actually run
the full scipy build at each step, so it takes ~10 minutes per cycle
instead of ~2 minutes per cycle (or maybe a bit less depending on
whether the build-requirements work out so that you can re-use the
same build of numpy for both attempts, etc.)

Now, our four design options break down like this:

no dist-info / never record install-requirements in sdist:
100% of sdists are in the "slowest" path

no dist-info / record install-requirements in sdist when available:
95% of sdists are on "fast" path, 5% (the ones with dynamic
install-requirements, like scipy) are on the "slowest" path

dist-info / never record install-requirements in sdist (= current sdist format):
100% of sdists are in the "slow" path

dist-info / record install-requirements in sdist when available:
95% of sdists are on "fast" path, 5% are on the "slow" path
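
A rough back-of-the-envelope version of this breakdown, as a Python sketch. The per-cycle times (~2 and ~10 minutes) and the 95%/5% split are the guesses from above. Two extra simplifying assumptions are made here: the "fast" path is treated as free (ignoring download time), and every package is pretended to be as expensive as scipy, which is the pessimistic outlier case. The names are for illustration only.

```python
# Approximate minutes per candidate for each path, using the rough
# scipy-like numbers from the text. "fast" is treated as free, which
# ignores download time (a simplifying assumption).
PATH_MINUTES = {"fast": 0.0, "slow": 2.0, "slowest": 10.0}

# Each design option maps to the fraction of sdists landing on each path.
DESIGN_OPTIONS = {
    "no dist-info / never static": {"slowest": 1.00},
    "no dist-info / static when available": {"fast": 0.95, "slowest": 0.05},
    "dist-info / never static": {"slow": 1.00},
    "dist-info / static when available": {"fast": 0.95, "slow": 0.05},
}


def expected_minutes(mix):
    """Expected minutes per sdist candidate for a given mix of paths."""
    return sum(fraction * PATH_MINUTES[path] for path, fraction in mix.items())


for name, mix in DESIGN_OPTIONS.items():
    print(f"{name}: ~{expected_minutes(mix):.1f} min per candidate")
```

Under these assumptions the two "static when available" options are both far cheaper on average than either "never static" option, and the presence or absence of dist-info only changes the cost of the remaining 5%.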

*** Summary ***

There are two different ways to speed up resolution of dependencies
involving sdists.

static install-requirements are super fast when they work, and their
complexity cost is low but non-zero. (We know we want to include lots
of other metadata in sdists, and we already have standard ways to
represent lots of metadata + install-requirements, so the only new
thing is the __SDIST_DYNAMIC__ bit.)

dist-info is slower and always works, but has a higher complexity
cost. (Every build backend needs to implement this extra operation,
and it's a bit fragile.)

If you think that:
- automagic building from sdists is an important case to support
seamlessly and will remain so indefinitely, and
- these kinds of complicated conflicts are common, and
- it's important to optimize them to be resolved as fast as possible,
then you'll want dist-info + static install-requirements. (IIUC this
is Robert's position.)

OTOH if you think that:
- automagic building of sdists is a mixed blessing at best (at least
in this particular example, what mostly actually happens is that
people get angry and swear at their computer because they didn't want
'pip install scikit-learn' to go rebuild scipy and numpy, argh why is
it my laptop suddenly swapping to death / argh I don't actually have a
functional toolchain installed because I'm on windows / argh the
resulting builds will seem to work at first but in fact be horribly
slow because my BLAS is in a funny place where it won't be autodetected
and so they'll fall back on unoptimized routines / ... see also Paul's
requests to just turn off sdist building entirely), and
- by the time these new formats are implemented and in wide use in a
few years then the increased maturity of the wheel ecosystem + Linux
wheels on PyPI + Donald's planned automated build servers will mean
that automagic building of sdists will only rarely be needed in practice, and
- these kinds of pathological requirements conflicts are going to be
pretty rare and anyway when they do arise then you don't care whether
it takes 60 minutes or merely 20 to sort out scikit-learn's
dependencies because you're probably going to hit Control-C either way,
then you might prefer the no-dist-info + no static
install-requirements approach, because it simplifies the build system
/ packaging stuff :-). (That was the reasoning behind my original
draft.)

And if you like that second argument, but have gotten push-back from
people like Robert, then you might try writing up a proposal for
static metadata in sdists, in the hopes that no-dist-info /
yes-static-metadata might be an acceptable compromise between the
above two positions :-).

This discussion is important structurally because it's where the
design of a new build system interface becomes coupled to the design
of a new sdist format: if you want to get a new build system interface
done as soon as possible and don't want to touch the static metadata
stuff, then the choice is between "no dist-info / no static metadata"
versus "yes dist-info / no static-metadata", and if these are your
options then it pushes you strongly towards supporting a dist-info
operation. But then you risk saddling all build systems with the
requirement to implement this dist-info thing, even if it later turns
out that static sdist metadata makes it unnecessary.

As mentioned in the other thread about promoting extras to first-class
packages, there is still some hope that we might be able to get away
with having full static metadata in all sdists, in which case the
dist-info operation becomes completely vestigial -- but that
discussion is going to be much more involved and certainly requires
the addition of a real resolver for pip. So we probably don't want to
block all progress on a better build system while waiting for that.

One possible compromise to streamline things might be:
- give up on specifying a new sdist format for now (sorry Donald!) --
it's a good idea, we have some idea how to do it, but treat it as a
separate project
- include a dist-info operation in the new standard build system interface...
- ...but make it optional.

The cost of making it optional is that pip would need to be prepared
with a fallback if the dist-info operation is not available -- so the
get-install-requirements step would look like:

def get_install_requirements(source_directory):
    if supports_dist_info(source_directory):
        # Cheap path: ask the build system for just the metadata.
        dist_info_directory = run_dist_info_hook(source_directory)
        return dist_info_to_install_requirements(dist_info_directory)
    else:
        # Fallback: build the whole wheel and read its metadata.
        wheel = run_wheel_build_hook(source_directory)
        return wheel_to_install_requirements(wheel)

Compared to a design where dist-info is mandatory, this is some extra
complexity added to pip, but not a huge amount -- the above is
pseudo-code, but aside from supports_dist_info all the functions it
calls are ones that need to exist anyway, and the extra logic is
neatly encapsulated within get_install_requirements. The minor
advantage of doing this is that if some project/build system doesn't
need this logic they don't need to implement it (e.g. for flit, which
only supports pure python packages, there's absolutely no use -- it
can build a wheel as fast as it can generate just the dist-info, and
it doesn't even matter because projects using flit will ~always have
wheels available and the resolver will ~never need to look at an
sdist). The major advantage is that if later on it turns out that
dist-info is useless (e.g. because sdists start shipping with full
static metadata), then making it optional now means that we'll have
the option of just dropping the support from build systems and from
pip without breaking backcompat. (Compare to the situation where we
make it mandatory now: since there will be versions of pip in the wild
that blow up if the operation is not provided, every build system may
be stuck providing some sort of dist-info support for a decade or
whatever, even if it's not needed.)
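
For concreteness, here is a runnable toy version of that fallback in Python. The backend classes and their `build_dist_info` / `build_wheel` methods are hypothetical stand-ins for whatever the build system interface ends up specifying, and the hooks return plain dicts rather than real dist-info directories or wheels.

```python
class WheelOnlyBackend:
    """Toy backend (hypothetical) that only knows how to build wheels."""

    def build_wheel(self, source_directory):
        # Stand-in for a real wheel build; returns the metadata it would embed.
        return {"install_requires": ["requests"], "built_from": "wheel"}


class DistInfoBackend(WheelOnlyBackend):
    """Toy backend (hypothetical) that also offers the optional dist-info hook."""

    def build_dist_info(self, source_directory):
        # Stand-in for the cheaper metadata-only operation.
        return {"install_requires": ["requests"], "built_from": "dist-info"}


def get_install_requirements(backend, source_directory):
    """Prefer the optional dist-info hook; fall back to a full wheel build."""
    dist_info_hook = getattr(backend, "build_dist_info", None)
    if dist_info_hook is not None:
        metadata = dist_info_hook(source_directory)
    else:
        metadata = backend.build_wheel(source_directory)
    return metadata["install_requires"], metadata["built_from"]


print(get_install_requirements(DistInfoBackend(), "src/"))
print(get_install_requirements(WheelOnlyBackend(), "src/"))
```

Either way the caller ends up with the same install-requirements; the only difference is how much work was done to obtain them, which is why dropping the hook later would not break anything in this sketch.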

Alternatively we could just leave dist-info out for now, and implement
it later as an optional optimization that a build system can provide
if it turns out to matter :-). This would definitely help move things
forward more quickly, because I wouldn't be surprised if the most
complicated part of implementing Robert's current spec turned out to
be the code needed to validate that dist-info and build-wheel actually
return matching metadata :-). You absolutely need to do this by
default, because otherwise we'll be back where we are now with people
intentionally playing horrible fragile tricks, and weird accidental
inconsistencies going undetected and creating subtle bugs. But to
implement this checking you need some complex flow control to make
sure that the eventual build-wheel step has access to the original
metadata, plus the code for checking metadata-equality might be


