[Distutils] wacky idea about reifying extras

Nathaniel Smith njs at pobox.com
Mon Oct 26 03:36:25 EDT 2015


Hi all,

I had a wacky idea and I can't tell if it's brilliant, or ridiculous,
or both. Which makes sense given that I had a temperature of 39.3 when
I thought of it, but even after getting better and turning it over in
my mind for a while I still can't tell, so, I figured I'd put it out
there and see what y'all thought :-).

The initial motivation was to try and clean up an apparent infelicity
in how extras work. After poking at it for a bit, I realized that it
might accidentally solve a bunch of other problems, including removing
the need for dynamic metadata in sdists (!). If it can be made to work
at all...


Motivating problem
------------------

"Extras" are pretty handy: e.g., if you just want the basic ipython
REPL, you can do ``pip install ipython``, but if you want the fancy
HTML notebook interface that's implemented in the same source code
base but has lots of extra heavy dependencies, then you can do ``pip
install ipython[notebook]`` instead.

Currently, extras are implemented via dedicated metadata fields inside
each package -- when you ``pip install ipython[notebook]``, it
downloads ipython-XX.whl, and when it reads the metadata about this
package's dependencies and entry points, the ``notebook`` tag flips on
some extra dependencies and extra entry points.
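
For reference, here is roughly what this looks like on the declaring
side today -- a stripped-down sketch rather than IPython's actual
setup.py, with purely illustrative dependency names::

  from setuptools import setup

  setup(
      name="ipython",
      version="4.0.0",
      # always installed
      install_requires=["traitlets", "pickleshare"],
      extras_require={
          # only pulled in by ``pip install ipython[notebook]``
          "notebook": ["notebook", "ipywidgets"],
      },
  )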

This worked pretty well for ``easy_install``'s purposes, since
``easy_install`` mostly cared about installing things; there was no
concern for upgrading. But people do want to upgrade, and as ``pip``
gets better at this, extras are going to start causing problems.

Hopefully soon, pip will get a proper resolver, and then an
``upgrade-all`` command like other package managers have. Once this
happens, it's entirely possible -- indeed, a certainty in the long run
-- that if you do::

    $ pip install ipython[notebook]
    # wait a week
    $ pip upgrade-all

you will no longer have the notebook installed, because the new
version of ipython added a new dependency to the "notebook" extra, and
since there's no record that you ever installed that extra, this new
dependency won't be installed when you upgrade. I'm not sure what
happens to any entry points or scripts that were part of
ipython[notebook] but not ipython -- I'm guessing they'd still be
present, but broken? If you want to upgrade while keeping the notebook
around, then ``upgrade-all`` is useless to you; you have to manually
keep a list of all packages-with-extras you have installed and
explicitly pass them to the upgrade command every time. Which is
terrible UX.

Supporting extras in this manner also spreads complexity through the
system: e.g., the specification of entry points becomes more complex
because you need a way to make them conditional on particular extras;
PEP 426 proposes special mechanisms to allow package A to declare a
dependency on extra B of package C; etc. And extras have minor but
annoying limitations, e.g. there's no mechanism for storing a proper
description of what an extra provides and why you might want it.


Solution (?): promoting extras to first-class packages
------------------------------------------------------

There's an obvious possible solution, inspired by how other systems
(e.g. Debian) handle this situation: promote ``ipython[notebook]`` to
a full-fledged package, that happens to contain no files of its own,
but which gets its own dependencies and other metadata.

What would this package look like? Something like::

  Name: ipython[notebook]
  Version: 4.0.0
  Requires-Dist: ipython (== 4.0.0)
  Requires-Dist: extra_dependency_1
  Requires-Dist: extra_dependency_2
  Requires-Dist: ...

  The ``notebook`` extra extends IPython with an HTML interface to...

Installing it needs to automatically trigger the installation of
ipython, so it should depend on ipython. It needs to be upgraded in
sync with ``ipython``, so this dependency should be an exact version
dependency -- that way, upgrading ``ipython`` will (once we have a
real resolver!) force an upgrade of ``ipython[notebook]`` and
vice-versa. Then of course we also need to include the extra's unique
dependencies, and whatever else we want (e.g. a description).

What would need to happen to get there from here? AFAICT a relatively
small set of changes would actually suffice:

**PyPI:** starts allowing the upload of wheels named like
``BASE[EXTRA]-VERSION-COMPAT.whl``. They get special handling, though:
whoever owns ``BASE`` gets to do whatever they like with names like
``BASE[EXTRA]`` -- that's an established rule -- so wheels following
this naming scheme would be treated like other artifacts associated
with the (BASE, VERSION) release. In particular, the uploader would
need to have write permission to the ``BASE`` name, and it would
remain impossible to register top-level distribution names containing
square brackets.

**setuptools:** Continues to provide "extra" metadata inside the
METADATA file just as it does now (for backwards compatibility with
old versions of pip that encounter new packages). In addition, though,
the egg-info command would start generating .egg-info directories for
each defined extra (according to the schema described above), the
bdist_wheel command would start generating a wheel file for each
defined extra, etc.
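
So, hypothetically, a single ``bdist_wheel`` run on the ipython
example above might leave something like this in dist/ (exact
compatibility tags will vary)::

  ipython-4.0.0-py2.py3-none-any.whl
  ipython[notebook]-4.0.0-py2.py3-none-any.whl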

**pip:** Uses a new and different mechanism for looking up packages with extras:

- when asked to fulfill a requirement for ``BASE[EXTRA1,EXTRA2,...] (>
X)``, it should expand this to ``BASE[EXTRA1] (> X), BASE[EXTRA2] (>
X), ...``, and then attempt to find wheels with those actual names (a
sketch of this expansion step follows this list)

- backcompat case: if we fail to find a BASE[EXTRA] wheel, then fall
back to fetching a wheel named BASE and attempt to use the "extra"
metadata inside it to generate BASE[EXTRA], and install this (this is
morally similar to the fallback logic where if it can't find foo.whl
it tries to generate it from the foo.zip sdist)

  - Optionally, PyPI itself could auto-generate these wheels for
legacy versions (since they can be generated automatically from static
wheel metadata), thus guaranteeing that this path would never be
needed, and then pip could disable this fallback path... but I guess
it would still need it to handle non-PyPI indexes.

- if this fails, then it falls back to fetching an sdist named BASE
(*not* BASE[EXTRA]) and attempting to build it (while making sure to
inject a version of setuptools that's recent enough to include the
above changes).
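
Here is a rough sketch of that first expansion step, using the
``packaging`` library's requirement parser -- illustrative code for
the idea, not anything pip actually contains::

  from packaging.requirements import Requirement

  def expand_extras(req_string):
      """Expand BASE[E1,E2] (> X) into BASE[E1] (> X), BASE[E2] (> X), ..."""
      req = Requirement(req_string)
      if not req.extras:
          return [req_string]
      return ["{}[{}]{}".format(req.name, extra, req.specifier)
              for extra in sorted(req.extras)]

  expand_extras("ipython[notebook,test] > 4.0")
  # -> ['ipython[notebook]>4.0', 'ipython[test]>4.0']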

**PEP 426:** can delete all the special case stuff for extras, because
they are no longer special cases, and there's no backcompat needed for
a format that is not yet in use.

**twine and other workflows:** ``twine upload dist/*`` continues to do
the right thing (now including the new extra wheels). Other workflows
might need the obvious tweaking to include the new wheel files.

So this seems surprisingly achievable (except for the obvious glaring
problem that I missed but someone is about to point out?); it would
improve correctness in the face of upgrades, simplify our conceptual
model, and provide a more solid basis for future improvements (e.g. if
in the future we add better tracking of which packages were manually
installed, then this will automatically apply to extras as well, since
they are just packages).


But wait there's more
---------------------

**Non-trivial plugins:** Once we've done this, suddenly extras become
much more powerful. Right now it's a hard constraint that extras can
only add new dependencies and entry points, not contain any code. But
this is rather artificial -- from the user's point of view, 'foo[bar]'
just means 'I want a version of foo that has the bar feature'; they
don't care whether this requires installing some extra foo-to-bar shim
code. With extras as first-class packages, it becomes possible to use
this naming scheme for things like plugins or compiled extensions that
add actual code.

**Build variants:** People keep asking for us to provide numpy builds
against Intel's super-fancy closed-source (but freely redistributable)
math library, MKL. It will never be the case that 'pip install numpy'
will automatically give you the MKL-ified version, because see above
re: "closed source". But we could provide a
numpy[mkl]-{VERSION}-{COMPAT}.whl with metadata like::

  Name: numpy[mkl]
  Conflicts: numpy
  Provides: numpy

which acts as a drop-in replacement for the regular numpy for those
who explicitly request it via ``pip install numpy[mkl]``.

This involves two new concepts on top of the ones above:

Conflicts: is missing from the current metadata standards but (I
think?) trivial to implement in any real resolver. It means "I can't
be installed at the same time as something else which matches this
requirement". In a sense, it's actually an even more primitive concept
than a versioned requirement -- Requires: foo (> 2.0) is equivalent to
Requires: foo + Conflicts: foo (<= 2.0), but there's no way to expand
an arbitrary Conflicts in terms of Requires. (A minor but important
wrinkle: the word "else" is important there; you need a special case
saying that a package never conflicts with itself. But I think that's
the only tricky bit.)
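
To make that special case concrete, a resolver's conflict check might
look something like this (the data structures here are made up for
illustration)::

  def conflict_violated(candidate, conflict, selected_versions):
      """candidate: the distribution being considered; conflict: one of
      its Conflicts: entries as (name, specifier); selected_versions:
      dict mapping already-selected distribution names to versions."""
      name, specifier = conflict
      if name == candidate.name:
          # a package never conflicts with itself
          return False
      return (name in selected_versions
              and selected_versions[name] in specifier)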

Provides: is trickier -- there's some vestigial support in the current
standards and even in pip, but AFAICT it hasn't really been worked out
properly. The semantics are obvious enough (Provides: numpy means that
this package counts as being numpy; there are some subtleties around
what version of numpy it should count as but I think that can be
worked out), but it opens a can of worms, because you don't want to
allow things like::

  Name: numpy
  Provides: django

But once you have the concept of a namespace for multiple
distributions from the same project, then you can limit Provides: so
that it's only legal if the provider distribution and the provided
distribution have the same BASE. This solves the social problem (PyPI
knows that numpy[mkl] and numpy are 'owned' by the same people, so
this Provides: is OK), and provides algorithmic benefits (if you're
trying to find some package that provides foo[extra] out of a flat
directory of random distributions, then you only have to examine
wheels and sdists that have BASE=foo).
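
As a sketch of how PyPI (or a tool) could check that restriction --
the name grammar and helpers here are hypothetical::

  import re

  _NAME_RE = re.compile(r"^(?P<base>[^\[\]]+)(\[[^\[\]]+\])?$")

  def base_of(dist_name):
      """'numpy[mkl]' -> 'numpy'; 'numpy' -> 'numpy'."""
      return _NAME_RE.match(dist_name).group("base")

  def provides_allowed(provider, provided):
      # numpy[mkl] may say "Provides: numpy", but numpy may not say
      # "Provides: django"
      return base_of(provider) == base_of(provided)

  assert provides_allowed("numpy[mkl]", "numpy")
  assert not provides_allowed("numpy", "django")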

The other advantage to having the package be ``numpy[mkl]`` instead of
``numpy-mkl`` is that it correctly encodes that the sdist is
``numpy.zip``, not ``numpy-mkl.zip`` -- the rules we derived to match
how extras work now are actually exactly what we want here too.

**ABI tracking:** This also solves another use case entirely: the
numpy ABI tracking problem (which is probably the single biggest
problem the numerical crowd has with current packaging, because it
actually prevents us from making basic improvements to our code -- the
reason I've been making a fuss about other things first is that until
now I couldn't figure out any tractable way to solve this problem, but
now I have hope). Once you have Provides: and a namespace to use with
it, then you can immediately start using "pure virtual" packages to
keep track of which ABIs are provided by a single distribution, and
determine that these two packages are consistent::

  Name: numpy
  Version: 1.9.2
  Provides: numpy[abi-2]
  Provides: numpy[abi-3]

  Name: scipy
  Depends: numpy
  Depends: numpy[abi-2]

(AFAICT this would actually make pip *better* than conda as far as
numpy's needs are concerned.)

The build variants and virtual packages bits also work neatly
together. If SciPy wants to provide builds against multiple versions
of numpy during the transition period between two ABIs, then these are
build variants exactly like numpy[mkl]. For their 0.17.0 release they
can upload::

  scipy-0.17.0.zip
  scipy[numpy-abi-2]-0.17.0.whl
  scipy[numpy-abi-3]-0.17.0.whl

(And again, it would be ridiculous to have to register
scipy-numpy-abi-2, scipy-numpy-abi-3, etc. on PyPI, and upload
separate sdists for each of them. Note that there's nothing magical
about the names -- those are just arbitrary tags chosen by the
project; what pip would care about is that one of the wheels' metadata
says Requires-Dist: numpy[abi-2] and the other says Requires-Dist:
numpy[abi-3].)

So far as I can tell, these build variant cases therefore cover *all
of the situations that were discussed in the previous thread* as
reasons why we can't necessarily provide static metadata for an sdist.
The numpy sdist can't statically declare a single set of install
dependencies for the resulting wheel... but it could give you a menu,
and say that it knows how to build numpy.whl, numpy[mkl].whl, or
numpy[external-blas].whl, and tell you what the dependencies will be
in each case. (And maybe it's also possible to make numpy[custom] by
manually editing some configuration file or whatever, but pip would
never be called upon to do this so it doesn't need the static
metadata.) So I think this would be sufficient to let us start
providing full static metadata inside sdists?

(Concretely, I imagine that the way this would work is that when we
define the new sdist hooks, one of the arguments that pip would pass
in when running the build system would be a list of the extras that
it's hoping to see, e.g. "the user asked for numpy[mkl], please
configure yourself accordingly". For legacy setuptools builds that
just use traditional extras, this could safely be ignored.)
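
Purely as illustration -- the hook name, signature, and dependency
lists below are all made up -- the sdist's build code might expose
something like::

  def get_buildable_wheels(requested_extras=()):
      """Return a menu: buildable wheel name -> static install requirements.

      ``requested_extras`` is the hint pip would pass, e.g. ("mkl",)
      when the user asked for numpy[mkl].
      """
      menu = {
          "numpy": [],
          "numpy[mkl]": ["mkl-runtime"],
          "numpy[external-blas]": ["some-external-blas"],
      }
      wanted = ["numpy[{}]".format(e) for e in requested_extras] or ["numpy"]
      return {name: menu[name] for name in wanted}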


TL;DR
-----

If we:

- implement a real resolver, and
- add a notion of a per-project namespace of distribution names that
are collected under the same PyPI registration and come from the same
sdist, and
- add Conflicts:, and Provides:,

then we can elegantly solve a collection of important and difficult
problems, and we can retroactively pun the old extras system onto the
new system in a way that preserves 100% compatibility with all
existing packages.

I think?

What do you think?

-n

-- 
Nathaniel J. Smith -- http://vorpus.org

