[Distutils] wacky idea about reifying extras

Mon Oct 26 19:41:21 EDT 2015

On Mon, Oct 26, 2015 at 4:41 AM, Donald Stufft <donald at stufft.io> wrote:
> On October 26, 2015 at 3:36:47 AM, Nathaniel Smith (njs at pobox.com) wrote:
>> TL;DR
>> -----
>>
>> If we:
>>
>> - implement a real resolver, and
>> - add a notion of a per-project namespace of distribution names, that
>>   are collected under the same PyPI registration and come from the
>>   same sdist, and
>> - add Conflicts:, and Provides:,
>>
>> then we can elegantly solve a collection of important and difficult
>> problems, and we can retroactively pun the old extras system onto the
>> new system in a way that preserves 100% compatibility with all
>> existing packages.
>>
>> I think?
>>
>> What do you think?
>
> My initial reaction when I started reading your idea was that I didn't see a
> point in having something like foo[bar] be a "real" package when you could just
> as easily have foo-bar. However, as I continued to read through the idea it
> started to grow on me. I think I need to let it percolate in my brain a little
> bit, but there may be a non-crazy (or at least, crazy in a good way) idea here
> that could push things forward in a nice way.

Oh good, at least I'm not the only one :-).

I'd particularly like to hear Robert's thoughts when he has time,
since the details depend strongly on some assumptions about how a real
resolver would work.

> Some random thoughts:
>
> * Reusing the extra syntax is nice because it doesn't require end users to
>   learn any new concepts, however we shouldn't take a new syntax off the table
>   either if it makes the feature easier to implement with regards to backwards
>   compatibility. Something like numpy{mkl,some-other-thing} could work just as
>   well too. We'll need to make sure that whatever symbols we choose can be
>   represented on all the major filesystems we care about and that they are
>   ideally non-ugly in a URL too. Of course, the filename and user interface
>   symbols don't *need* to match. It could just as easily expand numpy[mkl] out
>   to numpy#mkl or whatever which should make it easier to come up with a nice
>   scheme.

Right -- obviously it would be *nice* to keep the number of concepts
down, and to avoid skew between filenames and user interface (because
users do see filenames), but if these turn out to be impossible then
there are still options that would let us save the
per-project-package-namespace idea.

> * Provides is a bit of an odd duck, I think in my head I've mostly come to
>   terms with allowing unrestricted Provides when you've already installed the
>   package doing the Providing but completely ignoring the field when pulling
>   data from a repository. Our threat model assumes that once you've selected to
>   install something then it's generally safe to trust (though we still do try
>   to limit that). The problem with Provides mostly comes into play when you
>   will respect the Provides: field for any random package on PyPI (or any other
>   repo).

Yeah, I'm actually not too worried about malicious use either in
practice, for the reason you say. But even so I can think of two good
reasons we might want to be careful about stating exactly when
"Provides:" can be trusted:

1) if you have neither scipy nor numpy installed, and you do 'pip
install scipy', and scipy depends on the pure virtual package
'numpy[abi-2]' which is only available as a Provides: on the concrete
package 'numpy', then in this case the resolver has to take Provides:
into account when pulling data from the repo -- if it doesn't, then
it'll ignore the Provides: on 'numpy' and say that scipy's
dependencies can't be satisfied. So for this use case to work, we
actually do need to be able to sometimes trust Provides: fields.

2) the basic idea of a resolver is that it considers a whole bunch of
possible configurations for your environment, and picks the
configuration that seems best. But if we pay attention to different
metadata when installing as compared to after installation, then this
skew makes it possible for the algorithm to pick a configuration that
looks good a priori but is broken after installation. E.g. for a
simple case:

  Name: a
  Conflicts: some-virtual-package

  Name: b
  Provides: some-virtual-package

'pip install a b' will work, because the resolver ignores the
Provides: and treats the packages as non-conflicting -- but then once
installed we have a broken system. This is obviously an artificial
example, but creating the possibility of such messes just seems like
the kind of headache we don't need. So I think whatever we do with
Provides:, we should do the same thing both before and after
installation.
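
To make the skew concrete, here is a rough Python sketch of the a/b
example. The PACKAGES table and the find_conflicts() helper are made up
for illustration (they are not pip code, and a real resolver does much
more than this); the only point is that the same two packages look fine
when Provides: is ignored and broken when it is honored.

```python
# Hypothetical metadata for the two packages in the example above.
PACKAGES = {
    "a": {"provides": set(), "conflicts": {"some-virtual-package"}},
    "b": {"provides": {"some-virtual-package"}, "conflicts": set()},
}


def find_conflicts(names, honor_provides):
    """Return sorted (package, conflicting_name) pairs among `names`."""
    # Map every name that counts as "present" to the packages supplying it:
    # real package names always, Provides: entries only when honored.
    present = {}
    for name in names:
        present.setdefault(name, set()).add(name)
        if honor_provides:
            for virtual in PACKAGES[name]["provides"]:
                present.setdefault(virtual, set()).add(name)

    found = []
    for name in names:
        for target in PACKAGES[name]["conflicts"]:
            # A package does not conflict with itself.
            suppliers = present.get(target, set()) - {name}
            if suppliers:
                found.append((name, target))
    return sorted(found)


# What a resolver that ignores Provides: would see before installation.
print(find_conflicts(["a", "b"], honor_provides=False))
# What the installed environment looks like once Provides: is honored.
print(find_conflicts(["a", "b"], honor_provides=True))
```

Using one value of honor_provides everywhere is the "same thing both
before and after installation" rule.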

A simple safe rule is to say that Provides: is always legal iff a
package's Name: and Provides: have a matching BASE, and always illegal
otherwise, making the package just invalid, like if the METADATA were
written in Shift-JIS or something. This rule is trivial to statically
check/enforce, and could always be relaxed more later.
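
As a sketch of how cheap the static check could be: the snippet below
assumes names have the shape "BASE" or "BASE[TAG]", with BASE compared
case-insensitively. That grammar, the regex, and the function names are
assumptions of mine for illustration, not a settled spec.

```python
import re

# Assumed name grammar: "BASE" or "BASE[TAG]".
_NAME_RE = re.compile(
    r"^(?P<base>[A-Za-z0-9._-]+)(\[(?P<tag>[A-Za-z0-9._-]+)\])?$"
)


def base_of(name):
    """Return the lowercased BASE part of a distribution name."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError("malformed name: %r" % name)
    return match.group("base").lower()


def provides_is_legal(name, provides):
    """True iff every Provides: entry shares the BASE of Name:."""
    base = base_of(name)
    return all(base_of(entry) == base for entry in provides)


# The concrete 'numpy' package may provide the virtual 'numpy[abi-2]'...
print(provides_is_legal("numpy", ["numpy[abi-2]"]))
# ...but 'b' may not provide a name from somebody else's namespace.
print(provides_is_legal("b", ["some-virtual-package"]))
```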

> * The upgrade mess around extras as they stand today could also be solved just
>   by recording what extras (if any) were selected to be installed so that we
>   keep a consistent view of the world. Your proposal is essentially doing that,
>   just by (ab)using the fact that by installing a package we essentially get
>   that aspect of it for "free".

Right -- you certainly could implement a database of installed extras
to go alongside the database of installed packages, but it seems like
it just makes things more complicated with minimal benefit. E.g., you
have to add special case code to the resolver to check both databases,
and then you have to add more special case code to 'pip freeze' to
make sure *it* checks both databases... this kind of stuff adds up.

> * Would this help at all with differentiating between SSE2 and SSE3 builds and
>   things like that? Or does that need something more automatic to be really
>   usable?

I'm not convinced that SSE2 versus SSE3 is really worth trying to
handle automatically, just because we have more urgent issues and
everyone else in the world seems to get by okay without special
support for this in their package system (even if it's not always
optimal). But if we did want to do this then my intuition is that it'd
be more elegant to do it via the wheel platform/architecture field,
since this actually is a difference in architectures? So you could
have one wheel for the "win32" platform and another wheel for the
"win32sse3" platform, and the code in the installer that figures out
which wheels are compatible would know that both of these are
compatible with the machine it was running on (or not), and that
win32sse3 is preferable to plain win32.
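
Roughly, the selection logic could look like the sketch below. The
"win32sse3" tag is the hypothetical one from the paragraph above, and
pick_wheel() is an invented helper, not how pip actually ranks wheels;
the installer is assumed to know its supported tags in preference order.

```python
def pick_wheel(available_platforms, supported_in_preference_order):
    """Return the most preferred supported platform tag, or None."""
    available = set(available_platforms)
    for tag in supported_in_preference_order:
        if tag in available:
            return tag
    return None


# A machine with SSE3 accepts both tags and prefers the specific one.
sse3_machine = ["win32sse3", "win32"]
# A machine without SSE3 only accepts the plain tag.
plain_machine = ["win32"]

print(pick_wheel(["win32", "win32sse3"], sse3_machine))
print(pick_wheel(["win32", "win32sse3"], plain_machine))
```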

> * PEP 426 (I think it was?) has some extra syntax for extras which could
>   probably be really nice here, things like numpy[*] to get *all* of the extras
>   (though if they are real packages, what even is "all"?). It also included
>   (though this might have been only in my head) default to installed packages
>   which meant you could do something like split numpy into numpy[abi2] and
>   numpy[abi3] packages and have the different ABIs actually contained within
>   those other packages. Then you could have your top level package default to
>   installing abi3 and abi2 so that ``pip install numpy`` is equivalent to
>   ``pip install numpy[abi2,abi3]``. The real power there is that people can
>   trim down their install a bit by then doing ``pip install numpy[-abi2]`` if
>   they don't want to have that on-by-default feature.

Hmm, right, I'm not thinking of a way to *quite* duplicate this.

One option would be to have a numpy[all] package that just depends on
all the other extras packages -- for the traditional 'extra' cases
this could be autogenerated by setuptools at build time and then be a
regular package after that, and for next-generation build systems that
had first-class support for these [] packages, it would be up to the
build system / project whether to generate such an [all] package and
what to include in it if they did. But that doesn't give you the
special all-except-for-one behavior.
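
For the autogenerated case, the build-time step could be as small as
this sketch. make_all_package() and the dict layout are invented for
illustration (this is not a setuptools API); it just emits metadata for
a BASE[all] package that requires every other extras package.

```python
def make_all_package(base, extras):
    """Build metadata for a hypothetical BASE[all] meta-package."""
    requires = [
        "%s[%s]" % (base, extra)
        for extra in sorted(extras)
        if extra != "all"  # never make [all] depend on itself
    ]
    return {"Name": "%s[all]" % base, "Requires": requires}


print(make_all_package("numpy", ["mkl", "abi-2"]))
```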

The other option that jumps to mind is what Debian calls "recommends",
which act like a soft-dependency: in Debian, if numpy recommends:
numpy[abi-2] and numpy[abi-3], then 'apt-get install numpy' would give
you all three of them by default, just like if numpy required them --
but for recommends: you can also say something like 'apt-get install
numpy -numpy[abi-3]' if you want numpy without the abi-3 package, or
'apt-get install --no-install-recommends numpy' if you want a fully
minimal install, and this is okay because these are only
*recommendations*, not actual requirements. I don't see any fundamental
reasons why we couldn't add something like this to pip, though it's
probably not that urgent.
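
A toy version of the soft-dependency behavior, to show that it's not
much machinery. INDEX and install_set() are hypothetical (pip has no
such feature today), and version handling is left out entirely.

```python
# Hypothetical index: hard requirements plus soft recommendations.
INDEX = {
    "numpy": {"requires": [], "recommends": ["numpy[abi-2]", "numpy[abi-3]"]},
    "numpy[abi-2]": {"requires": ["numpy"], "recommends": []},
    "numpy[abi-3]": {"requires": ["numpy"], "recommends": []},
}


def install_set(requested, excluded=(), follow_recommends=True):
    """Return the sorted list of packages that would be installed."""
    excluded = set(excluded)
    selected = set()
    pending = list(requested)
    while pending:
        name = pending.pop()
        if name in selected:
            continue
        selected.add(name)
        meta = INDEX[name]
        for dep in meta["requires"]:
            # Hard requirements cannot be excluded.
            if dep in excluded:
                raise ValueError("%s requires excluded %s" % (name, dep))
            pending.append(dep)
        if follow_recommends:
            # Recommendations are followed unless the user opted out.
            pending.extend(r for r in meta["recommends"] if r not in excluded)
    return sorted(selected)


print(install_set(["numpy"]))
print(install_set(["numpy"], excluded=["numpy[abi-3]"]))
print(install_set(["numpy"], follow_recommends=False))
```

The three calls correspond to the default install, the
all-except-for-one install, and the fully minimal install.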

My guess is that these two solutions together would pretty much cover
the relevant use cases?

-n

-- 
Nathaniel J. Smith -- http://vorpus.org