wacky idea about reifying extras

Hi all,

I had a wacky idea and I can't tell if it's brilliant, or ridiculous, or both. Which makes sense given that I had a temperature of 39.3 when I thought of it, but even after getting better and turning it over in my mind for a while I still can't tell, so, I figured I'd put it out there and see what y'all thought :-).

The initial motivation was to try and clean up an apparent infelicity in how extras work. After poking at it for a bit, I realized that it might accidentally solve a bunch of other problems, including removing the need for dynamic metadata in sdists (!). If it can be made to work at all...

Motivating problem
------------------

"Extras" are pretty handy: e.g., if you just want the basic ipython REPL, you can do ``pip install ipython``, but if you want the fancy HTML notebook interface that's implemented in the same source code base but has lots of extra heavy dependencies, then you can do ``pip install ipython[notebook]`` instead.

Currently, extras are implemented via dedicated metadata fields inside each package -- when you ``pip install ipython[notebook]``, it downloads ipython-XX.whl, and when it reads the metadata about this package's dependencies and entry points, the ``notebook`` tag flips on some extra dependencies and extra entry points.

This worked pretty well for ``easy_install``'s purposes, since ``easy_install`` mostly cared about installing things; there was no concern for upgrading. But people do want to upgrade, and as ``pip`` gets better at this, extras are going to start causing problems. Hopefully soon, pip will get a proper resolver, and then an ``upgrade-all`` command like other package managers have. Once this happens, it's entirely possible -- indeed, a certainty in the long run -- that if you do::

    $ pip install ipython[notebook]
    # wait a week
    $ pip upgrade-all

you will no longer have the notebook installed, because the new version of ipython added a new dependency to the "notebook" extra, and since there's no record that you ever installed that extra, this new dependency won't be installed when you upgrade. I'm not sure what happens to any entry points or scripts that were part of ipython[notebook] but not ipython -- I'm guessing they'd still be present, but broken?

If you want to upgrade while keeping the notebook around, then ``upgrade-all`` is useless to you; you have to manually keep a list of all packages-with-extras you have installed and explicitly pass them to the upgrade command every time. Which is terrible UX.

Supporting extras in this manner also ramifies complexity through the system: e.g., the specification of entry points becomes more complex because you need a way to make them conditional on particular extras, PEP 426 proposes special mechanisms to allow package A to declare a dependency on extra B of package C, etc. And extras also have minor but annoying limitations, e.g. there's no mechanism provided to store a proper description of what an extra provides and why you might want it.

Solution (?): promoting extras to first-class packages
------------------------------------------------------

There's an obvious possible solution, inspired by how other systems (e.g. Debian) handle this situation: promote ``ipython[notebook]`` to a full-fledged package, that happens to contain no files of its own, but which gets its own dependencies and other metadata. What would this package look like?
Something like::

    Name: ipython[notebook]
    Version: 4.0.0
    Requires-Dist: ipython (= 4.0.0)
    Requires-Dist: extra_dependency_1
    Requires-Dist: extra_dependency_2
    Requires-Dist: ...

    The ``notebook`` extra extends IPython with an HTML interface to...

Installing it needs to automatically trigger the installation of ipython, so it should depend on ipython. It needs to be upgraded in sync with ``ipython``, so this dependency should be an exact version dependency -- that way, upgrading ``ipython`` will (once we have a real resolver!) force an upgrade of ``ipython[notebook]`` and vice-versa. Then of course we also need to include the extra's unique dependencies, and whatever else we want (e.g. a description).

What would need to happen to get there from here? AFAICT a relatively small set of changes would actually suffice:

**PyPI:** starts allowing the upload of wheels named like ``BASE[EXTRA]-VERSION-COMPAT.whl``. They get special handling, though: whoever owns ``BASE`` gets to do whatever they like with names like ``BASE[EXTRA]`` -- that's an established rule -- so wheels following this naming scheme would be treated like other artifacts associated with the (BASE, VERSION) release. In particular, the uploader would need to have write permission to the ``BASE`` name, and it would remain impossible to register top-level distribution names containing square brackets.

**setuptools:** Continues to provide "extra" metadata inside the METADATA file just as it does now (for backwards compatibility with old versions of pip that encounter new packages). In addition, though, the egg-info command would start generating .egg-info directories for each defined extra (according to the schema described above), the bdist_wheel command would start generating a wheel file for each defined extra, etc.

**pip:** Uses a new and different mechanism for looking up packages with extras (a rough sketch of the expansion step follows after this list of changes):

- when asked to fulfill a requirement for ``BASE[EXTRA1,EXTRA2,...] (> X)``, it should expand this to ``BASE[EXTRA1] (> X), BASE[EXTRA2] (> X), ...``, and then attempt to find wheels with those actual names

- backcompat case: if we fail to find a BASE[EXTRA] wheel, then fall back to fetching a wheel named BASE and attempt to use the "extra" metadata inside it to generate BASE[EXTRA], and install this (this is morally similar to the fallback logic where if it can't find foo.whl it tries to generate it from the foo.zip sdist)

- Optionally, PyPI itself could auto-generate these wheels for legacy versions (since they can be generated automatically from static wheel metadata), thus guaranteeing that this path would never be needed, and then pip could disable this fallback path... but I guess it would still need it to handle non-PyPI indexes.

- if this fails, then it falls back to fetching an sdist named BASE (*not* BASE[EXTRA]) and attempting to build it (while making sure to inject a version of setuptools that's recent enough to include the above changes).

**PEP 426:** can delete all the special-case stuff for extras, because they are no longer special cases, and there's no backcompat needed for a format that is not yet in use.

**twine and other workflows:** ``twine upload dist/*`` continues to do the right thing (now including the new extra wheels). Other workflows might need the obvious tweaking to include the new wheel files.
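Here's the rough sketch of that expansion step -- throwaway Python with deliberately naive name parsing, just to make the behaviour concrete (the fallback chain would then try BASE[EXTRA] wheels, then a BASE wheel's legacy extra metadata, then the BASE sdist, in that order)::

    import re

    def expand_requirement(requirement):
        """Split 'BASE[EXTRA1,EXTRA2] (> X)' into ['BASE[EXTRA1] (> X)', ...].

        A requirement without extras is returned unchanged.
        """
        m = re.match(r"^\s*([A-Za-z0-9._-]+)\[([^\]]+)\]\s*(.*)$", requirement)
        if m is None:
            return [requirement]
        base, extras, version_spec = m.group(1), m.group(2), m.group(3).strip()
        expanded = []
        for extra in (e.strip() for e in extras.split(",")):
            req = "%s[%s]" % (base, extra)
            if version_spec:
                req += " " + version_spec
            expanded.append(req)
        return expanded

    print(expand_requirement("ipython[notebook] (> 4.0)"))
    # -> ['ipython[notebook] (> 4.0)']
    print(expand_requirement("BASE[EXTRA1,EXTRA2] (> X)"))
    # -> ['BASE[EXTRA1] (> X)', 'BASE[EXTRA2] (> X)']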
So this seems surprisingly achievable (except for the obvious glaring problem that I missed but someone is about to point out?), would improve correctness in the face of upgrades, would simplify our conceptual models, and would provide a more solid basis for future improvements (e.g. if in the future we add better tracking of which packages were manually installed, then this will automatically apply to extras as well, since they are just packages).

But wait there's more
---------------------

**Non-trivial plugins:** Once we've done this, suddenly extras become much more powerful. Right now it's a hard constraint that extras can only add new dependencies and entry points, not contain any code. But this is rather artificial -- from the user's point of view, 'foo[bar]' just means 'I want a version of foo that has the bar feature'; they don't care whether this requires installing some extra foo-to-bar shim code. With extras as first-class packages, it becomes possible to use this naming scheme for things like plugins or compiled extensions that add actual code.

**Build variants:** People keep asking for us to provide numpy builds against Intel's super-fancy closed-source (but freely redistributable) math library, MKL. It will never be the case that 'pip install numpy' will automatically give you the MKL-ified version, because see above re: "closed source". But we could provide a numpy[mkl]-{VERSION}-{COMPAT}.whl with metadata like::

    Name: numpy[mkl]
    Conflicts: numpy
    Provides: numpy

which acts as a drop-in replacement for the regular numpy for those who explicitly request it via ``pip install numpy[mkl]``. This involves two new concepts on top of the ones above:

Conflicts: is missing from the current metadata standards but (I think?) trivial to implement in any real resolver. It means "I can't be installed at the same time as something else which matches this requirement". In a sense, it's actually an even more primitive concept than a versioned requirement -- Requires: foo (> 2.0) is equivalent to Requires: foo + Conflicts: foo (<= 2.0), but there's no way to expand an arbitrary Conflicts in terms of Requires. (A minor but important wrinkle: the word "else" is important there; you need a special case saying that a package never conflicts with itself. But I think that's the only tricky bit. There's a toy illustration of these semantics below, after the Provides: discussion.)

Provides: is trickier -- there's some vestigial support in the current standards and even in pip, but AFAICT it hasn't really been worked out properly. The semantics are obvious enough (Provides: numpy means that this package counts as being numpy; there are some subtleties around what version of numpy it should count as, but I think that can be worked out), but it opens a can of worms, because you don't want to allow things like::

    Name: numpy
    Provides: django

But once you have the concept of a namespace for multiple distributions from the same project, then you can limit Provides: so that it's only legal if the provider distribution and the provided distribution have the same BASE. This solves the social problem (PyPI knows that numpy[mkl] and numpy are 'owned' by the same people, so this Provides: is OK), and provides algorithmic benefits (if you're trying to find some package that provides foo[extra] out of a flat directory of random distributions, then you only have to examine wheels and sdists that have BASE=foo).
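To pin down the Conflicts: semantics I have in mind, here's the toy illustration promised above (throwaway Python, not a proposal for actual resolver code), including the never-conflicts-with-itself special case, using the numpy[mkl] metadata from the build-variant example -- version matching is hand-waved away::

    def violates_conflicts(candidate, installed):
        """candidate/installed are dicts with 'name', 'provides', 'conflicts'."""
        for other in installed:
            if other["name"] == candidate["name"]:
                # the special case: a package never conflicts with itself
                continue
            other_names = set([other["name"]]) | set(other.get("provides", []))
            cand_names = set([candidate["name"]]) | set(candidate.get("provides", []))
            if other_names & set(candidate.get("conflicts", [])):
                return True
            if cand_names & set(other.get("conflicts", [])):
                return True
        return False

    numpy_mkl = {"name": "numpy[mkl]", "provides": ["numpy"], "conflicts": ["numpy"]}
    plain_numpy = {"name": "numpy", "provides": [], "conflicts": []}

    print(violates_conflicts(numpy_mkl, [plain_numpy]))  # True: can't coexist with plain numpy
    print(violates_conflicts(numpy_mkl, [numpy_mkl]))    # False: no self-conflict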
The other advantage to having the package be ``numpy[mkl]`` instead of ``numpy-mkl`` is that it correctly encodes that the sdist is ``numpy.zip``, not ``numpy-mkl.zip`` -- the rules we derived to match how extras work now are actually exactly what we want here too.

**ABI tracking:** This also solves another use case entirely: the numpy ABI tracking problem (which is probably the single #1 problem the numerical crowd has with current packaging, because it actually prevents us making basic improvements to our code -- the reason I've been making a fuss about other things first is that until now I couldn't figure out any tractable way to solve this problem, but now I have hope). Once you have Provides: and a namespace to use with it, then you can immediately start using "pure virtual" packages to keep track of which ABIs are provided by a single distribution, and determine that these two packages are consistent::

    Name: numpy
    Version: 1.9.2
    Provides: numpy[abi-2]
    Provides: numpy[abi-3]

    Name: scipy
    Depends: numpy
    Depends: numpy[abi-2]

(AFAICT this would actually make pip *better* than conda as far as numpy's needs are concerned.)

The build variants and virtual packages bits also work neatly together. If SciPy wants to provide builds against multiple versions of numpy during the transition period between two ABIs, then these are build variants exactly like numpy[mkl]. For their 0.17.0 release they can upload::

    scipy-0.17.0.zip
    scipy[numpy-abi-2]-0.17.0.whl
    scipy[numpy-abi-3]-0.17.0.whl

(And again, it would be ridiculous to have to register scipy-numpy-abi-2, scipy-numpy-abi-3, etc. on PyPI, and upload separate sdists for each of them. Note that there's nothing magical about the names -- those are just arbitrary tags chosen by the project; what pip would care about is that one of the wheels' metadata says Requires-Dist: numpy[abi-2] and the other says Requires-Dist: numpy[abi-3].)

So far as I can tell, these build variant cases therefore cover *all of the situations that were discussed in the previous thread* as reasons why we can't necessarily provide static metadata for an sdist. The numpy sdist can't statically declare a single set of install dependencies for the resulting wheel... but it could give you a menu, and say that it knows how to build numpy.whl, numpy[mkl].whl, or numpy[external-blas].whl, and tell you what the dependencies will be in each case. (And maybe it's also possible to make numpy[custom] by manually editing some configuration file or whatever, but pip would never be called upon to do this so it doesn't need the static metadata.)

So I think this would be sufficient to let us start providing full static metadata inside sdists?

(Concretely, I imagine that the way this would work is that when we define the new sdist hooks, one of the arguments that pip would pass in when running the build system would be a list of the extras that it's hoping to see, e.g. "the user asked for numpy[mkl], please configure yourself accordingly". For legacy setuptools builds that just use traditional extras, this could safely be ignored.)

TL;DR
-----

If we:

- implement a real resolver, and
- add a notion of a per-project namespace of distribution names, that are collected under the same PyPI registration and come from the same sdist, and
- add Conflicts:, and Provides:,

then we can elegantly solve a collection of important and difficult problems, and we can retroactively pun the old extras system onto the new system in a way that preserves 100% compatibility with all existing packages.

I think?
What do you think?

-n

--
Nathaniel J. Smith -- http://vorpus.org

On October 26, 2015 at 3:36:47 AM, Nathaniel Smith (njs@pobox.com) wrote:
TL;DR
If we:
- implement a real resolver, and
- add a notion of a per-project namespace of distribution names, that are collected under the same PyPI registration and come from the same sdist, and
- add Conflicts:, and Provides:,
then we can elegantly solve a collection of important and difficult problems, and we can retroactively pun the old extras system onto the new system in a way that preserves 100% compatibility with all existing packages.
I think?
What do you think?
My initial reaction when I started reading your idea was that I didn't see a point in having something like foo[bar] be a "real" package when you could just as easily have foo-bar. However, as I continued to read through the idea it started to grow on me. I think I need to let it percolate in my brain a little bit, but there may be a non-crazy (or at least, crazy in a good way) idea here that could push things forward in a nice way.

Some random thoughts:

* Reusing the extra syntax is nice because it doesn't require end users to learn any new concepts; however, we shouldn't take a new syntax off the table either if it makes the feature easier to implement with regards to backwards compatibility. Something like numpy{mkl,some-other-thing} could work just as well too. We'll need to make sure that whatever symbols we choose can be represented on all the major FS we care about and that they are ideally not ugly in a URL too. Of course, the filename and user interface symbols don't *need* to match. It could just as easily expand numpy[mkl] out to numpy#mkl or whatever, which should make it easier to come up with a nice scheme.

* Provides is a bit of an odd duck, I think in my head I've mostly come to terms with allowing unrestricted Provides when you've already installed the package doing the Providing but completely ignoring the field when pulling data from a repository. Our threat model assumes that once you've selected to install something then it's generally safe to trust (though we still do try to limit that). The problem with Provides mostly comes into play when you will respect the Provides: field for any random package on PyPI (or any other repo).

* The upgrade mess around extras as they stand today could also be solved just by recording what extras (if any) were selected to be installed so that we keep a consistent view of the world. Your proposal is essentially doing that, just by (ab)using the fact that by installing a package we essentially get that aspect of it for "free".

* Would this help at all with differentiating between SSE2 and SSE3 builds and things like that? Or does that need something more automatic to be really usable?

* PEP 426 (I think it was?) has some extra syntax for extras which could probably be really nice here, things like numpy[*] to get *all* of the extras (though if they are real packages, what even is "all"?). It also included (though this might have been only in my head) extras that default to installed, which meant you could do something like split numpy into numpy[abi2] and numpy[abi3] packages and have the different ABIs actually contained within those other packages. Then you could have your top level package default to installing abi3 and abi2 so that ``pip install numpy`` is equivalent to ``pip install numpy[abi2,abi3]``. The real power there is that people can trim down their install a bit by then doing ``pip install numpy[-abi2]`` if they don't want to have that on-by-default feature.

I'm going to roll this around in my head some more, and hopefully more people with other interesting problems to solve can chime in and say whether they think this would solve their problems or not. My less-than-immediate reaction now that I'm through the entire thing is that it seems like it could be good and I'm struggling to think of a major downside.

-----------------
Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
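P.S. To illustrate the point that the filename and user interface symbols don't *need* to match, here's a toy reversible mapping from ``numpy[mkl]`` to the ``numpy#mkl``-style spelling mentioned above -- the encoding is purely illustrative, not a proposal for the actual scheme::

    import re

    _EXTRA_RE = re.compile(r"^([A-Za-z0-9._-]+)(?:\[([^\]]+)\])?$")

    def to_filename_form(name):
        """numpy[mkl] -> numpy#mkl (illustrative encoding only)."""
        m = _EXTRA_RE.match(name)
        if m is None:
            raise ValueError("bad distribution name: %r" % (name,))
        base, extra = m.group(1), m.group(2)
        return base if extra is None else "%s#%s" % (base, extra)

    def from_filename_form(encoded):
        base, sep, extra = encoded.partition("#")
        return base if not sep else "%s[%s]" % (base, extra)

    assert to_filename_form("numpy[mkl]") == "numpy#mkl"
    assert from_filename_form("numpy#mkl") == "numpy[mkl]"
    assert to_filename_form("numpy") == "numpy"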

On Mon, Oct 26, 2015 at 4:41 AM, Donald Stufft <donald@stufft.io> wrote:
On October 26, 2015 at 3:36:47 AM, Nathaniel Smith (njs@pobox.com) wrote:
TL;DR
If we:
- implement a real resolver, and
- add a notion of a per-project namespace of distribution names, that are collected under the same PyPI registration and come from the same sdist, and
- add Conflicts:, and Provides:,
then we can elegantly solve a collection of important and difficult problems, and we can retroactively pun the old extras system onto the new system in a way that preserves 100% compatibility with all existing packages.
I think?
What do you think?
My initial reaction when I started reading your idea was that I didn't see a point in having something like foo[bar] be a "real" package when you could just as easily have foo-bar. However, as I continued to read through the idea it started to grow on me. I think I need to let it percolate in my brain a little bit, but there may be a non-crazy (or at least, crazy in a good way) idea here that could push things forward in a nice way.
Oh good, at least I'm not the only one :-). I'd particularly like to hear Robert's thoughts when he has time, since the details depend strongly on some assumptions about how a real resolver would work.
Some random thoughts:
* Reusing the extra syntax is nice because it doesn't require end users to learn any new concepts; however, we shouldn't take a new syntax off the table either if it makes the feature easier to implement with regards to backwards compatibility. Something like numpy{mkl,some-other-thing} could work just as well too. We'll need to make sure that whatever symbols we choose can be represented on all the major FS we care about and that they are ideally not ugly in a URL too. Of course, the filename and user interface symbols don't *need* to match. It could just as easily expand numpy[mkl] out to numpy#mkl or whatever, which should make it easier to come up with a nice scheme.
Right -- obviously it would be *nice* to keep the number of concepts down, and to avoid skew between filenames and user interface (because users do see filenames), but if these turn out to be impossible then there are still options that would let us save the per-project-package-namespace idea.
* Provides is a bit of an odd duck, I think in my head I've mostly come to terms with allowing unrestricted Provides when you've already installed the package doing the Providing but completely ignoring the field when pulling data from a repository. Our threat model assumes that once you've selected to install something then it's generally safe to trust (though we still do try to limit that). The problem with Provides mostly comes into play when you will respect the Provides: field for any random package on PyPI (or any other repo).
Yeah, I'm actually not too worried about malicious use either in practice, for the reason you say. But even so I can think of two good reasons we might want to be careful about stating exactly when "Provides:" can be trusted:

1) if you have neither scipy nor numpy installed, and you do 'pip install scipy', and scipy depends on the pure virtual package 'numpy[abi-2]' which is only available as a Provides: on the concrete package 'numpy', then in this case the resolver has to take Provides: into account when pulling data from the repo -- if it doesn't, then it'll ignore the Provides: on 'numpy' and say that scipy's dependencies can't be satisfied. So for this use case to work, we actually do need to be able to sometimes trust Provides: fields.

2) the basic idea of a resolver is that it considers a whole bunch of possible configurations for your environment, and picks the configuration that seems best. But if we pay attention to different metadata when installing as compared to after installation, then this skew makes it possible for the algorithm to pick a configuration that looks good a priori but is broken after installation. E.g. for a simple case:

    Name: a
    Conflicts: some-virtual-package

    Name: b
    Provides: some-virtual-package

'pip install a b' will work, because the resolver ignores the Provides: and treats the packages as non-conflicting -- but then once installed we have a broken system. This is obviously an artificial example, but creating the possibility of such messes just seems like the kind of headache we don't need. So I think whatever we do with Provides:, we should do the same thing both before and after installation.

A simple safe rule is to say that Provides: is always legal iff a package's Name: and Provides: have a matching BASE, and always illegal otherwise, making the package just invalid, like if the METADATA were written in Shift-JIS or something. This rule is trivial to statically check/enforce, and could always be relaxed more later.
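In code, the check I'm imagining really is about that trivial -- a sketch only, with hand-waved name parsing::

    import re

    _NAME_RE = re.compile(r"^([A-Za-z0-9._-]+)(?:\[[^\]]+\])?$")

    def base_of(name):
        """BASE of 'numpy[abi-2]' is 'numpy'; BASE of 'numpy' is 'numpy'."""
        m = _NAME_RE.match(name)
        if m is None:
            raise ValueError("not a valid distribution name: %r" % (name,))
        return m.group(1)

    def provides_is_legal(package_name, provided_name):
        """Provides: is legal iff the two names share the same BASE."""
        return base_of(package_name) == base_of(provided_name)

    assert provides_is_legal("numpy", "numpy[abi-2]")    # fine: same BASE
    assert provides_is_legal("numpy[mkl]", "numpy")      # fine: same BASE
    assert not provides_is_legal("numpy", "django")      # rejected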
* The upgrade mess around extras as they stand today could also be solved just by recording what extras (if any) were selected to be installed so that we keep a consistent view of the world. Your proposal is essentially doing that, just by (ab)using the fact that by installing a package we essentially get that aspect of it for "free".
Right -- you certainly could implement a database of installed extras to go alongside the database of installed packages, but it seems like it just makes things more complicated with minimal benefit. E.g., you have to add special case code to the resolver to check both databases, and then you have to add more special case code to 'pip freeze' to make sure *it* checks both databases... this kind of stuff adds up.
* Would this help at all with differentiating between SSE2 and SSE3 builds and things like that? Or does that need something more automatic to be really usable?
I'm not convinced that SSE2 versus SSE3 is really worth trying to handle automatically, just because we have more urgent issues and everyone else in the world seems to get by okay without special support for this in their package system (even if it's not always optimal). But if we did want to do this then my intuition is that it'd be more elegant to do it via the wheel platform/architecture field, since this actually is a difference in architectures? So you could have one wheel for the "win32" platform and another wheel for the "win32sse3" platform, and the code in the installer that figures out which wheels are compatible would know that both of these are compatible with the machine it was running on (or not), and that win32sse3 is preferable to plain win32.
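As a toy illustration of that preference ordering (the ``win32sse3`` tag is invented for this example -- it is not a real platform tag today)::

    # Ordered list of platform tags this machine supports, most preferred first.
    SUPPORTED_PLATFORMS = ["win32sse3", "win32"]

    def pick_platform(candidate_platforms):
        """Return the most-preferred compatible platform tag, or None."""
        compatible = [t for t in candidate_platforms if t in SUPPORTED_PLATFORMS]
        if not compatible:
            return None
        return min(compatible, key=SUPPORTED_PLATFORMS.index)

    print(pick_platform(["win32", "win32sse3"]))  # -> 'win32sse3'
    print(pick_platform(["win32"]))               # -> 'win32'
    print(pick_platform(["linux_x86_64"]))        # -> None (no compatible wheel)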
* PEP 426 (I think it was?) has some extra syntax for extras which could probably be really nice here, things like numpy[*] to get *all* of the extras (though if they are real packages, what even is "all"?). It also included (though this might have been only in my head) extras that default to installed, which meant you could do something like split numpy into numpy[abi2] and numpy[abi3] packages and have the different ABIs actually contained within those other packages. Then you could have your top level package default to installing abi3 and abi2 so that ``pip install numpy`` is equivalent to ``pip install numpy[abi2,abi3]``. The real power there is that people can trim down their install a bit by then doing ``pip install numpy[-abi2]`` if they don't want to have that on-by-default feature.
Hmm, right, I'm not thinking of a way to *quite* duplicate this.

One option would be to have a numpy[all] package that just depends on all the other extras packages -- for the traditional 'extra' cases this could be autogenerated by setuptools at build time and then be a regular package after that, and for next-generation build systems that had first-class support for these [] packages, it would be up to the build system / project whether to generate such an [all] package and what to include in it if they did. But that doesn't give you the special all-except-for-one behavior.

The other option that jumps to mind is what Debian calls "recommends", which acts like a soft dependency: in Debian, if numpy recommends: numpy[abi-2] and numpy[abi-3], then 'apt-get install numpy' would give you all three of them by default, just like if numpy required them -- but for recommends: you can also say something like 'apt-get install numpy -numpy[abi-3]' if you want numpy without the abi-3 package, or 'apt-get install --no-install-recommends numpy' if you want a fully minimal install, and this is okay because these are only *recommendations*, not actual requirements. I don't see any fundamental reasons why we couldn't add something like this to pip, though it's probably not that urgent.

My guess is that these two solutions together would pretty much cover the relevant use cases?

-n

--
Nathaniel J. Smith -- http://vorpus.org

On Mon, Oct 26, 2015 at 11:41 PM, Nathaniel Smith <njs@pobox.com> wrote:
On October 26, 2015 at 3:36:47 AM, Nathaniel Smith (njs@pobox.com) wrote:
TL;DR
If we:
- implement a real resolver, and
- add a notion of a per-project namespace of distribution names, that are collected under the same PyPI registration and come from the same sdist, and
- add Conflicts:, and Provides:,
then we can elegantly solve a collection of important and difficult problems, and we can retroactively pun the old extras system onto the new system in a way that preserves 100% compatibility with all existing packages.
I think?
What do you think?
On Mon, Oct 26, 2015 at 4:41 AM, Donald Stufft <donald@stufft.io> wrote:
My initial reaction when I started reading your idea was that I didn't see a point in having something like foo[bar] be a "real" package when you could just as easily have foo-bar. However, as I continued to read through the idea it started to grow on me. I think I need to let it percolate in my brain a little bit, but there may be a non-crazy (or at least, crazy in a good way) idea here that could push things forward in a nice way.
Oh good, at least I'm not the only one :-).
I'd particularly like to hear Robert's thoughts when he has time, since the details depend strongly on some assumptions about how a real resolver would work.
Some random thoughts:
* Reusing the extra syntax is nice because it doesn't require end users to learn any new concepts; however, we shouldn't take a new syntax off the table either if it makes the feature easier to implement with regards to backwards compatibility. Something like numpy{mkl,some-other-thing} could work just as well too. We'll need to make sure that whatever symbols we choose can be represented on all the major FS we care about and that they are ideally not ugly in a URL too. Of course, the filename and user interface symbols don't *need* to match. It could just as easily expand numpy[mkl] out to numpy#mkl or whatever, which should make it easier to come up with a nice scheme.
Right -- obviously it would be *nice* to keep the number of concepts down, and to avoid skew between filenames and user interface (because users do see filenames), but if these turn out to be impossible then there are still options that would let us save the per-project-package-namespace idea.
* Provides is a bit of an odd duck, I think in my head I've mostly come to terms with allowing unrestricted Provides when you've already installed the package doing the Providing but completely ignoring the field when pulling data from a repository. Our threat model assumes that once you've selected to install something then it's generally safe to trust (though we still do try to limit that). The problem with Provides mostly comes into play when you will respect the Provides: field for any random package on PyPI (or any other repo).
Yeah, I'm actually not too worried about malicious use either in practice, for the reason you say. But even so I can think of two good reasons we might want to be careful about stating exactly when "Provides:" can be trusted:
1) if you have neither scipy nor numpy installed, and you do 'pip install scipy', and scipy depends on the pure virtual package 'numpy[abi-2]' which is only available as a Provides: on the concrete package 'numpy', then in this case the resolver has to take Provides: into account when pulling data from the repo -- if it doesn't, then it'll ignore the Provides: on 'numpy' and say that scipy's dependencies can't be satisfied. So for this use case to work, we actually do need to be able to sometimes trust Provides: fields.
2) the basic idea of a resolver is that it considers a whole bunch of possible configurations for your environment, and picks the configuration that seems best. But if we pay attention to different metadata when installing as compared to after installation, then this skew makes it possible for the algorithm to pick a configuration that looks good a priori but is broken after installation. E.g. for a simple case:
Name: a
Conflicts: some-virtual-package

Name: b
Provides: some-virtual-package
'pip install a b' will work, because the resolver ignores the Provides: and treats the packages as non-conflicting -- but then once installed we have a broken system. This is obviously an artificial example, but creating the possibility of such messes just seems like the kind of headache we don't need. So I think whatever we do with Provides:, we should do the same thing both before and after installation.
Another simple solution for this particular case is to add conflict rules between packages that provide the same requirement (that's what PHP's composer does IIRC).

The case of safety against malicious forks is handled quite explicitly in composer, we may want to look at how they do it when considering solutions (e.g. https://github.com/composer/composer/issues/2690, though it has changed a bit since then).

Adding the provides/conflicts concepts to the pip resolver will complicate it quite significantly, both in terms of running time complexity (since at that point you are solving an NP-complete problem) and in terms of implementation. But we also know this is doable for real cases, even in pure Python (composer handles all the cases you are mentioning, and is in pure PHP).

David
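P.S. A minimal sketch of the rule above -- any two distinct packages that provide the same name are treated as mutually conflicting -- purely as an illustration, not composer's (or pip's) actual code::

    from itertools import combinations

    def implicit_conflicts(packages):
        """packages: dicts with 'name' and 'provides' (a list of names).

        Returns a set of frozensets {name_a, name_b} that must not be
        co-installed because they provide the same thing.
        """
        conflicts = set()
        for a, b in combinations(packages, 2):
            provided_a = set([a["name"]]) | set(a.get("provides", []))
            provided_b = set([b["name"]]) | set(b.get("provides", []))
            if a["name"] != b["name"] and provided_a & provided_b:
                conflicts.add(frozenset([a["name"], b["name"]]))
        return conflicts

    pkgs = [
        {"name": "numpy", "provides": []},
        {"name": "numpy[mkl]", "provides": ["numpy"]},
        {"name": "b", "provides": ["some-virtual-package"]},
    ]
    print(implicit_conflicts(pkgs))  # {frozenset({'numpy', 'numpy[mkl]'})}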
A simple safe rule is to say that Provides: is always legal iff a package's Name: and Provides: have a matching BASE, and always illegal otherwise, making the package just invalid, like if the METADATA were written in Shift-JIS or something. This rule is trivial to statically check/enforce, and could always be relaxed more later.
* The upgrade mess around extras as they stand today could also be solved just by recording what extras (if any) were selected to be installed so that we keep a consistent view of the world. Your proposal is essentially doing that, just by (ab)using the fact that by installing a package we essentially get that aspect of it for "free".
Right -- you certainly could implement a database of installed extras to go alongside the database of installed packages, but it seems like it just makes things more complicated with minimal benefit. E.g., you have to add special case code to the resolver to check both databases, and then you have to add more special case code to 'pip freeze' to make sure *it* checks both databases... this kind of stuff adds up.
* Would this help at all with differentiating between SSE2 and SSE3 builds and things like that? Or does that need something more automatic to be really usable?
I'm not convinced that SSE2 versus SSE3 is really worth trying to handle automatically, just because we have more urgent issues and everyone else in the world seems to get by okay without special support for this in their package system (even if it's not always optimal). But if we did want to do this then my intuition is that it'd be more elegant to do it via the wheel platform/architecture field, since this actually is a difference in architectures? So you could have one wheel for the "win32" platform and another wheel for the "win32sse3" platform, and the code in the installer that figures out which wheels are compatible would know that both of these are compatible with the machine it was running on (or not), and that win32sse3 is preferable to plain win32.
* PEP 426 (I think it was?) has some extra syntax for extras which could probably be really nice here, things like numpy[*] to get *all* of the extras (though if they are real packages, what even is "all"?). It also included (though this might have been only in my head) extras that default to installed, which meant you could do something like split numpy into numpy[abi2] and numpy[abi3] packages and have the different ABIs actually contained within those other packages. Then you could have your top level package default to installing abi3 and abi2 so that ``pip install numpy`` is equivalent to ``pip install numpy[abi2,abi3]``. The real power there is that people can trim down their install a bit by then doing ``pip install numpy[-abi2]`` if they don't want to have that on-by-default feature.
Hmm, right, I'm not thinking of a way to *quite* duplicate this.
One option would be to have a numpy[all] package that just depends on all the other extras packages -- for the traditional 'extra' cases this could be autogenerated by setuptools at build time and then be a regular package after that, and for next-generation build systems that had first-class support for these [] packages, it would be up to the build system / project whether to generate such an [all] package and what to include in it if they did. But that doesn't give you the special all-except-for-one behavior.
The other option that jumps to mind is what Debian calls "recommends", which acts like a soft dependency: in Debian, if numpy recommends: numpy[abi-2] and numpy[abi-3], then 'apt-get install numpy' would give you all three of them by default, just like if numpy required them -- but for recommends: you can also say something like 'apt-get install numpy -numpy[abi-3]' if you want numpy without the abi-3 package, or 'apt-get install --no-install-recommends numpy' if you want a fully minimal install, and this is okay because these are only *recommendations*, not actual requirements. I don't see any fundamental reasons why we couldn't add something like this to pip, though it's probably not that urgent.
My guess is that these two solutions together would pretty much cover the relevant use cases?
-n
--
Nathaniel J. Smith -- http://vorpus.org

On 27 October 2015 at 21:47, David Cournapeau <cournape@gmail.com> wrote:
Another simple solution for this particular case is to add conflict rules between packages that provide the same requirement (that's what php's composer do IIRC).
The case of safety against malicious forks is handled quite explicitly in composer, we may want to look at how they do it when considering solutions (e.g. https://github.com/composer/composer/issues/2690, though it has changed a bit since then)
Adding the provides/conflict concepts to pip resolver will complexify it quite significantly, both in terms of running time complexity (since at that point you are solving a NP-complete problem) and in terms of implementation. But we also know for real cases this is doable, even in pure python (composer handles all the cases you are mentioning, and is in pure php).
We already require a full NP-complete solver because of the <, <=, and ~ version operators.

I haven't absorbed this proposal enough to comment on the reification aspect yet.

I'm worried about provides and conflicts in general, but not from a resolver code perspective - that's a one-ish-time cost - but instead from a user experience perspective.

-Rob

--
Robert Collins <rbtcollins@hp.com>
Distinguished Technologist
HP Converged Cloud

On Tue, 27 Oct 2015 at 02:17 Robert Collins <robertc@robertcollins.net> wrote:
On 27 October 2015 at 21:47, David Cournapeau <cournape@gmail.com> wrote:
Another simple solution for this particular case is to add conflict rules between packages that provide the same requirement (that's what PHP's composer does IIRC).
The case of safety against malicious forks is handled quite explicitly in composer, we may want to look at how they do it when considering solutions (e.g. https://github.com/composer/composer/issues/2690, though it has changed a bit since then)
Adding the provides/conflicts concepts to the pip resolver will complicate it quite significantly, both in terms of running time complexity (since at that point you are solving an NP-complete problem) and in terms of implementation. But we also know this is doable for real cases, even in pure Python (composer handles all the cases you are mentioning, and is in pure PHP).
We already require a full NP-complete solver because of the <, <=, and ~ version operators.
I haven't absorbed this proposal enough to comment on the reification aspect yet.
I'm worried about provides and conflicts in general, but not from a resolver code perspective - that's a one-ish-time cost - but instead from a user experience perspective.
So from my perspective as someone who (I think) grasps what problems everyone is trying to solve, but who doesn't know enough to know how stuff is done now (all my projects on PyPI are pure Python), Nathaniel's proposal makes total sense to me. I would think it would be easy to explain to a scientist that "to get numpy, run `python3.5 -m pip install numpy`, but if you want fast over open source and use Intel's MKL library, do `python3.5 -m pip install numpy[mkl]`". I think the syntax clearly shows it's a modification/tweak/special version of numpy, and it makes sense that I want to install something that provides numpy while relying on MKL.

Nathaniel's comment about how this might actually give pip a leg up on conda also sounds nice to me, as I have enough worry about having a fissure in 1D along the Python 2/3 line, and I'm constantly worried that the scientific community is going to riot and make it a 2D fissure along the Python 2/3 and pip/conda axes and split effort, documentation, etc.

On Tue, Oct 27, 2015 at 5:45 PM, Brett Cannon <brett@python.org> wrote:
Nathaniel's comment about how this might actually give pip a leg up on conda also sounds nice to me as I have enough worry about having a fissure in 1D along the Python 2/3 line, and I'm constantly worried that the scientific community is going to riot and make it a 2D fissure along Python 2/3, pip/conda axes and split effort, documentation, etc.
If it helps you sleep: I'm confident that no one is planning this particular riot. It takes little work to support pip and conda - the hard issues are mostly with building, not installing.

Smaller riots, like breaking ``python setup.py install`` and recommending ``pip install .`` instead [1], are in the cards though :)

Ralf

[1] http://article.gmane.org/gmane.comp.python.numeric.general/61757

On 29 Oct 2015 00:31, "Ralf Gommers" <ralf.gommers@gmail.com> wrote:
On Tue, Oct 27, 2015 at 5:45 PM, Brett Cannon <brett@python.org> wrote:
Nathaniel's comment about how this might actually give pip a leg up on
conda also sounds nice to me as I have enough worry about having a fissure in 1D along the Python 2/3 line, and I'm constantly worried that the scientific community is going to riot and make it a 2D fissure along Python 2/3, pip/conda axes and split effort, documentation, etc.
If it helps you sleep: I'm confident that no one is planning this
particular riot. It takes little work to support pip and conda - the hard issues are mostly with building, not installing.

Last time I checked "pip in a conda env" was also pretty well behaved, so conda seems to be settling in fairly well to being a per-user cross platform alternative to apt, yum/dnf, homebrew, nix, etc, rather than tackling the same Python specific niche as pip & virtualenv.
Smaller riots, like breaking ``python setup.py install`` and recommending ``pip install .`` instead [1], are in the cards though :)
Given the PyPA panel at PyCon US a few years ago ended up being subtitled "'./setup.py install' must die", I'd be surprised if that provoked a riot. I guess even if it does, you'll have plenty of folks prepared to help with crowd control :)

Cheers,
Nick.

On Wed, Oct 28, 2015 at 5:40 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 29 Oct 2015 00:31, "Ralf Gommers" <ralf.gommers@gmail.com> wrote:
On Tue, Oct 27, 2015 at 5:45 PM, Brett Cannon <brett@python.org> wrote:
Nathaniel's comment about how this might actually give pip a leg up on conda also sounds nice to me as I have enough worry about having a fissure in 1D along the Python 2/3 line, and I'm constantly worried that the scientific community is going to riot and make it a 2D fissure along Python 2/3, pip/conda axes and split effort, documentation, etc.
If it helps you sleep: I'm confident that no one is planning this particular riot. It takes little work to support pip and conda - the hard issues are mostly with building, not installing.
Last time I checked "pip in a conda env" was also pretty well behaved, so conda seems to be settling in fairly well to being a per-user cross platform alternative to apt, yum/dnf, homebrew, nix, etc, rather than tackling the same Python specific niche as pip & virtualenv.
I'm not confident I understand all the details of how conda works these days, but AFAIK using pip in a conda env is pretty much the equivalent of using 'sudo pip' in RH/Fedora, i.e. it will happily stomp all over the same files that the package manager thinks it is managing, and probably you will get away with it (until you don't). I could be wrong, though.
Smaller riots, like breaking ``python setup.py install`` and recommending ``pip install .`` instead [1], are in the cards though :)
Given the PyPA panel at PyCon US a few years ago ended up being subtitled "'./setup.py install' must die", I'd be surprised if that provoked a riot. I guess even if it does, you'll have plenty of folks prepared to help with crowd control :)
I think the biggest pushback so far is from Debian, b/c Debian has standard distro-wide build scripts for python packages that have 'setup.py install' baked in, and there is perhaps some political delicacy to convincing them they should have 'pip install' baked in instead. In the short run I'm guessing we'll end up placating them by giving them an override envvar that lets them keep using setup.py install, but in the longer run this might be a good place to consider directing some crowd control / persuasion.

-n

--
Nathaniel J. Smith -- http://vorpus.org

On Wed, Oct 28, 2015 at 4:30 PM, Ralf Gommers <ralf.gommers@gmail.com> wrote:
On Tue, Oct 27, 2015 at 5:45 PM, Brett Cannon <brett@python.org> wrote:
Nathaniel's comment about how this might actually give pip a leg up on conda also sounds nice to me as I have enough worry about having a fissure in 1D along the Python 2/3 line, and I'm constantly worried that the scientific community is going to riot and make it a 2D fissure along Python 2/3, pip/conda axes and split effort, documentation, etc.
If it helps you sleep: I'm confident that no one is planning this particular riot. It takes little work to support pip and conda - the hard issues are mostly with building, not installing.
Well.... I wouldn't say "no one". You weren't there at the NumPy BoF at SciPy this year, where a substantial portion of the room started calling for exactly this, and I felt pretty alone up front trying to squash it almost singlehandedly. It was a bit awkward actually!

The argument for numpy dropping pip support is actually somewhat compelling. It goes like this: conda users don't care if numpy breaks ABI, because conda already enforces that numpy-C-API-using packages have to be recompiled every time a new numpy release comes out. Therefore, if we only supported conda, then we would be free to break ABI and clean up some of the 20-year-old broken junk that we have lying around and add new features more quickly. Conclusion: continuing to support pip is hobbling innovation in the whole numerical ecosystem.

IMO this is not compelling *enough* to cut off our many many users who are not using conda, plus a schism like this would have all kinds of knock-on costs (the value of a community grows like O(n**2), so splitting a community is expensive!). And given that you and I are both on the list of gatekeepers to such a change, yeah, it's not going to happen in the immediate future.

But... if conda continues to gain mindshare at pip's expense, and they fix some of the more controversial sticking points (e.g. the current reliance on secret proprietary build recipes), and the pip/distutils side of things continues to stagnate WRT things like this... I dunno, I could imagine that argument becoming more and more compelling over the next few years. At that point I'm honestly not sure what happens, but I suspect that all the options are unpleasant. You and I have a fair amount of political capital, but it is finite. ...Or maybe I'm worrying over nothing and everything would be fine, but still, it'd be nice if we never have to find out because pip etc. get better enough that the issue goes away.

What I'm saying is, it's not a coincidence that it was after SciPy this year that I finally subscribed to distutils-sig :-).

-n

--
Nathaniel J. Smith -- http://vorpus.org

On 29 October 2015 at 01:16, Nathaniel Smith <njs@pobox.com> wrote:
Well.... I wouldn't say "no one". You weren't there at the NumPy BoF at SciPy this year, where a substantial portion of the room started calling for exactly this, and I felt pretty alone up front trying to squash it almost singlehandedly. It was a bit awkward actually! [...] What I'm saying is, it's not a coincidence that it was after SciPy this year that I finally subscribed to distutils-sig :-).
Ouch. This is a scenario that I (as a casual numpy user with no particular interest in conda) am definitely concerned about - so thanks for fighting that fight on my behalf (and that of many others, I'm sure).

Paul

On Thu, Oct 29, 2015 at 2:16 AM, Nathaniel Smith <njs@pobox.com> wrote:
On Wed, Oct 28, 2015 at 4:30 PM, Ralf Gommers <ralf.gommers@gmail.com> wrote:
On Tue, Oct 27, 2015 at 5:45 PM, Brett Cannon <brett@python.org> wrote:
Nathaniel's comment about how this might actually give pip a leg up on conda also sounds nice to me as I have enough worry about having a fissure in 1D along the Python 2/3 line, and I'm constantly worried that the scientific community is going to riot and make it a 2D fissure along Python 2/3, pip/conda axes and split effort, documentation, etc.
If it helps you sleep: I'm confident that no one is planning this particular riot. It takes little work to support pip and conda - the hard issues are mostly with building, not installing.
Well.... I wouldn't say "no one". You weren't there at the NumPy BoF at SciPy this year, where a substantial portion of the room started calling for exactly this, and I felt pretty alone up front trying to squash it almost singlehandedly. It was a bit awkward actually!
Hmm, guess I missed something. Still confident that it won't happen, because (a) it doesn't make too much sense to me, and (b) there's probably little overlap between the people that want that and the people that do the actual build/packaging maintenance work (outside of conda people themselves).
The argument for numpy dropping pip support is actually somewhat compelling. It goes like this: conda users don't care if numpy breaks ABI, because conda already enforces that numpy-C-API-using-packages have to be recompiled every time a new numpy release comes out. Therefore, if we only supported conda, then we would be free to break ABI and clean up some of the 20 year old broken junk that we have lying around and add new features more quickly. Conclusion: continuing to support pip is hobbling innovation in the whole numerical ecosystem.
IMO this is not compelling *enough* to cut off our many many users who are not using conda,
Agreed. It's also not like those are the only options. If breaking ABI became so valuable that it needed to be done, I'd rather put the burden of that on packagers of projects that rely on numpy (who would have to create lots of new installers) than on users that expect "pip install" to work.

Ralf
plus a schism like this would have all kinds of knock-on costs (the value of a community grows like O(n**2), so splitting a community is expensive!). And given that you and I are both on the list of gatekeepers to such a change, yeah, it's not going to happen in the immediate future.
But... if conda continues to gain mindshare at pip's expense, and they fix some of the more controversial sticking points (e.g. the current reliance on secret proprietary build recipes), and the pip/distutils side of things continues to stagnate WRT things like this... I dunno, I could imagine that argument becoming more and more compelling over the next few years. At that point I'm honestly not sure what happens, but I suspect that all the options are unpleasant. You and I have a fair amount of political capital, but it is finite. ...Or maybe I'm worrying over nothing and everything would be fine, but still, it'd be nice if we never have to find out because pip etc. get better enough that the issue goes away.
What I'm saying is, it's not a coincidence that it was after SciPy this year that I finally subscribed to distutils-sig :-).
-n
-- Nathaniel J. Smith -- http://vorpus.org
participants (8)

- Brett Cannon
- David Cournapeau
- Donald Stufft
- Nathaniel Smith
- Nick Coghlan
- Paul Moore
- Ralf Gommers
- Robert Collins