So, I am working on pip issue 988: pip doesn't resolve packages at all.
This is O(alternatives_per_package^packages): if you are resolving 10
packages with 10 versions each, there are approximately 10^10 or 10G
combinations. 10 packages with 100 versions each - 100^10, or 10^20.
So - it's going to depend pretty heavily on some good heuristics in
whatever final algorithm makes its way in, but the problem is
exacerbated by PyPI's nature.
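The counting here is one independent version choice per package; a quick sketch of the arithmetic:

```python
# Each of `packages` packages independently picks one of `alternatives`
# versions, so the search space is alternatives ** packages candidate sets.
def candidate_sets(packages, alternatives):
    return alternatives ** packages

print(candidate_sets(10, 10))   # 10**10, the ~10G figure above
print(candidate_sets(10, 100))  # 100**10 == 10**20
```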
Most Linux distributions (all that I'm aware of) have at most 5
versions of a package to consider at any time - installed (might be
None), current release, current release's security updates, the new
release being upgraded to, and that new release's security updates.
And their common worst case is actually 2 versions: installed ==
current release, plus one new release present. They map alternatives
out into separate packages (e.g. when an older soname is deliberately
kept across an ABI incompatibility, you end up with 2 packages, not 2
versions of one package). So, comparing pip's challenge to apt's:
apt has ~20-30K packages, with alternatives ~= 2, while
pip has ~60K packages, with alternatives ~= 5.7 (I asked dstufft).
Scaling the number of packages is relatively easy; scaling the number
of alternatives is harder. Even 300 packages (the dependency tree for
OpenStack) is ~2.4T combinations to probe.
I wonder if it makes sense to give some back-pressure to people, or at
the very least encourage them to remove distributions that:
- they don't support anymore
- have security holes
If folk consider PyPI a sort of historical archive, then perhaps we
could have a feature for authors to mark which versions they
'support', and allow a query parameter to ask for all the versions.
-Rob
--
Robert Collins
One thing to consider is that if conflicts do not exist (or are very rare),
the number of possible combinations is a moot point. A greedy algorithm
for installation (which just chooses the most favored package to resolve
each dependency) will run in linear time with the number of packages it
would install, if no conflicts exist.
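A minimal sketch of that greedy strategy (the index format here is invented for illustration, not pip's data model):

```python
# Toy index: name -> {version: [(dep_name, allowed_version_set), ...]}.
# Greedy resolution picks the most-favored (here: highest) allowed version
# for each requirement and never revisits a choice, so with no conflicts
# the work is linear in the number of packages installed.
INDEX = {
    "app": {2: [("lib", {1, 2})]},
    "lib": {1: [], 2: []},
}

def greedy_resolve(root):
    chosen = {}
    stack = [(root, set(INDEX[root]))]
    while stack:
        name, allowed = stack.pop()
        if name in chosen:
            if chosen[name] not in allowed:
                raise RuntimeError("conflict on %s" % name)  # greedy gives up here
            continue
        version = max(allowed & set(INDEX[name]))
        chosen[name] = version
        stack.extend(INDEX[name][version])
    return chosen

print(greedy_resolve("app"))  # {'app': 2, 'lib': 2}
```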
So, what you are saying about state exploration may be true for a resolver
that uses something like a SAT solver, but doesn't apply to backtracking
dependency resolution (unless a huge number of conflicts occur) or simple
dependency resolution (at all). SAT solvers do have heuristics to avoid
this blow up, except in pathological cases. However, simple / backtracking
dependency resolution systems have the further advantage of not needing to
request unneeded metadata in the first place...
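For intuition, a toy backtracking search along those lines (hypothetical index shape; only the search skeleton, with no real metadata fetching):

```python
# Toy backtracking resolver: depth-first over version choices, undoing a
# choice only when a later requirement conflicts with it. With no
# conflicts it never backtracks; with many, it can visit the whole space.
# index: name -> {version: {dep_name: allowed_version_set}}
def backtrack(index, pending, chosen):
    if not pending:
        return chosen
    name, allowed = pending[0]
    if name in chosen:
        if chosen[name] in allowed:
            return backtrack(index, pending[1:], chosen)
        return None  # conflict with an earlier choice: forces backtracking
    for version in sorted(allowed & set(index[name]), reverse=True):
        deps = list(index[name][version].items())
        result = backtrack(index, deps + pending[1:], {**chosen, name: version})
        if result is not None:
            return result
    return None

INDEX = {
    "a": {1: {"c": {1}}, 2: {"c": {2}}},
    "b": {1: {"c": {1}}},
    "c": {1: {}, 2: {}},
}
# a==2 forces c==2, which conflicts with b's need for c==1, so the
# search retreats and settles on a==1.
print(backtrack(INDEX, [("a", {1, 2}), ("b", {1})], {}))  # {'a': 1, 'c': 1, 'b': 1}
```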
Thanks,
Justin
On Fri, May 15, 2015 at 2:57 PM, Robert Collins wrote:
So, I am working on pip issue 988: pip doesn't resolve packages at all. ...
--
Robert Collins
Distinguished Technologist
HP Converged Cloud
_______________________________________________
Distutils-SIG maillist - Distutils-SIG@python.org
https://mail.python.org/mailman/listinfo/distutils-sig
On 16 May 2015 at 07:18, Justin Cappos wrote:
One thing to consider is that if conflicts do not exist (or are very rare), the number of possible combinations is a moot point. ...
Your intuition here is misleading, sorry :(.
You're right that 'hey, if everything fits it's linear' - but the
reason we have this bug open, with people adding a new example of it
every week or so (as someone who didn't realise that pip doesn't
resolve finds out and adds it to pip's issue tracker, only to have it
marked as a dupe), is that things often don't fit.
Backtracking recursive resolvers have exactly the same O as SAT.
Example: say I have an ecosystem of 10 packages, A-J, and they do a
release every 6 months that is guaranteed to work together, but every
time some issue occurs which ends up clamping the group together -
e.g. an external release breaks API, so A1's deps are disjoint with
A2's, and then the same happens between A2 and A3. Even though A1's
API is compatible with B2's: it's not internal bad code, it just takes
*one* external dep breaking its API.
After 2 releases you have 2^10 (~1000) combinations, but only the 2
coordinated generations are valid at all. That's ~0.2%. 8 releases
gets you 8^10 (~10^9) combinations with 8 valid, or about 0.0000007%.
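Modelling "only whole coordinated generations co-install", the ratio is straightforward to compute (a toy model of the scenario, not measured data):

```python
# Counting the "clamped releases" scenario: every package independently
# offers one version per release generation, but in this model only whole
# generations actually co-install.
def clamp_stats(packages, releases):
    total = releases ** packages  # all version mixes
    valid = releases              # one coherent set per generation
    return total, valid

total, valid = clamp_stats(10, 8)
print(total, valid)                        # 1073741824 8
print("%.7f%%" % (100.0 * valid / total))  # 0.0000007%
```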
Now there are two things to examine here: how likely is this to happen
to PyPI users, and can a backtracker (which, btw, is what my code is)
handle this better than a SAT solver?
In terms of likelihood - OpenStack hits this every release. It's not
that our libraries are incompatible with each other; it's that given
250 packages (the 200 in the error I quoted just shows that the
resolver hadn't obtained version data for everything), *something*
breaks API compat in each 6 month release cycle, and so you end up
with the whole set effectively locking together. In fact, it has
happened so consistently to OpenStack that we now release our
libraries with closed specifiers: >=min_version,
Example: say I have an ecosystem of 10 packages. A-J. ...
Yes, so this would not be a situation where "conflicts do not exist (or are very rare)" as my post mentioned. Is this rate of conflicts something you measured or is it a value you made up?

I don't hear anyone arguing that the status quo makes sense. I think we're mostly just chatting about the right thing to optimize the solution for and what sorts of short cuts may be useful (or even necessary). Since we can measure the actual conflict and other values in practice, data seems like it may be a good path toward grounding the discussion...

Thanks,
Justin
On 16 May 2015 at 08:46, Justin Cappos
Yes, so this would not be a situation where "conflicts do not exist (or are very rare)" as my post mentioned. Is this rate of conflicts something you measured or is it a value you made up?
It's drawn from the concrete example of OpenStack, which has a single
group of co-installable releases that cluster together every 6 months.
I don't have the actual valid/invalid ratio there because I don't have
enough machines to calculate it :).
-Rob
--
Robert Collins
On May 15, 2015, at 2:57 PM, Robert Collins wrote: ...
I wonder if it makes sense to give some back-pressure to people, or at the very least encourage them to remove distributions that: - they don't support anymore - have security holes
If folk consider PyPI a sort of historical archive then perhaps we could have a feature to select 'supported' versions by the author, and allow a query parameter to ask for all the versions.
There have been a handful of projects which would only keep the latest N versions uploaded to PyPI. I know this primarily because it has caused people a decent amount of pain over time. It’s common for deployments people have to use a requirements.txt file like ``foo==1.0`` and to just continue to pull from PyPI. Deleting the old files breaks anyone doing that, so it would require either having people bundle their deps in their repositories or some way to get at those old versions.

Personally I think that we shouldn’t go deleting the old versions or encouraging people to do that.

---
Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
Donald Stufft wrote:
It’s common for deployments people have to use a requirements.txt file like ``foo==1.0`` and to just continue to pull from PyPI. Deleting the old files breaks anyone doing that, so it would require either having people bundle their deps in their repositories or some way to get at those old versions. Personally I think that we shouldn’t go deleting the old versions or encouraging people to do that.
Yes, it's common to consider PyPI as a repository of all versions ever released, and to treat it as an archive whose URLs will continue to make available the historical versions.

--
“If history and science have taught us anything, it is that passion and desire are not the same as truth.” —E. O. Wilson, _Consilience_, 1998

Ben Finney
On 16 May 2015 at 07:19, Donald Stufft
There have been a handful of projects which would only keep the latest N versions uploaded to PyPI. ... Personally I think that we shouldn’t go deleting the old versions or encouraging people to do that.
I think 'most recent only' is too much. Most upstreams will support
more than one release. Like - I don't care what testtools release you
use.
OTOH, every version with distinct dependencies becomes a very
expensive liability to the ecosystem here. It's beyond human scale,
and well in the territory of argh wtf the universe is burning around
me and my tardis has run out of power.
I'm sure we can provide an escape hatch in pip (and I'm going to do
that in my branch soon - offering simple 'error on conflict' and 'use
first seen specifier only' strategies) while folk work on different
heuristics - the actual resolver is only ~100 LOC in my branch today -
the rest is refactoring (that can be made better and I plan to do so
before suggesting we merge it).
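Those two strategies could be sketched like so (plain version sets stand in for specifiers; the names and shapes are illustrative, not the actual branch code):

```python
# Two simple conflict strategies over a stream of (package, allowed_versions)
# requirements, with specifiers reduced to plain sets for illustration.

def first_seen_wins(requirements):
    # Keep only the first specifier seen for each package; later,
    # conflicting requirements are silently ignored.
    seen = {}
    for name, allowed in requirements:
        seen.setdefault(name, allowed)
    return seen

def error_on_conflict(requirements):
    # Intersect specifiers; fail loudly as soon as nothing satisfies both.
    merged = {}
    for name, allowed in requirements:
        merged[name] = merged.get(name, allowed) & allowed
        if not merged[name]:
            raise RuntimeError("conflicting requirements for %s" % name)
    return merged

reqs = [("six", {9, 10}), ("six", {10, 11})]
print(first_seen_wins(reqs))    # {'six': {9, 10}}
print(error_on_conflict(reqs))  # {'six': {10}}
```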
But a significant contributing factor is the O of the problem, and we
can do something about that. I don't know what exactly, and I think
we're going to need to have our creative caps firmly on to come up
with something meeting the broad needs of the ecosystem: which
includes pip Just Working.
-Rob
--
Robert Collins
On May 15, 2015, at 9:19 PM, Donald Stufft
wrote: ... Personally I think that we shouldn’t go deleting the old versions or encouraging people to do that.
+1 for this. While I appreciate why Linux distros purge old versions, it is absolutely hellish for reproducibility.

If you are looking for prior art, check out the Molinillo project (https://github.com/CocoaPods/Molinillo) used by Bundler and CocoaPods. It is not as complex as the Solve gem used in Chef, but it offers a good balance of performance in satisfying constraints and false negatives on solution failures.

--Noah
On 16 May 2015 at 06:57, Robert Collins
So, I am working on pip issue 988: pip doesn't resolve packages at all.
This is O(packages^alternatives_per_package): if you are resolving 10 ... Scaling the number of packages is relatively easy; scaling the number of alternatives is harder. Even 300 packages (the dependency tree for openstack) is ~2.4T combinations to probe.
I added a check for the exact number (when the current step limit is hit):
Hit step limit during resolving,
22493640689038530013767184665222125808455708963348534886974974630893524036813561125576881299950281714638872640331745747555743820280235291929928862660035516365300612827387994788286647556890876840654454905860390366740480.000000
from 4038 versions in 205 packages after 100000 steps
Which indicates an alternatives factor of ~20. And AIUI PyPI has a
long tail itself, so folk are more likely to see factors > 5.7 than
lower ones.
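The factor quoted is just mean versions per package, and the combination count is the product of the per-package version counts; a quick sketch with made-up counts (not the real 205-package data):

```python
import math

# Mean "alternatives factor" and the search-space size implied by
# per-package version counts (illustrative counts, not real PyPI data).
def alternatives_factor(version_counts):
    return sum(version_counts) / len(version_counts)

def combination_count(version_counts):
    return math.prod(version_counts)

counts = [4, 40, 12, 25, 19]        # five hypothetical packages
print(alternatives_factor(counts))  # 20.0
print(combination_count(counts))    # 912000
```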
-Rob
--
Robert Collins
On Fri, May 15, 2015 at 2:57 PM, Robert Collins
wrote: ...
If folk consider PyPI a sort of historical archive then perhaps we could have a feature to select 'supported' versions by the author, and allow a query parameter to ask for all the versions.
You could simply limit the number of versions from PyPI you consider.

Jim

--
Jim Fulton
http://www.linkedin.com/in/jimfulton
On 16 May 2015 at 08:27, Jim Fulton wrote:
You could simply limit the number of versions from PyPI you consider.
Yes - it would be nice IMO to give package authors some influence over
that though.
-Rob
--
Robert Collins
Why not start with pip at least being a "simple" fail-on-conflict resolver
(vs the "1st found wins" resolver it is now)...
You'd "backtrack" for the sake of re-walking when new constraints are
found, but not for the purpose of solving conflicts.
I know you're motivated to solve OpenStack build issues, but many of
the issues I've seen in the pip tracker would, I think, be solved
without the backtracking resolver you're trying to build.
On Fri, May 15, 2015 at 11:57 AM, Robert Collins wrote:
So, I am working on pip issue 988: pip doesn't resolve packages at all. ...
On 16 May 2015 at 11:08, Marcus Smith
Why not start with pip at least being a "simple" fail-on-conflict resolver (vs the "1st found wins" resolver it is now)...
Well, I'm scratching the itch I have. If it's too hard to get
something decent, sure, I might back off in my goals, but I see no
point aiming for something less than what all the other
language-specific packaging systems out there have.
-Rob
--
Robert Collins
On May 15, 2015, at 9:22 PM, Robert Collins wrote:
Well, I'm scratching the itch I have. If its too hard to get something decent, sure I might back off in my goals, but I see no point aiming for something less than all the other language specific packaging systems out there have.
So what makes the other language-specific packaging systems different? As far as I know all of them have complete archives (e.g. they are like PyPI in that they have a lot of versions, not like Linux distros). What can we learn from how they solved this?

---
Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
On 16 May 2015 at 13:45, Donald Stufft
wrote: ...
So what makes the other language specific packaging systems different? As far as I know all of them have complete archives (e.g. they are like PyPI where they have a lot of versions, not like Linux Distros). What can we learn from how they solved this?
NB: I have by no means finished the low-hanging heuristics and
space-trimming stuff :). I have some simple things in mind and am sure
I'll end up with something 'good enough' for day to day use. The thing
I'm worried about is the long term health of the approach.
Good questions. Some of it is structural I suspect. A quick rundown.
cabal (haskell) has a backtracking solver that accepts various
parameters to tell it to try harder.
javascript effectively vendors every dep ever, so you end up with many
copies of the same library at different versions in the same process.
rust's cargo system currently solves everything in a single project
only - it has no binary packaging, only vendor-into-a-binary-build
packaging.
The gem behaviour I'm not yet familiar with.
perl I used to know but time has eroded it :/.
-Rob
--
Robert Collins
On Sat, May 16, 2015 at 10:52 AM, Robert Collins wrote:
Good questions. Some of it is structural I suspect. A quick rundown. ...
FWIW, php uses a SAT-based solver in composer, which started as a port of libsolv (the SAT solver used by openSUSE and soon Fedora).

I am no expert, but I don't understand why backtracking algorithms would be faster than SAT, since they both potentially need to walk over the full set of possible solutions. It is hard to reason about the cost because the worst case in theory grows exponentially in both cases.

With a SAT-based algorithm for dependency resolution, it is relatively simple to apply heuristics which massively prune the search space. For example, when considering package A with, say, 10 potential versions A_1, etc., in theory you need to generate the rules:

# - means not install, + means install
- A_1 | - A_2
- A_1 | - A_3
...

and those constitute most of the rules in common cases. But it is possible to tweak the SAT implementation to replace those rules with a single "AtMost one of" rule per *package*, which means the number of rules does not grow much with the number of versions.

The real difficulty of a SAT-based solver is the optimization part: many actually valid solutions are not acceptable, and that's where the heuristics get more complicated.

David
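Written out, the pairwise encoding above looks like this (a sketch: CNF clauses as tuples, with a negative literal meaning "not installed"):

```python
from itertools import combinations

# Pairwise encoding of "install at most one version of A": one clause
# (-A_i OR -A_j) per pair, i.e. O(n^2) clauses for n versions. A native
# "AtMost 1" constraint collapses all of these into a single rule.
def at_most_one_pairwise(literals):
    return [(-a, -b) for a, b in combinations(literals, 2)]

clauses = at_most_one_pairwise(range(1, 11))  # literals 1..10 = A_1..A_10
print(len(clauses))  # 45
```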
I am no expert, but I don't understand why backtracking algorithms would be faster than SAT, since they both potentially need to walk over the full set of possible solutions. It is hard to reason about the cost because the worst case is in theory growing exponentially in both cases.
This is talked about a bit in this thread: https://github.com/pypa/pip/issues/988

Each algorithm could be computationally more efficient. Basically, *if there are no conflicts* backtracking will certainly win. If there are a huge number of conflicts a SAT solver will certainly win. It's not clear where the tipping point is between the two schemes.

However, a better question is: does the computational difference matter? If one is a microsecond faster than the other, I don't think anyone cares. However, from the OPIUM paper (listed off of that thread), it is clear that SAT solver resolution can be slow without optimizations to make them work more like backtracking resolvers. From my experience, backtracking resolvers are also slow when the conflict rate is high.

This only considers computation cost though. Other factors can become more expensive than computation. For example, SAT solvers need all the rules to consider, so a SAT solution needs to effectively download the full dependency graph before starting. A backtracking dependency resolver can just download packages or dependency information as it considers them. The bandwidth cost for SAT solvers should be higher.

Thanks,
Justin

P.S. If you'd like to talk off list, possibly over Skype, I'd be happy to talk more with you and/or Robert about minutiae that others may not care about.
On Sat, May 16, 2015 at 11:36 PM, Justin Cappos
I am no expert, but I don't understand why backtracking algorithms would
to be faster than SAT, since they both potentially need to walk over the full set of possible solutions. It is hard to reason about the cost because the worst case is in theory growing exponentially in both cases.
This is talked about a bit in this thread: https://github.com/pypa/pip/issues/988
Each algorithm could be computationally more efficient. Basically, *if there are no conflicts* backtracking will certainly win. If there are a huge number of conflicts a SAT solver will certainly win. It's not clear where the tipping point is between the two schemes.
However, a better question is: does the computational difference matter? If one is a microsecond faster than the other, I don't think anyone cares. However, from the OPIUM paper (linked from that thread), it is clear that SAT solver resolution can be slow without optimizations that make it work more like a backtracking resolver. From my experience, backtracking resolvers are also slow when the conflict rate is high.
Pure SAT is fast enough in practice in my experience (concretely: solving thousands of rules takes < 1 sec). It becomes more complicated as you need to optimize the solution, especially when you have already installed packages. This is unfortunately not as well discussed in the literature. Pseudo-boolean SAT for optimization was argued to be too slow by the 0install people, but OTOH, this seems to be what's used in conda, so who knows :)

If your SAT solver is in pure Python, you can choose a "direction" of the search which is more meaningful. I believe this is what 0install does, from reading http://0install.net/solver.html, and it is what we have in our own SAT solver code. I unfortunately cannot look at the 0install code myself, as it is under the GPL and I am working on a BSD solver implementation. I also do not know how they handle updates and already installed packages.
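To make the SAT framing concrete: each (package, version) pair becomes a boolean variable, with clauses for "some version is required", "at most one version per package", and dependency implications. The brute-force loop below is only a stand-in for a real SAT solver, and the packages (A, B) and clauses are invented; the point is that the variable ordering plays exactly the role of the "direction" of the search mentioned above — listing versions newest-first makes the first model found prefer new releases.

```python
from itertools import combinations, product

# (package, version) boolean variables, newest-first: this ordering is
# the "direction" knob, since the search below tries True first.
VARS = [("A", 2), ("A", 1), ("B", 2), ("B", 1)]

clauses = [[("A", 1, True), ("A", 2, True)]]            # require some A
for (p1, v1), (p2, v2) in combinations(VARS, 2):
    if p1 == p2:                                        # at most one
        clauses.append([(p1, v1, False), (p2, v2, False)])  # version/pkg
clauses.append([("A", 2, False), ("B", 2, True)])       # A 2 needs B 2
clauses.append([("A", 1, False), ("B", 1, True)])       # A 1 needs B 1

def satisfies(assignment):
    # CNF check: each clause needs one literal with matching polarity.
    return all(any(assignment[(p, v)] == pol for p, v, pol in c)
               for c in clauses)

def solve():
    # Brute force stands in for a real SAT solver here.
    for bits in product([True, False], repeat=len(VARS)):
        model = dict(zip(VARS, bits))
        if satisfies(model):
            return sorted(pv for pv, on in model.items() if on)
    return None

print(solve())  # → [('A', 2), ('B', 2)]
```

Reversing `VARS` to oldest-first would make the same search return A 1 with B 1 instead, which is why access to the solver's decision order matters.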
This only considers computation cost though. Other factors can become more expensive than computation. For example, SAT solvers need all the rules to consider. So a SAT solution needs to effectively download the full dependency graph before starting. A backtracking dependency resolver can just download packages or dependency information as it considers them. The bandwidth cost for SAT solvers should be higher.
With a reasonable representation, I think you can make it small enough. To give an idea, our index @ Enthought containing around 20k packages takes ~340 kb compressed w/ bz2 if you only keep the data required for dependency handling (name, version and runtime dependencies), and that's using json, an inefficient encoding. So I suspect encoding all of PyPI may be only a few MB to fetch, which is generally faster than doing tens of HTTP requests. The libsolv people worked on a binary representation that may also be worth looking at.
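For a rough feel of that compression argument (this is a synthetic index, not Enthought's actual data), one can measure how a dependency-only JSON index shrinks under bz2 — the highly repetitive structure compresses well:

```python
import bz2, json

# Toy index keeping only the fields needed for resolution
# (name, version, runtime deps); entirely invented data.
index = {
    "pkg%d" % i: {str(v): {"depends": ["pkg%d" % ((i + 1) % 500)]}
                  for v in range(1, 6)}
    for i in range(500)
}
raw = json.dumps(index, separators=(",", ":")).encode()
packed = bz2.compress(raw)
print(len(raw), "->", len(packed))  # bz2 shrinks the repetitive JSON a lot
```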
P.S. If you'd like to talk off list, possibly over Skype, I'd be happy to talk more with you and/or Robert about minutiae that others may not care about.
Sure, I would be happy to. As I mentioned before, we have some code around a SAT-based solver, but it is not ready yet, which is why we kept it private (https://github.com/enthought/sat-solver). It handles the case where nothing is installed well (== both speed- and quality-wise), but behaves poorly when packages are already installed, and does not handle the update case yet. The code is also very prototype-ish, but it is not too complicated to experiment with.

David
On May 16, 2015 11:22 AM, "David Cournapeau" wrote:
Pure SAT is fast enough in practice in my experience (concretely: solving thousands of rules takes < 1 sec). It becomes more complicated as you need to optimize the solution, especially when you have already installed packages. Pseudo-boolean SAT for optimization was argued to be too slow by the 0install people, but OTOH, this seems to be what's used in conda, so who knows :)

Where optimizing means something like "find a solution with the newest possible releases of the required packages", not execution speed.
_______________________________________________ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
On Sun, May 17, 2015 at 12:40 AM, Daniel Holth wrote:
Where optimizing means something like "find a solution with the newest possible releases of the required packages", not execution speed.
Indeed, it was not obvious in this context :)

Though in theory, optimization is more general. It could be optimizing w.r.t. a cost function taking into account #packages, download size, minimal number of changes, etc. This is where you want pseudo-boolean SAT, which is what conda uses, I think. 0install, composer and I believe libsolv took a different route, and use heuristics to find a reasonably good solution by picking the next candidate. This requires access to the internals of the SAT solver, though (not a problem if you have a Python implementation).

David
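The cost-function idea can be sketched as follows. This is a hypothetical objective of the kind a pseudo-boolean optimizer would minimize over valid solutions; the weights, field names, and package data are all invented for illustration:

```python
# Hypothetical cost over a candidate solution {name: {"version", "size"}}:
# fewer packages, smaller download, fewer changes vs. what is installed.
def cost(solution, installed, weights=(10, 1, 5)):
    w_pkgs, w_size, w_changes = weights
    n_pkgs = len(solution)
    size = sum(p["size"] for p in solution.values())
    changes = sum(1 for name, p in solution.items()
                  if installed.get(name) != p["version"])
    return w_pkgs * n_pkgs + w_size * size + w_changes * changes

installed = {"numpy": "1.8"}
candidates = [                       # two hypothetical valid solutions
    {"numpy": {"version": "1.9", "size": 4}},
    {"numpy": {"version": "1.8", "size": 4}},
]
best = min(candidates, key=lambda s: cost(s, installed))
print(best["numpy"]["version"])      # → 1.8 (keeping the install is cheaper)
```

Changing the weights changes the policy: a large "changes" weight favors minimal upgrades, while a zero weight there gives newest-wins behavior.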
On 17 May 2015 at 00:36, Justin Cappos wrote:
This only considers computation cost though. Other factors can become more expensive than computation. For example, SAT solvers need all the rules to consider. So a SAT solution needs to effectively download the full dependency graph before starting. A backtracking dependency resolver can just download packages or dependency information as it considers them.
This is the defining consideration for pip at this point: a SAT solver requires publication of static dependency metadata on PyPI, which is dependent on both the Warehouse migration *and* the completion and acceptance of PEP 426. Propagation out to PyPI caching proxies and mirrors like devpi and the pulp-python plugin will then take even longer.

A backtracking resolver doesn't have those gating dependencies, as it can tolerate the current dynamic metadata model.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On May 16, 2015, at 1:24 PM, Nick Coghlan wrote:
This is the defining consideration for pip at this point: a SAT solver requires publication of static dependency metadata on PyPI, which is dependent on both the Warehouse migration *and* the completion and acceptance of PEP 426. Propagation out to PyPI caching proxies and mirrors like devpi and the pulp-python plugin will then take even longer.
A backtracking resolver doesn't have those gating dependencies, as it can tolerate the current dynamic metadata model.
Even when we have Warehouse and PEP 426, that only gives us that data going forward; the 400k files that currently exist on PyPI still won't have static metadata. We could parse it out for wheels, but not for anything else. For the foreseeable future, any solution will need to be able to handle iteratively finding constraints. Though I think a SAT solver can do it, if it can handle incremental solving or just by re-doing the SAT problem each time we discover a new constraint.

---
Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
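The iterative scheme described above — discover constraints as metadata is fetched, re-solving as needed — might look roughly like this. `fetch_metadata` and the toy index are invented stand-ins for per-package PyPI requests, and the solver step is only indicated by a comment:

```python
# Dependencies per package; stands in for metadata that is only
# discoverable one HTTP request at a time. Invented example data.
METADATA = {"app": ["lib"], "lib": ["base"], "base": []}

def fetch_metadata(name):
    return METADATA[name]          # stand-in for an HTTP fetch

def resolve_iteratively(root):
    known, frontier = {}, [root]
    while frontier:                # each pass may reveal new constraints
        name = frontier.pop()
        if name in known:
            continue
        known[name] = fetch_metadata(name)
        frontier.extend(known[name])
        # A real tool would re-run the SAT/backtracking solver over
        # `known` here, bailing out early on an unsatisfiable core.
    return sorted(known)

print(resolve_iteratively("app"))  # → ['app', 'base', 'lib']
```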
On 16 May 2015 at 11:52, Robert Collins wrote:
On 16 May 2015 at 13:45, Donald Stufft wrote:
On May 15, 2015, at 9:22 PM, Robert Collins wrote:
On 16 May 2015 at 11:08, Marcus Smith wrote:
wrote: Why not start with pip at least being a "simple" fail-on-conflict resolver (vs the "1st found wins" resolver it is now)...
You'd "backtrack" for the sake of re-walking when new constraints are found, but not for the purpose of solving conflicts.
I know you're motivated to solve Openstack build issues, but many of the issues I've seen in the pip tracker, I think would be solved without the backtracking resolver you're trying to build.
Well, I'm scratching the itch I have. If it's too hard to get something decent, sure, I might back off in my goals, but I see no point aiming for something less than what all the other language-specific packaging systems out there have.
So what makes the other language-specific packaging systems different? As far as I know, all of them have complete archives (e.g. they are like PyPI in that they keep a lot of versions, not like Linux distros). What can we learn from how they solved this?
NB: I have by no means finished the low-hanging heuristics and space-trimming stuff :). I have some simple things in mind and am sure I'll end up with something 'good enough' for day-to-day use. The thing I'm worried about is the long-term health of the approach.
Longer term, I think it makes sense to have the notion of "active" and "obsolete" versions baked into PyPI's API and the web UI. This wouldn't be baked into the package metadata itself (unlike the proposed "Obsoleted-By" field for project renaming), but rather be a dynamic reflection of whether or not *new* users should be looking at the affected version, and whether or not it should be considered as a candidate for dependency resolution when not specifically requested. (This could also replace the current "hidden versions" feature, which only hides things from the web UI, without having any impact on the information published to automated tools through the programmatic API.)

Tools that list outdated packages could also be simplified a bit, as their first pass could just be to check the obsolescence markers on installed packages, with the second pass being to check for newer versions of those packages.

While the bare minimum would be to let project maintainers set the obsolescence flag directly, we could also potentially offer projects some automated obsolescence schemes, such as:

* single active released version: anything older is marked as obsolete whenever a new (non pre-release) version is uploaded
* semantic versioning: a given maximum number of active released X versions (e.g. 2), where only the most recent (according to PEP 440) released version with a given X.* is active; everything else is obsolete
* CPython-style and date-based versioning: a given maximum number of active released X.Y versions (e.g. 2), where only the most recent (according to PEP 440) released version with a given X.Y.* is active; everything else is obsolete

Pre-release versions could also be automatically flagged as obsolete by PyPI as soon as a newer version for the same release (including the final release itself) was uploaded for the given package.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
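The semantic-versioning scheme in that list could be sketched as below. This is a toy illustration, not a PyPI proposal in code: version parsing is deliberately naive (plain dotted integers, no PEP 440 handling, no pre-releases):

```python
# Keep the latest release of each of the N newest major lines "active";
# everything else would be flagged obsolete. Simplistic version handling.
def active_versions(versions, max_major=2):
    parsed = sorted((tuple(map(int, v.split("."))) for v in versions),
                    reverse=True)
    majors = []
    for v in parsed:               # collect the N newest major lines
        if v[0] not in majors:
            majors.append(v[0])
    majors = majors[:max_major]
    active = {}
    for v in parsed:               # newest release within each line wins
        if v[0] in majors and v[0] not in active:
            active[v[0]] = ".".join(map(str, v))
    return sorted(active.values())

print(active_versions(["1.0", "1.1", "2.0", "2.1", "3.0"]))
# → ['2.1', '3.0']: the 1.x line and superseded 2.0 would be obsolete
```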
participants (10)

- Ben Finney
- Daniel Holth
- David Cournapeau
- Donald Stufft
- Jim Fulton
- Justin Cappos
- Marcus Smith
- Nick Coghlan
- Noah Kantrowitz
- Robert Collins