So, I am working on pip issue 988: pip doesn't resolve packages at all.
This is O(alternatives_per_package^packages): if you are resolving 10
packages with 10 versions each, there are approximately 10^10 or 10G
combinations. 10 packages with 100 versions each - 100^10, or 10^20.
So - it's going to depend pretty heavily on some good heuristics in
whatever final algorithm makes its way in, but the problem is
exacerbated by PyPI's nature.
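The counting here is one independent version choice per package; a quick sketch of the arithmetic:

```python
# Each of `packages` packages independently picks one of `alternatives`
# versions, so the search space is alternatives ** packages candidate sets.
def candidate_sets(packages, alternatives):
    return alternatives ** packages

print(candidate_sets(10, 10))   # 10**10, the ~10G figure above
print(candidate_sets(10, 100))  # 100**10 == 10**20
```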
Most Linux distributions (all that I'm aware of) have at most 5
versions of a package to consider at any time - installed (might be
None), current release, current release's security updates, the new
release being upgraded to, and that new release's security updates.
And their common worst case is actually 2 versions: installed ==
current release, plus one new release present. They map alternatives
out into separate packages (e.g. when an older soname is deliberately
kept across an ABI incompatibility, you end up with 2 packages, not 2
versions of one package). So, comparing pip's challenge to apt's:
apt has ~20-30K packages, with alternatives ~= 2, while
pip has ~60K packages, with alternatives ~= 5.7 (I asked dstufft).
Scaling the number of packages is relatively easy; scaling the number
of alternatives is harder. Even 300 packages (the dependency tree for
OpenStack) is ~2.4T combinations to probe.
I wonder if it makes sense to give some back-pressure to people, or at
the very least encourage them to remove distributions that:
- they don't support anymore
- have security holes
If folk consider PyPI a sort of historical archive, then perhaps we
could have a feature for authors to mark which versions they
'support', and allow a query parameter to ask for all the versions.
-Rob
--
Robert Collins
One thing to consider is that if conflicts do not exist (or are very rare),
the number of possible combinations is a moot point. A greedy algorithm
for installation (which just chooses the most favored package to resolve
each dependency) will run in linear time with the number of packages it
would install, if no conflicts exist.
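A minimal sketch of that greedy strategy (the index format here is invented for illustration, not pip's data model):

```python
# Toy index: name -> {version: [(dep_name, allowed_version_set), ...]}.
# Greedy resolution picks the most-favored (here: highest) allowed version
# for each requirement and never revisits a choice, so with no conflicts
# the work is linear in the number of packages installed.
INDEX = {
    "app": {2: [("lib", {1, 2})]},
    "lib": {1: [], 2: []},
}

def greedy_resolve(root):
    chosen = {}
    stack = [(root, set(INDEX[root]))]
    while stack:
        name, allowed = stack.pop()
        if name in chosen:
            if chosen[name] not in allowed:
                raise RuntimeError("conflict on %s" % name)  # greedy gives up here
            continue
        version = max(allowed & set(INDEX[name]))
        chosen[name] = version
        stack.extend(INDEX[name][version])
    return chosen

print(greedy_resolve("app"))  # {'app': 2, 'lib': 2}
```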
So, what you are saying about state exploration may be true for a resolver
that uses something like a SAT solver, but doesn't apply to backtracking
dependency resolution (unless a huge number of conflicts occur) or simple
dependency resolution (at all). SAT solvers do have heuristics to avoid
this blow up, except in pathological cases. However, simple / backtracking
dependency resolution systems have the further advantage of not needing to
request unneeded metadata in the first place...
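For intuition, a toy backtracking search along those lines (hypothetical index shape; only the search skeleton, with no real metadata fetching):

```python
# Toy backtracking resolver: depth-first over version choices, undoing a
# choice only when a later requirement conflicts with it. With no
# conflicts it never backtracks; with many, it can visit the whole space.
# index: name -> {version: {dep_name: allowed_version_set}}
def backtrack(index, pending, chosen):
    if not pending:
        return chosen
    name, allowed = pending[0]
    if name in chosen:
        if chosen[name] in allowed:
            return backtrack(index, pending[1:], chosen)
        return None  # conflict with an earlier choice: forces backtracking
    for version in sorted(allowed & set(index[name]), reverse=True):
        deps = list(index[name][version].items())
        result = backtrack(index, deps + pending[1:], {**chosen, name: version})
        if result is not None:
            return result
    return None

INDEX = {
    "a": {1: {"c": {1}}, 2: {"c": {2}}},
    "b": {1: {"c": {1}}},
    "c": {1: {}, 2: {}},
}
# a==2 forces c==2, which conflicts with b's need for c==1, so the
# search retreats and settles on a==1.
print(backtrack(INDEX, [("a", {1, 2}), ("b", {1})], {}))  # {'a': 1, 'c': 1, 'b': 1}
```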
Thanks,
Justin
On Fri, May 15, 2015 at 2:57 PM, Robert Collins wrote:
So, I am working on pip issue 988: pip doesn't resolve packages at all. ...
--
Robert Collins
Distinguished Technologist
HP Converged Cloud
_______________________________________________
Distutils-SIG maillist - Distutils-SIG@python.org
https://mail.python.org/mailman/listinfo/distutils-sig
On 16 May 2015 at 07:18, Justin Cappos wrote:
One thing to consider is that if conflicts do not exist (or are very rare), the number of possible combinations is a moot point. ...
Your intuition here is misleading, sorry :(.
You're right that 'hey, if everything fits it's linear' - but the
reason we have this bug open, with people adding a new example of it
every week or so (as someone who didn't realise that pip doesn't
resolve finds out and adds it to pip's issue tracker, only to have it
marked as a dupe), is that things often don't fit.
Backtracking recursive resolvers have exactly the same O as SAT.
Example: say I have an ecosystem of 10 packages, A-J, and they do a
release every 6 months that is guaranteed to work together, but every
time some issue occurs which ends up clamping the group together -
e.g. an external release breaks API, so A1's deps are disjoint with
A2's, and then the same happens between A2 and A3. Even though A1's
API is compatible with B2's: it's not internal bad code, it just takes
*one* external dep breaking its API.
After 2 releases you have 2^10 (~1000) combinations, but only the 2
coordinated generations are valid at all. That's ~0.2%. 8 releases
gets you 8^10 (~10^9) combinations with 8 valid, or about 0.0000007%.
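Modelling "only whole coordinated generations co-install", the ratio is straightforward to compute (a toy model of the scenario, not measured data):

```python
# Counting the "clamped releases" scenario: every package independently
# offers one version per release generation, but in this model only whole
# generations actually co-install.
def clamp_stats(packages, releases):
    total = releases ** packages  # all version mixes
    valid = releases              # one coherent set per generation
    return total, valid

total, valid = clamp_stats(10, 8)
print(total, valid)                        # 1073741824 8
print("%.7f%%" % (100.0 * valid / total))  # 0.0000007%
```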
Now there are two things to examine here: how likely is this to happen
to PyPI users, and can a backtracker (which, btw, is what my code is)
handle this better than a SAT solver?
In terms of likelihood - OpenStack hits this every release. It's not
that our libraries are incompatible with each other; it's that given
250 packages (the 200 in the error I quoted just shows that the
resolver hadn't obtained version data for everything), *something*
breaks API compat in each 6 month release cycle, and so you end up
with the whole set effectively locking together. In fact, it has
happened so consistently to OpenStack that we now release our
libraries with closed specifiers: >=min_version,
Example: say I have an ecosystem of 10 packages. A-J. ...
Yes, so this would not be a situation where "conflicts do not exist (or are very rare)" as my post mentioned. Is this rate of conflicts something you measured or is it a value you made up?

I don't hear anyone arguing that the status quo makes sense. I think we're mostly just chatting about the right thing to optimize the solution for and what sorts of short cuts may be useful (or even necessary). Since we can measure the actual conflict and other values in practice, data seems like it may be a good path toward grounding the discussion...

Thanks,
Justin
On 16 May 2015 at 08:46, Justin Cappos
Yes, so this would not be a situation where "conflicts do not exist (or are very rare)" as my post mentioned. Is this rate of conflicts something you measured or is it a value you made up?
It's drawn from the concrete example of OpenStack, which has a single
group of co-installable releases that cluster together every 6 months.
I don't have the actual valid/invalid ratio there because I don't have
enough machines to calculate it :).
-Rob
--
Robert Collins
On May 15, 2015, at 2:57 PM, Robert Collins wrote: ...
I wonder if it makes sense to give some back-pressure to people, or at the very least encourage them to remove distributions that: - they don't support anymore - have security holes
If folk consider PyPI a sort of historical archive then perhaps we could have a feature to select 'supported' versions by the author, and allow a query parameter to ask for all the versions.
There have been a handful of projects which would only keep the latest N versions uploaded to PyPI. I know this primarily because it has caused people a decent amount of pain over time. It’s common for deployments people have to use a requirements.txt file like ``foo==1.0`` and to just continue to pull from PyPI. Deleting the old files breaks anyone doing that, so it would require either having people bundle their deps in their repositories or some way to get at those old versions.

Personally I think that we shouldn’t go deleting the old versions or encouraging people to do that.

---
Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
Donald Stufft wrote:
It’s common for deployments people have to use a requirements.txt file like ``foo==1.0`` and to just continue to pull from PyPI. Deleting the old files breaks anyone doing that, so it would require either having people bundle their deps in their repositories or some way to get at those old versions. Personally I think that we shouldn’t go deleting the old versions or encouraging people to do that.
Yes, it's common to consider PyPI as a repository of all versions ever released, and to treat it as an archive whose URLs will continue to make available the historical versions.

--
“If history and science have taught us anything, it is that passion and desire are not the same as truth.” —E. O. Wilson, _Consilience_, 1998

Ben Finney
On 16 May 2015 at 07:19, Donald Stufft
There have been a handful of projects which would only keep the latest N versions uploaded to PyPI. ... Personally I think that we shouldn’t go deleting the old versions or encouraging people to do that.
I think 'most recent only' is too much. Most upstreams will support
more than one release. Like - I don't care what testtools release you
use.
OTOH, every version with distinct dependencies becomes a very
expensive liability to the ecosystem here. It's beyond human scale,
and well in the territory of argh wtf the universe is burning around
me and my tardis has run out of power.
I'm sure we can provide an escape hatch in pip (and I'm going to do
that in my branch soon - offering simple 'error on conflict' and 'use
first seen specifier only' strategies) while folk work on different
heuristics - the actual resolver is only ~100 LOC in my branch today -
the rest is refactoring (that can be made better and I plan to do so
before suggesting we merge it).
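Those two strategies could be sketched like so (plain version sets stand in for specifiers; the names and shapes are illustrative, not the actual branch code):

```python
# Two simple conflict strategies over a stream of (package, allowed_versions)
# requirements, with specifiers reduced to plain sets for illustration.

def first_seen_wins(requirements):
    # Keep only the first specifier seen for each package; later,
    # conflicting requirements are silently ignored.
    seen = {}
    for name, allowed in requirements:
        seen.setdefault(name, allowed)
    return seen

def error_on_conflict(requirements):
    # Intersect specifiers; fail loudly as soon as nothing satisfies both.
    merged = {}
    for name, allowed in requirements:
        merged[name] = merged.get(name, allowed) & allowed
        if not merged[name]:
            raise RuntimeError("conflicting requirements for %s" % name)
    return merged

reqs = [("six", {9, 10}), ("six", {10, 11})]
print(first_seen_wins(reqs))    # {'six': {9, 10}}
print(error_on_conflict(reqs))  # {'six': {10}}
```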
But a significant contributing factor is the O of the problem, and we
can do something about that. I don't know what exactly, and I think
we're going to need to have our creative caps firmly on to come up
with something meeting the broad needs of the ecosystem: which
includes pip Just Working.
-Rob
--
Robert Collins
On May 15, 2015, at 9:19 PM, Donald Stufft
wrote: ... Personally I think that we shouldn’t go deleting the old versions or encouraging people to do that.
+1 for this. While I appreciate why Linux distros purge old versions, it is absolutely hellish for reproducibility.

If you are looking for prior art, check out the Molinillo project (https://github.com/CocoaPods/Molinillo) used by Bundler and CocoaPods. It is not as complex as the Solve gem used in Chef, but it offers a good balance of performance in satisfying constraints and false negatives on solution failures.

--Noah
On 16 May 2015 at 06:57, Robert Collins
So, I am working on pip issue 988: pip doesn't resolve packages at all.
This is O(packages^alternatives_per_package): if you are resolving 10 ... Scaling the number of packages is relatively easy; scaling the number of alternatives is harder. Even 300 packages (the dependency tree for openstack) is ~2.4T combinations to probe.
I added a check for the exact number (when the current step limit is hit):
Hit step limit during resolving,
22493640689038530013767184665222125808455708963348534886974974630893524036813561125576881299950281714638872640331745747555743820280235291929928862660035516365300612827387994788286647556890876840654454905860390366740480.000000
from 4038 versions in 205 packages after 100000 steps
Which indicates an alternatives factor of ~20. And AIUI PyPI has a
long tail itself, so folk are more likely to see factors > 5.7 than
lower ones.
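The factor quoted is just mean versions per package, and the combination count is the product of the per-package version counts; a quick sketch with made-up counts (not the real 205-package data):

```python
import math

# Mean "alternatives factor" and the search-space size implied by
# per-package version counts (illustrative counts, not real PyPI data).
def alternatives_factor(version_counts):
    return sum(version_counts) / len(version_counts)

def combination_count(version_counts):
    return math.prod(version_counts)

counts = [4, 40, 12, 25, 19]        # five hypothetical packages
print(alternatives_factor(counts))  # 20.0
print(combination_count(counts))    # 912000
```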
-Rob
--
Robert Collins
On Fri, May 15, 2015 at 2:57 PM, Robert Collins
wrote: ...
If folk consider PyPI a sort of historical archive then perhaps we could have a feature to select 'supported' versions by the author, and allow a query parameter to ask for all the versions.
You could simply limit the number of versions from PyPI you consider.

Jim

--
Jim Fulton
http://www.linkedin.com/in/jimfulton
On 16 May 2015 at 08:27, Jim Fulton wrote:
You could simply limit the number of versions from PyPI you consider.
Yes - it would be nice IMO to give package authors some influence over
that though.
-Rob
--
Robert Collins
Why not start with pip at least being a "simple" fail-on-conflict resolver
(vs the "1st found wins" resolver it is now)...
You'd "backtrack" for the sake of re-walking when new constraints are
found, but not for the purpose of solving conflicts.
I know you're motivated to solve OpenStack build issues, but many of
the issues I've seen in the pip tracker would, I think, be solved
without the backtracking resolver you're trying to build.
On Fri, May 15, 2015 at 11:57 AM, Robert Collins wrote:
So, I am working on pip issue 988: pip doesn't resolve packages at all. ...
On 16 May 2015 at 11:08, Marcus Smith
Why not start with pip at least being a "simple" fail-on-conflict resolver (vs the "1st found wins" resolver it is now)...
Well, I'm scratching the itch I have. If it's too hard to get
something decent, sure, I might back off in my goals, but I see no
point aiming for something less than what all the other
language-specific packaging systems out there have.
-Rob
--
Robert Collins
On May 15, 2015, at 9:22 PM, Robert Collins wrote:
Well, I'm scratching the itch I have. If its too hard to get something decent, sure I might back off in my goals, but I see no point aiming for something less than all the other language specific packaging systems out there have.
So what makes the other language-specific packaging systems different? As far as I know all of them have complete archives (e.g. they are like PyPI in that they have a lot of versions, not like Linux distros). What can we learn from how they solved this?

---
Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
On 16 May 2015 at 13:45, Donald Stufft
wrote: ...
So what makes the other language specific packaging systems different? As far as I know all of them have complete archives (e.g. they are like PyPI where they have a lot of versions, not like Linux Distros). What can we learn from how they solved this?
NB: I have by no means finished the low-hanging heuristics and
space-trimming stuff :). I have some simple things in mind and am sure
I'll end up with something 'good enough' for day to day use. The thing
I'm worried about is the long term health of the approach.
Good questions. Some of it is structural I suspect. A quick rundown.
cabal (haskell) has a backtracking solver that accepts various
parameters to tell it to try harder.
javascript effectively vendors every dep ever, so you end up with many
copies of the same library at different versions in the same process.
rust's cargo system currently solves everything in a single project
only - it has no binary packaging, only vendor-into-a-binary-build
packaging.
The gem behaviour I'm not yet familiar with.
perl I used to know but time has eroded it :/.
-Rob
--
Robert Collins
On Sat, May 16, 2015 at 10:52 AM, Robert Collins wrote:
Good questions. Some of it is structural I suspect. A quick rundown. ...
FWIW, php uses a SAT-based solver in composer, which started as a port of libsolv (the SAT solver used by openSUSE and soon Fedora).

I am no expert, but I don't understand why backtracking algorithms would be faster than SAT, since they both potentially need to walk over the full set of possible solutions. It is hard to reason about the cost because the worst case in theory grows exponentially in both cases.

With a SAT-based algorithm for dependency resolution, it is relatively simple to apply heuristics which massively prune the search space. For example, when considering package A with, say, 10 potential versions A_1, etc., in theory you need to generate the rules:

# - means not install, + means install
- A_1 | - A_2
- A_1 | - A_3
...

and those constitute most of the rules in common cases. But it is possible to tweak the SAT implementation to replace those rules with a single "AtMost one of" rule per *package*, which means the number of rules does not grow much with the number of versions.

The real difficulty of a SAT-based solver is the optimization part: many actually valid solutions are not acceptable, and that's where the heuristics get more complicated.

David
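Written out, the pairwise encoding above looks like this (a sketch: CNF clauses as tuples, with a negative literal meaning "not installed"):

```python
from itertools import combinations

# Pairwise encoding of "install at most one version of A": one clause
# (-A_i OR -A_j) per pair, i.e. O(n^2) clauses for n versions. A native
# "AtMost 1" constraint collapses all of these into a single rule.
def at_most_one_pairwise(literals):
    return [(-a, -b) for a, b in combinations(literals, 2)]

clauses = at_most_one_pairwise(range(1, 11))  # literals 1..10 = A_1..A_10
print(len(clauses))  # 45
```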
I am no expert, but I don't understand why backtracking algorithms would be faster than SAT, since they both potentially need to walk over the full set of possible solutions. It is hard to reason about the cost because the worst case is in theory growing exponentially in both cases.
This is talked about a bit in this thread: https://github.com/pypa/pip/issues/988

Each algorithm could be computationally more efficient. Basically, *if there are no conflicts* backtracking will certainly win. If there are a huge number of conflicts a SAT solver will certainly win. It's not clear where the tipping point is between the two schemes.

However, a better question is: does the computational difference matter? If one is a microsecond faster than the other, I don't think anyone cares. However, from the OPIUM paper (listed off of that thread), it is clear that SAT solver resolution can be slow without optimizations to make them work more like backtracking resolvers. From my experience, backtracking resolvers are also slow when the conflict rate is high.

This only considers computation cost though. Other factors can become more expensive than computation. For example, SAT solvers need all the rules to consider, so a SAT solution needs to effectively download the full dependency graph before starting. A backtracking dependency resolver can just download packages or dependency information as it considers them. The bandwidth cost for SAT solvers should be higher.

Thanks,
Justin

P.S. If you'd like to talk off list, possibly over Skype, I'd be happy to talk more with you and/or Robert about minutiae that others may not care about.
On Sat, May 16, 2015 at 11:36 PM, Justin Cappos
I am no expert, but I don't understand why backtracking algorithms would
to be faster than SAT, since they both potentially need to walk over the full set of possible solutions. It is hard to reason about the cost because the worst case is in theory growing exponentially in both cases.
This is talked about a bit in this thread: https://github.com/pypa/pip/issues/988
Each algorithm could be computationally more efficient. Basically, *if there are no conflicts* backtracking will certainly win. If there are a huge number of conflicts a SAT solver will certainly win. It's not clear where the tipping point is between the two schemes.
However, a better question is: does the computational difference matter? If one is a microsecond faster than the other, I don't think anyone cares. However, from the OPIUM paper (linked from that thread), it is clear that SAT solver resolution can be slow without optimizations that make it work more like a backtracking resolver. From my experience, backtracking resolvers are also slow when the conflict rate is high.
Pure SAT is fast enough in practice in my experience (concretely: solving thousands of rules takes < 1 sec). It becomes more complicated as you need to optimize the solution, especially when you have already installed packages. This is unfortunately not as well discussed in the literature. Pseudo-boolean SAT for optimization was argued to be too slow by the 0install people, but OTOH, this seems to be what's used in conda, so who knows :)

If your SAT solver is in pure Python, you can choose a "direction" of the search which is more meaningful. I believe this is what 0install does, from reading http://0install.net/solver.html, and it is what we have in our own SAT solver code. I unfortunately cannot look at the 0install code myself, as it is under the GPL and I am working on a BSD solver implementation. I also do not know how they handle updates and already installed packages.
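To make the SAT framing concrete: each (package, version) pair becomes a boolean variable, with clauses for "some version is required", "at most one version per package", and dependency implications. The brute-force loop below is only a stand-in for a real SAT solver, and the packages (A, B) and clauses are invented; the point is that the variable ordering plays exactly the role of the "direction" of the search mentioned above — listing versions newest-first makes the first model found prefer new releases.

```python
from itertools import combinations, product

# (package, version) boolean variables, newest-first: this ordering is
# the "direction" knob, since the search below tries True first.
VARS = [("A", 2), ("A", 1), ("B", 2), ("B", 1)]

clauses = [[("A", 1, True), ("A", 2, True)]]            # require some A
for (p1, v1), (p2, v2) in combinations(VARS, 2):
    if p1 == p2:                                        # at most one
        clauses.append([(p1, v1, False), (p2, v2, False)])  # version/pkg
clauses.append([("A", 2, False), ("B", 2, True)])       # A 2 needs B 2
clauses.append([("A", 1, False), ("B", 1, True)])       # A 1 needs B 1

def satisfies(assignment):
    # CNF check: each clause needs one literal with matching polarity.
    return all(any(assignment[(p, v)] == pol for p, v, pol in c)
               for c in clauses)

def solve():
    # Brute force stands in for a real SAT solver here.
    for bits in product([True, False], repeat=len(VARS)):
        model = dict(zip(VARS, bits))
        if satisfies(model):
            return sorted(pv for pv, on in model.items() if on)
    return None

print(solve())  # → [('A', 2), ('B', 2)]
```

Reversing `VARS` to oldest-first would make the same search return A 1 with B 1 instead, which is why access to the solver's decision order matters.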
This only considers computation cost though. Other factors can become more expensive than computation. For example, SAT solvers need all the rules to consider. So a SAT solution needs to effectively download the full dependency graph before starting. A backtracking dependency resolver can just download packages or dependency information as it considers them. The bandwidth cost for SAT solvers should be higher.
With a reasonable representation, I think you can make it small enough. To give an idea, our index @ Enthought containing around 20k packages takes ~340 kb compressed w/ bz2 if you only keep the data required for dependency handling (name, version and runtime dependencies), and that's using json, an inefficient encoding. So I suspect encoding all of PyPI may be only a few MB to fetch, which is generally faster than doing tens of HTTP requests. The libsolv people worked on a binary representation that may also be worth looking at.
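For a rough feel of that compression argument (this is a synthetic index, not Enthought's actual data), one can measure how a dependency-only JSON index shrinks under bz2 — the highly repetitive structure compresses well:

```python
import bz2, json

# Toy index keeping only the fields needed for resolution
# (name, version, runtime deps); entirely invented data.
index = {
    "pkg%d" % i: {str(v): {"depends": ["pkg%d" % ((i + 1) % 500)]}
                  for v in range(1, 6)}
    for i in range(500)
}
raw = json.dumps(index, separators=(",", ":")).encode()
packed = bz2.compress(raw)
print(len(raw), "->", len(packed))  # bz2 shrinks the repetitive JSON a lot
```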
P.S. If you'd like to talk off list, possibly over Skype, I'd be happy to talk more with you and/or Robert about minutiae that others may not care about.
Sure, I would be happy to. As I mentioned before, we have some code around a SAT-based solver, but it is not ready yet, which is why we kept it private (https://github.com/enthought/sat-solver). It handles the case where nothing is installed well (== both speed- and quality-wise), but behaves poorly when packages are already installed, and does not handle the update case yet. The code is also very prototype-ish, but it is not too complicated to experiment with.

David
On May 16, 2015 11:22 AM, "David Cournapeau" wrote:
Pure SAT is fast enough in practice in my experience (concretely: solving thousands of rules takes < 1 sec). It becomes more complicated as you need to optimize the solution, especially when you have already installed packages. Pseudo-boolean SAT for optimization was argued to be too slow by the 0install people, but OTOH, this seems to be what's used in conda, so who knows :)

Where optimizing means something like "find a solution with the newest possible releases of the required packages", not execution speed.
_______________________________________________ Distutils-SIG maillist - Distutils-SIG@python.org https://mail.python.org/mailman/listinfo/distutils-sig
On Sun, May 17, 2015 at 12:40 AM, Daniel Holth wrote:
Where optimizing means something like "find a solution with the newest possible releases of the required packages", not execution speed.
Indeed, it was not obvious in this context :)

Though in theory, optimization is more general. It could be optimizing w.r.t. a cost function taking into account #packages, download size, minimal number of changes, etc. This is where you want pseudo-boolean SAT, which is what conda uses, I think. 0install, composer and I believe libsolv took a different route, and use heuristics to find a reasonably good solution by picking the next candidate. This requires access to the internals of the SAT solver, though (not a problem if you have a Python implementation).

David
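The cost-function idea can be sketched as follows. This is a hypothetical objective of the kind a pseudo-boolean optimizer would minimize over valid solutions; the weights, field names, and package data are all invented for illustration:

```python
# Hypothetical cost over a candidate solution {name: {"version", "size"}}:
# fewer packages, smaller download, fewer changes vs. what is installed.
def cost(solution, installed, weights=(10, 1, 5)):
    w_pkgs, w_size, w_changes = weights
    n_pkgs = len(solution)
    size = sum(p["size"] for p in solution.values())
    changes = sum(1 for name, p in solution.items()
                  if installed.get(name) != p["version"])
    return w_pkgs * n_pkgs + w_size * size + w_changes * changes

installed = {"numpy": "1.8"}
candidates = [                       # two hypothetical valid solutions
    {"numpy": {"version": "1.9", "size": 4}},
    {"numpy": {"version": "1.8", "size": 4}},
]
best = min(candidates, key=lambda s: cost(s, installed))
print(best["numpy"]["version"])      # → 1.8 (keeping the install is cheaper)
```

Changing the weights changes the policy: a large "changes" weight favors minimal upgrades, while a zero weight there gives newest-wins behavior.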
On 17 May 2015 at 00:36, Justin Cappos wrote:
This only considers computation cost though. Other factors can become more expensive than computation. For example, SAT solvers need all the rules to consider. So a SAT solution needs to effectively download the full dependency graph before starting. A backtracking dependency resolver can just download packages or dependency information as it considers them.
This is the defining consideration for pip at this point: a SAT solver requires publication of static dependency metadata on PyPI, which is dependent on both the Warehouse migration *and* the completion and acceptance of PEP 426. Propagation out to PyPI caching proxies and mirrors like devpi and the pulp-python plugin will then take even longer.

A backtracking resolver doesn't have those gating dependencies, as it can tolerate the current dynamic metadata model.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On May 16, 2015, at 1:24 PM, Nick Coghlan wrote:
This is the defining consideration for pip at this point: a SAT solver requires publication of static dependency metadata on PyPI, which is dependent on both the Warehouse migration *and* the completion and acceptance of PEP 426. Propagation out to PyPI caching proxies and mirrors like devpi and the pulp-python plugin will then take even longer.
A backtracking resolver doesn't have those gating dependencies, as it can tolerate the current dynamic metadata model.
Even when we have Warehouse and PEP 426, that only gives us that data going forward; the 400k files that currently exist on PyPI still won't have static metadata. We could parse it out for wheels, but not for anything else. For the foreseeable future, any solution will need to be able to handle iteratively finding constraints. Though I think a SAT solver can do it, if it can handle incremental solving or just by re-doing the SAT problem each time we discover a new constraint.

---
Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
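The iterative scheme described above — discover constraints as metadata is fetched, re-solving as needed — might look roughly like this. `fetch_metadata` and the toy index are invented stand-ins for per-package PyPI requests, and the solver step is only indicated by a comment:

```python
# Dependencies per package; stands in for metadata that is only
# discoverable one HTTP request at a time. Invented example data.
METADATA = {"app": ["lib"], "lib": ["base"], "base": []}

def fetch_metadata(name):
    return METADATA[name]          # stand-in for an HTTP fetch

def resolve_iteratively(root):
    known, frontier = {}, [root]
    while frontier:                # each pass may reveal new constraints
        name = frontier.pop()
        if name in known:
            continue
        known[name] = fetch_metadata(name)
        frontier.extend(known[name])
        # A real tool would re-run the SAT/backtracking solver over
        # `known` here, bailing out early on an unsatisfiable core.
    return sorted(known)

print(resolve_iteratively("app"))  # → ['app', 'base', 'lib']
```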
On 16 May 2015 at 11:52, Robert Collins wrote:
On 16 May 2015 at 13:45, Donald Stufft wrote:
On May 15, 2015, at 9:22 PM, Robert Collins wrote:
On 16 May 2015 at 11:08, Marcus Smith wrote:
wrote: Why not start with pip at least being a "simple" fail-on-conflict resolver (vs the "1st found wins" resolver it is now)...
You'd "backtrack" for the sake of re-walking when new constraints are found, but not for the purpose of solving conflicts.
I know you're motivated to solve Openstack build issues, but many of the issues I've seen in the pip tracker, I think would be solved without the backtracking resolver you're trying to build.
Well, I'm scratching the itch I have. If it's too hard to get something decent, sure, I might back off in my goals, but I see no point aiming for something less than what all the other language-specific packaging systems out there have.
So what makes the other language-specific packaging systems different? As far as I know, all of them have complete archives (e.g. they are like PyPI in that they keep a lot of versions, not like Linux distros). What can we learn from how they solved this?
NB: I have by no means finished the low-hanging heuristics and space-trimming stuff :). I have some simple things in mind and am sure I'll end up with something 'good enough' for day-to-day use. The thing I'm worried about is the long-term health of the approach.
Longer term, I think it makes sense to have the notion of "active" and "obsolete" versions baked into PyPI's API and the web UI. This wouldn't be baked into the package metadata itself (unlike the proposed "Obsoleted-By" field for project renaming), but rather be a dynamic reflection of whether or not *new* users should be looking at the affected version, and whether or not it should be considered as a candidate for dependency resolution when not specifically requested. (This could also replace the current "hidden versions" feature, which only hides things from the web UI, without having any impact on the information published to automated tools through the programmatic API.)

Tools that list outdated packages could also be simplified a bit, as their first pass could just be to check the obsolescence markers on installed packages, with the second pass being to check for newer versions of those packages.

While the bare minimum would be to let project maintainers set the obsolescence flag directly, we could also potentially offer projects some automated obsolescence schemes, such as:

* single active released version: anything older is marked as obsolete whenever a new (non pre-release) version is uploaded
* semantic versioning: a given maximum number of active released X versions (e.g. 2), where only the most recent (according to PEP 440) released version with a given X.* is active; everything else is obsolete
* CPython-style and date-based versioning: a given maximum number of active released X.Y versions (e.g. 2), where only the most recent (according to PEP 440) released version with a given X.Y.* is active; everything else is obsolete

Pre-release versions could also be automatically flagged as obsolete by PyPI as soon as a newer version for the same release (including the final release itself) was uploaded for the given package.

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
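The semantic-versioning scheme in that list could be sketched as below. This is a toy illustration, not a PyPI proposal in code: version parsing is deliberately naive (plain dotted integers, no PEP 440 handling, no pre-releases):

```python
# Keep the latest release of each of the N newest major lines "active";
# everything else would be flagged obsolete. Simplistic version handling.
def active_versions(versions, max_major=2):
    parsed = sorted((tuple(map(int, v.split("."))) for v in versions),
                    reverse=True)
    majors = []
    for v in parsed:               # collect the N newest major lines
        if v[0] not in majors:
            majors.append(v[0])
    majors = majors[:max_major]
    active = {}
    for v in parsed:               # newest release within each line wins
        if v[0] in majors and v[0] not in active:
            active[v[0]] = ".".join(map(str, v))
    return sorted(active.values())

print(active_versions(["1.0", "1.1", "2.0", "2.1", "3.0"]))
# → ['2.1', '3.0']: the 1.x line and superseded 2.0 would be obsolete
```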
participants (10)

- Ben Finney
- Daniel Holth
- David Cournapeau
- Donald Stufft
- Jim Fulton
- Justin Cappos
- Marcus Smith
- Nick Coghlan
- Noah Kantrowitz
- Robert Collins