[Distutils] PyPI is a sick sick hoarder

Robert Collins robertc at robertcollins.net
Fri May 15 21:58:26 CEST 2015


On 16 May 2015 at 07:18, Justin Cappos <jcappos at nyu.edu> wrote:
> One thing to consider is that if conflicts do not exist (or are very rare),
> the number of possible combinations is a moot point.  A greedy algorithm for
> installation (which just chooses the most favored package to resolve each
> dependency) will run in linear time with the number of packages it would
> install, if no conflicts exist.
>
> So, what you are saying about state exploration may be true for a resolver
> that uses something like a SAT solver, but doesn't apply to backtracking
> dependency resolution (unless a huge number of conflicts occur) or simple
> dependency resolution (at all).  SAT solvers do have heuristics to avoid
> this blow up, except in pathological cases.  However, simple / backtracking
> dependency resolution systems have the further advantage of not needing to
> request unneeded metadata in the first place...

Your intuition here is misleading, sorry :(.

You're right about 'hey if everything fits its linear', but the reason
we have this bug open, with people adding a new example of it every
week or so (as someone who didn't realise that pip doesn't resolve
finds out and adds it to pip's issue tracker, only to have it made
into a dupe).

Backtracking recursive resolvers have exactly the same O as SAT.

Example: say I have an ecosystem of 10 packages. A-J. And they do a
release every 6 months that is guaranteed to work together, but every
time some issue occurs which ends up clamping the group together- e.g.
an external release breaks API and so A1s deps are disjoint with A2s,
and then the same between A2 and A3. Even though A1's API is
compatible with B2's: its not internal bad code, its just taking *one*
external dep breaking its API.

After 2 releases you have 10^2 combinations, but only 4 are valid at
all. Thats 4%. 8 releases gets you 10^8, 8 valid combinations, or
0.0000008%.

Now there are two things to examine here. How likely is this to happen
to PyPI users, and can a backtracker (which btw is what my code is)
handle this better than a SAT solver.

In terms of likelyhood - OpenStack hits this every release. Its not
that our libraries are incompatible with each other, its that given
250 packages (the 200 in error I quoted just shows that the resolver
hadn't obtained version data for everything), *something* breaks API
compat in each 6 month release cycle, and so you end up with the whole
set effectively locking together. In fact, it has happened so
consistently to OpenStack that we now release our libraries with
closed specifiers : >=min_version, <next_version.

Secondly, backtrackers. Assume nothing is installed, and you want the
latest release. Then sure,
for a_version in A:
   for b_version in B:
      for c_version in C:

etc will hit the most recent release first time, and you're golden.

Assume you have the prior release installed, and you ran pip without
-U, but something that pulls in the latest release of one lib (which
then nabs everything). e.g. pip install J>3.0 (and we have releases
1,2,3,4 of A through J).

Now, for a_version in A: is going to have the installed version of A
in its first step. B likewise. So we'll end up with a trace like:
A==3
B==3
C==3
D==3
E==3
F==3
G==3
H==3
I==3
J==3 error, backtrack (user specifier)
J==4 error, backtracks once the external deps are considered and the
conflict is found
J==2 error, backtrack (user specifier)
J==1 error, backtrack (user specifier)
I==4
J==3 error, backtrack (user specifier)
J==4 error, backtracks somewere in the external deps
J==2, error, backtrack (user specifier)

and so on, until we finally tick over to A==4.

More generally, any already installed version (without -U) can cause a
backtracking resolver to try *all* possible other combinations before
realising that that installed version is the problem and bumping it. A
heuristic to look for those and bump them first then hits molasses as
soon as one of the installed versions needs to be kept as-is.

Anyhow, my goal here is to start the conversation; pip will need some
knobs because no matter how good the heuristics users will need escape
hatches. (One of which is to fully specify their needs).

-Rob


-- 
Robert Collins <rbtcollins at hp.com>
Distinguished Technologist
HP Converged Cloud


More information about the Distutils-SIG mailing list