[Distutils] PyPI is a sick sick hoarder

David Cournapeau cournape at gmail.com
Sat May 16 17:48:55 CEST 2015


On Sun, May 17, 2015 at 12:40 AM, Daniel Holth <dholth at gmail.com> wrote:

>
> On May 16, 2015 11:22 AM, "David Cournapeau" <cournape at gmail.com> wrote:
> >
> >
> >
> > On Sat, May 16, 2015 at 11:36 PM, Justin Cappos <jcappos at nyu.edu> wrote:
> >>>
> >>> I am no expert, but I don't understand why backtracking algorithms
> would to be faster than SAT, since they both potentially need to walk over
> the full set of possible solutions. It is hard to reason about the cost
> because the worst case is in theory growing exponentially in both cases.
> >>
> >>
> >> This is talked about a bit in this thread:
> https://github.com/pypa/pip/issues/988
> >>
> >> Each algorithm could be computationally more efficient.  Basically, *if
> there are no conflicts* backtracking will certainly win.  If there are a
> huge number of conflicts a SAT solver will certainly win.  It's not clear
> where the tipping point is between the two schemes.
> >>
> >> However, a better question is does the computational difference
> matter?  If one is a microsecond faster than the other, I don't think
> anyone cares.  However, from the OPIUM paper (listed off of that thread),
> it is clear that SAT solver resolution can be slow without optimizations to
> make them work more like backtracking resolvers.  From my experience
> backtracking resolvers are also slow when the conflict rate is high.
> >
> >
> > Pure SAT is fast enough in practice in my experience (concretely:
> solving thousand of rules takes < 1 sec). It becomes more complicated as
> you need to optimize the solution, especially when you have already
> installed packages. This is unfortunately not as well discussed in the
> literature. Pseudo-boolean SAT for optimization was argued to be too slow
> by the 0 install people, but OTOH, this seems to be what's used in conda,
> so who knows :)
>
> Where optimizing means something like "find a solution with the newest
> possible releases of the required packages", not execution speed.
>

Indeed, it was not obvious in this context :) Though in theory,
optimization is more general. It could be optimizing w.r.t. a cost function
taking into account #packages, download size, minimal number of changes,
etc... This is where you want a pseudo-boolean SAT, which is what conda
uses I think.

0install, composer and I believe libsolv took a different route, and use
heuristics to find a reasonably good solution by picking the next
candidate. This requires access to the internals of the SAT solver though
(not a problem if you have a python implementation).

David

> > If you SAT solver is in pure python, you can choose a "direction" of the
> search which is more meaningful. I believe this is what 0install does from
> reading http://0install.net/solver.html, and what we have in our own SAT
> solver code. I unfortunately cannot look at the 0install code myself as it
> is under the GPL and am working on a BSD solver implementation. I also do
> not know how they handle updates and already installed packages.
> >
> >>
> >> This only considers computation cost though.  Other factors can become
> more expensive than computation.  For example, SAT solvers need all the
> rules to consider.  So a SAT solution needs to effectively download the
> full dependency graph before starting.  A backtracking dependency resolver
> can just download packages or dependency information as it considers them.
> The bandwidth cost for SAT solvers should be higher.
> >
> >
> > With a reasonable representation, I think you can make it small enough.
> To give an idea, our index @ Enthought containing around 20k packages takes
> ~340 kb compressed w/ bz2 if you only keep the data required for dependency
> handling (name, version and runtime dependencies), and that's using json,
> an inefficient encoding, so I suspect encoding all of pypi may be a few MB
> only fetch, which is generally faster that doing tens of http requests.
> >
> > The libsvolv people worked on a binary representation that may also be
> worth looking at.
> >
> >>
> >> P.S.  If you'd like to talk off list, possibly over Skype, I'd be happy
> to talk more with you and/or Robert about minutiae that others may not care
> about.
> >
> >
> > Sure, I would be happy too. As I mentioned before, we have some code
> around a SAT-based solver, but it is not ready yet, which is why we kept it
> private (https://github.com/enthought/sat-solver). It handles well (==
> both speed and quality-wise) the case where nothing is installed, but
> behaves poorly when packages are already installed, and does not handle the
> update case yet. The code is also very prototype-ish, but is not too
> complicated to experimente with it.
> >
> > David
> >
> >
> > _______________________________________________
> > Distutils-SIG maillist  -  Distutils-SIG at python.org
> > https://mail.python.org/mailman/listinfo/distutils-sig
> >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/distutils-sig/attachments/20150517/178b400e/attachment.html>


More information about the Distutils-SIG mailing list