On Sun, Dec 29, 2019 at 9:00 PM Wes Turner <wes.turner@gmail.com> wrote:

Due to version and platform constraints, a SAT solver is necessary for conda and (now) pip. Libsolv is one of the fastest around.

https://github.com/QuantStack/mamba is conda implemented with libsolv (and parallel downloading of *declarative* dependency metadata).

For general purpose graphs with implicit node instantation from edge declaration, NetworkX has implementations of many graph algorithms.
https://networkx.github.io/documentation/stable/reference/algorithms/generated/networkx.algorithms.dag.topological_sort.html

CuGraph (a product of the RAPIDS.ai project) does graphs with CUDA (from Python) with a "NetworkX-like API" but doesn't yet have a topological sort (though it does have BFS)
https://github.com/rapidsai/cugraph

"pip needs a dependency resolver" + End (Fn+Right) links to the latest work on the new pip code (that must require declarative dependency metadata)
https://github.com/pypa/pip/issues/988

That said, this implementation of topo sort could have a deterministic output given an OrderedSet:
https://rosettacode.org/wiki/Topological_sort#Python

A table including Big-O and memory usage for the necessary data structure and methods would be useful.

On Sun, Dec 29, 2019, 5:12 PM Tim Peters <tim.peters@gmail.com> wrote:
[Larry]
> It's a lightweight abstract dependency graph. Its nodes are opaque,
> only required to be hashable. And it doesn't require that you give it
> all the nodes in strict dependency order.
>
> When you add a node, you can also optionally specify
> dependencies, and those dependencies aren't required
> to be nodes that the graph has previously seen. So you can add
> a node A that depends on B and C, without showing it B or C
> first. When that happens it creates placeholder nodes for B
> and C, and naturally these nodes have no dependencies. You
> can then later inform the graph that node B has dependencies
> on X Y and Z.
>
> (My specific use case is a package manager. Packages are identified
> with unique strings. So the nodes I give the give the graph are simply
> those package names. It's pretty common for me to get a package
> with dependencies on packages I haven't seen yet. Designing the
> graph to create placeholders let me skip making two passes over
> the package list, pre-creating the package objects, etc etc. This
> made the code simpler and presumably faster.)
>
> The graph API maintains an externally-visible "ready" iterable of
> nodes. And since B can go from ready to not-ready, it can get added
> to "ready" and subsequently removed.
>
> Also, when a node is satisfied, you simply remove it from the graph.
> If A depends on B and C, and they all represent jobs, and B and C
> have no dependencies, they'll be in the "ready" iterable. You iterate
> over "ready", and execute those jobs, and assuming they are
> successful you then remove them from the graph. When A's
> dependencies are all satisfied, it'll get added to the "ready" iterable.
> And naturally when B and C were removed from the graph, they were
> removed from the "ready" iterable too.

OK! You're doing a topological sort. There are natural & simple ways
to do that right now that scale efficiently to very large graphs (time
& space linear in the sum of the number of nodes and the number of
dependencies). Curiously, the difficulties are mostly driven by the
quality of _error_ messages you want (in case of, e.g., cycles in the
dependency graph).

Lots of those implementations became deterministic "by magic" when
ordered dicts were introduced. This is why: a bare bones
implementation needs to associate two pieces of info with each node:
a list of its immediate successors, and an integer count of the number
of its immediate predecessors. A dict is the natural way to implement
that association.

When all the dependency info has been entered, then:

- First all nodes are emitted whose predecessor count is 0. Iterating
over the association dict once is the natural way to find them, and
that order is defined now.

- Then, as each emitted node N is marked done:
for child in N.successors:
assert child.predcount > 0
child.predcount -= 1
if child.predcount == 0:
emit(child)

That's about all there is to it. But there's no end to the cruft that
_can_ be added to, e.g., verify that a node is marked done no more
than once, or to compute & display strongly connected components if
not all nodes are emitted, etc.

Ordered sets could improve this "natural" implementation if the
successor lists were successor sets instead, simply because then -
like lists - their iteration order would be defined, which can in turn
affect the order in which nodes are emitted in the loop just above.
But lists are adequate to the task, are cheaper in both time and
space, and can't _not_ be ordered ;-)

> Thus it's this "ready" collection that I want to a) be iterable, b) be stable, and
> c) support fast removal. I add and remove nodes from that set a lot, which I
> realized would perturb the iteration order from run to run, etc etc etc, and here
> we are.

The sketch above (which is a bog-common way to implement topsorts)
doesn't build a data structure recording the final order, and, in
fact, never deletes anything. You might protest that the initial
iteration step (scanning the association dict to find nodes with no
predecessors) is an expense you skip because nodes with predecessors
are deleted from your "ready" structure all along. But nodes with
predecessors are _touched_ either way, and merely skipping over them
is bound to be cheaper than deleting them.

After that initial step, no search of any kind is needed again.

> I grant you maybe it's a weird approach. But after two false starts, where I
> was starting to boil the oceans with ABCs and "node is satisfied" APis and
> separate iterator objects, I had a re-think and hit on this super lightweight
> design. I'm actually quite pleased with it--it's short, it has a minimal API,
> it's readable, it was easy to get right, and it solves my problem.

Whereas I would have started with a topsort and finished it while I
was sleeping ;-) Seriously, I've written such a thing many times, but
never reuse the code because it's so easy to redo from scratch. Every
application has _some_ unique twist, which makes a general-purpose API
so bloated it's harder to figure out how to use than to write custom
code.

In your application, I'm guessing (but don't know) that when a package
name is emitted, it's possible you're not able to find the package,
and so the process has to stop. That's why I inserted a "as each
emitted node N is marked done" step to my description. That is, I'm
picturing an API that adds a (say)

topsorter.record_succeeded(node)

method to say that - yes - the node was processed successfully. Else,
by default, its successors (if any) must not be emitted.

But if that isn't needed, the whole topsort order can be materialized
into (say) a list in one gulp, or delivered by a generator.

> Happy new year,

And to you too! :-)
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/HRKQPTPQQNIU3DYMYUYRC7USVRJGMRFC/
Code of Conduct: http://python.org/psf/codeofconduct/

_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/6UKXR63M6NYI3EQRXL7AOH3AWMTSNXUW/
Code of Conduct: http://python.org/psf/codeofconduct/