Due to version and platform constraints, a SAT solver is necessary for conda and (now) pip. Libsolv is one of the fastest around. https://github.com/QuantStack/mamba is conda implemented with libsolv (and parallel downloading of *declarative* dependency metadata). For general purpose graphs with implicit node instantation from edge declaration, NetworkX has implementations of many graph algorithms. https://networkx.github.io/documentation/stable/reference/algorithms/generat... CuGraph (a product of the RAPIDS.ai project) does graphs with CUDA (from Python) with a "NetworkX-like API" but doesn't yet have a topological sort (though it does have BFS) https://github.com/rapidsai/cugraph "pip needs a dependency resolver" + End (Fn+Right) links to the latest work on the new pip code (that must require declarative dependency metadata) https://github.com/pypa/pip/issues/988 That said, this implementation of topo sort could have a deterministic output given an OrderedSet: https://rosettacode.org/wiki/Topological_sort#Python A table including Big-O and memory usage for the necessary data structure and methods would be useful. On Sun, Dec 29, 2019, 5:12 PM Tim Peters <tim.peters@gmail.com> wrote:
[Larry]
It's a lightweight abstract dependency graph. Its nodes are opaque, only required to be hashable. And it doesn't require that you give it all the nodes in strict dependency order.
When you add a node, you can also optionally specify dependencies, and those dependencies aren't required to be nodes that the graph has previously seen. So you can add a node A that depends on B and C, without showing it B or C first. When that happens it creates placeholder nodes for B and C, and naturally these nodes have no dependencies. You can then later inform the graph that node B has dependencies on X Y and Z.
(My specific use case is a package manager. Packages are identified with unique strings. So the nodes I give the give the graph are simply those package names. It's pretty common for me to get a package with dependencies on packages I haven't seen yet. Designing the graph to create placeholders let me skip making two passes over the package list, pre-creating the package objects, etc etc. This made the code simpler and presumably faster.)
The graph API maintains an externally-visible "ready" iterable of nodes. And since B can go from ready to not-ready, it can get added to "ready" and subsequently removed.
Also, when a node is satisfied, you simply remove it from the graph. If A depends on B and C, and they all represent jobs, and B and C have no dependencies, they'll be in the "ready" iterable. You iterate over "ready", and execute those jobs, and assuming they are successful you then remove them from the graph. When A's dependencies are all satisfied, it'll get added to the "ready" iterable. And naturally when B and C were removed from the graph, they were removed from the "ready" iterable too.
OK! You're doing a topological sort. There are natural & simple ways to do that right now that scale efficiently to very large graphs (time & space linear in the sum of the number of nodes and the number of dependencies). Curiously, the difficulties are mostly driven by the quality of _error_ messages you want (in case of, e.g., cycles in the dependency graph).
Lots of those implementations became deterministic "by magic" when ordered dicts were introduced. This is why: a bare bones implementation needs to associate two pieces of info with each node: a list of its immediate successors, and an integer count of the number of its immediate predecessors. A dict is the natural way to implement that association.
When all the dependency info has been entered, then:
- First all nodes are emitted whose predecessor count is 0. Iterating over the association dict once is the natural way to find them, and that order is defined now.
- Then, as each emitted node N is marked done: for child in N.successors: assert child.predcount > 0 child.predcount -= 1 if child.predcount == 0: emit(child)
That's about all there is to it. But there's no end to the cruft that _can_ be added to, e.g., verify that a node is marked done no more than once, or to compute & display strongly connected components if not all nodes are emitted, etc.
Ordered sets could improve this "natural" implementation if the successor lists were successor sets instead, simply because then - like lists - their iteration order would be defined, which can in turn affect the order in which nodes are emitted in the loop just above. But lists are adequate to the task, are cheaper in both time and space, and can't _not_ be ordered ;-)
Thus it's this "ready" collection that I want to a) be iterable, b) be stable, and c) support fast removal. I add and remove nodes from that set a lot, which I realized would perturb the iteration order from run to run, etc etc etc, and here we are.
The sketch above (which is a bog-common way to implement topsorts) doesn't build a data structure recording the final order, and, in fact, never deletes anything. You might protest that the initial iteration step (scanning the association dict to find nodes with no predecessors) is an expense you skip because nodes with predecessors are deleted from your "ready" structure all along. But nodes with predecessors are _touched_ either way, and merely skipping over them is bound to be cheaper than deleting them.
After that initial step, no search of any kind is needed again.
I grant you maybe it's a weird approach. But after two false starts, where I was starting to boil the oceans with ABCs and "node is satisfied" APis and separate iterator objects, I had a re-think and hit on this super lightweight design. I'm actually quite pleased with it--it's short, it has a minimal API, it's readable, it was easy to get right, and it solves my problem.
Whereas I would have started with a topsort and finished it while I was sleeping ;-) Seriously, I've written such a thing many times, but never reuse the code because it's so easy to redo from scratch. Every application has _some_ unique twist, which makes a general-purpose API so bloated it's harder to figure out how to use than to write custom code.
In your application, I'm guessing (but don't know) that when a package name is emitted, it's possible you're not able to find the package, and so the process has to stop. That's why I inserted a "as each emitted node N is marked done" step to my description. That is, I'm picturing an API that adds a (say)
topsorter.record_succeeded(node)
method to say that - yes - the node was processed successfully. Else, by default, its successors (if any) must not be emitted.
But if that isn't needed, the whole topsort order can be materialized into (say) a list in one gulp, or delivered by a generator.
Happy new year,
And to you too! :-) _______________________________________________ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-leave@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/HRKQPTPQ... Code of Conduct: http://python.org/psf/codeofconduct/