[Python-Dev] Re: Should set objects maintain insertion order too?

30 Dec 2019

      Due to version and platform constraints, a SAT solver is necessary for
conda and (now) pip. Libsolv is one of the fastest around.

 https://github.com/QuantStack/mamba is conda implemented with libsolv (and
parallel downloading of *declarative* dependency metadata).

For general purpose graphs with implicit node instantation from edge
declaration, NetworkX has implementations of many graph algorithms.
https://networkx.github.io/documentation/stable/reference/algorithms/generat...

CuGraph (a product of the RAPIDS.ai project) does graphs with CUDA (from
Python) with a "NetworkX-like API" but doesn't yet have a topological sort
(though it does have BFS)
https://github.com/rapidsai/cugraph

"pip needs a dependency resolver" + End (Fn+Right) links to the latest work
on the new pip code (that must require declarative dependency metadata)
https://github.com/pypa/pip/issues/988

That said, this implementation of topo sort could have a deterministic
output given an OrderedSet:
https://rosettacode.org/wiki/Topological_sort#Python

A table including Big-O and memory usage for the necessary data structure
and methods would be useful.

On Sun, Dec 29, 2019, 5:12 PM Tim Peters <tim.peters@gmail.com> wrote:
...
[Larry]
...
It's a lightweight abstract dependency graph.  Its nodes are opaque,
only required to be hashable.  And it doesn't require that you give it
all the nodes in strict dependency order.
When you add a node, you can also optionally specify
dependencies, and those dependencies aren't required
to be nodes that the graph has previously seen.  So you can add
a node A that depends on B and C, without showing it B or C
first.  When that happens it creates placeholder nodes for B
and C, and naturally these nodes have no dependencies.  You
can then later inform the graph that node B has dependencies
on X Y and Z.
(My specific use case is a package manager.  Packages are identified
with unique strings.  So the nodes I give the give the graph are simply
those package names.  It's pretty common for me to get a package
with dependencies on packages I haven't seen yet.  Designing the
graph to create placeholders let me skip making two passes over
the package list, pre-creating the package objects, etc etc.  This
made the code simpler and presumably faster.)
The graph API maintains an externally-visible "ready" iterable of
nodes.  And since B can go from ready to not-ready, it can get added
to "ready" and subsequently removed.
Also, when a node is satisfied, you simply remove it from the graph.
 If A depends on B and C, and they all represent jobs, and B and C
have no dependencies, they'll be in the "ready" iterable.  You iterate
over "ready", and execute those jobs, and assuming they are
successful you then remove them from the graph.  When A's
dependencies are all satisfied, it'll get added to the "ready" iterable.
 And naturally when B and C were removed from the graph, they were
removed from the "ready" iterable too.
OK!  You're doing a topological sort.  There are natural & simple ways
to do that right now that scale efficiently to very large graphs (time
& space linear in the sum of the number of nodes and the number of
dependencies).  Curiously, the difficulties are mostly driven by the
quality of _error_ messages you want (in case of, e.g., cycles in the
dependency graph).
Lots of those implementations became deterministic "by magic" when
ordered dicts were introduced.  This is why:  a bare bones
implementation needs to associate two pieces of info with each node:
a list of its immediate successors, and an integer count of the number
of its immediate predecessors.  A dict is the natural way to implement
that association.
When all the dependency info has been entered, then:
- First all nodes are emitted whose predecessor count is 0.  Iterating
over the association dict once is the natural way to find them, and
that order is defined now.
- Then, as each emitted node N is marked done:
      for child in N.successors:
          assert child.predcount > 0
          child.predcount -= 1
          if child.predcount == 0:
              emit(child)
That's about all there is to it.  But there's no end to the cruft that
_can_ be added to, e.g., verify that a node is marked done no more
than once, or to compute & display strongly connected components if
not all nodes are emitted, etc.
Ordered sets could improve this "natural" implementation if the
successor lists were successor sets instead, simply because then -
like lists - their iteration order would be defined, which can in turn
affect the order in which nodes are emitted in the loop just above.
But lists are adequate to the task, are cheaper in both time and
space, and can't _not_ be ordered ;-)
...
Thus it's this "ready" collection that I want to a) be iterable, b) be
stable, and
c) support fast removal.  I add and remove nodes from that set a lot,
which I
realized would perturb the iteration order from run to run, etc etc etc,
and here
we are.
The sketch above (which is a bog-common way to implement topsorts)
doesn't build a data structure recording the final order, and, in
fact, never deletes anything.  You might protest that the initial
iteration step (scanning the association dict to find nodes with no
predecessors) is an expense you skip because nodes with predecessors
are deleted from your "ready" structure all along.  But nodes with
predecessors are _touched_ either way, and merely skipping over them
is bound to be cheaper than deleting them.
After that initial step, no search of any kind is needed again.
...
I grant you maybe it's a weird approach.  But after two false starts,
where I
was starting to boil the oceans with ABCs and "node is satisfied" APis
and
separate iterator objects, I had a re-think and hit on this super
lightweight
design.  I'm actually quite pleased with it--it's short, it has a
minimal API,
it's readable, it was easy to get right, and it solves my problem.
Whereas I would have started with a topsort and finished it while I
was sleeping ;-)  Seriously, I've written such a thing many times, but
never reuse the code because it's so easy to redo from scratch.  Every
application has _some_ unique twist, which makes a general-purpose API
so bloated it's harder to figure out how to use than to write custom
code.
In your application, I'm guessing (but don't know) that when a package
name is emitted, it's possible you're not able to find the package,
and so the process has to stop.  That's why I inserted a "as each
emitted node N is marked done" step to my description.  That is, I'm
picturing an API that adds a (say)
topsorter.record_succeeded(node)
method to say that - yes - the node was processed successfully.  Else,
by default, its successors (if any) must not be emitted.
But if that isn't needed, the whole topsort order can be materialized
into (say) a list in one gulp, or delivered by a generator.
...
Happy new year,
And to you too! :-)
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at
https://mail.python.org/archives/list/python-dev@python.org/message/HRKQPTPQ...
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-Dev] Re: Should set objects maintain insertion order too?

Wes Turner