[Mailman-Developers] Threads and robustness against runner crashes

March 4, 2024

      Split thread #2.
Justus Winter writes:
...
Here are the things I did so far:

I have Mailman running with runners in threads instead of
processes, but that is in a proof-of-concept stage at this
point and needs some cleaning up.

After working with Mailman 3 and Postfix, I've become fond of
the HUPD (HUPD of Uncontrolled Proliferation of Daemons) model
of application design, at least for email.
My prototype lets you choose, for every kind of runner, whether to
use the process or thread model.
That's not a sales point, as far as I'm concerned.  It adds complexity
for the installer and the site manager, as well as in the code.
...
I don't quite buy (or maybe I'm not understanding the whole picture)
into the argument that having individual processes improves the
robustness of the whole system.
I'm talking about the developer/maintainer experience, not about run
time.
...
From my experience, having individual runners killed can render
Mailman unusable [0] (and to my then untrained eye it was
impossible to see that a runner was missing,
That's some combination of documentation, logging, and tooling bugs.
At the very least "mailman status" should report whether all the
runners that were started are still present (it doesn't at present).
It's really not hard to detect a crashed or stalled runner, even in a
sliced (multirunner) queue -- queuefiles start to pile up.  (By "not
hard" I mean you can use "ls" or "du", not that it should be obvious
what to do.)
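
A minimal sketch of that "ls"-style check in Python. It assumes a
layout with one subdirectory per queue under a common root. The
function name, the 300-second threshold, and the returned tuple are
illustrative choices, not anything Mailman provides.

```python
import os
import time


def queue_backlog(queue_root, max_age=300, now=None):
    """Return {queue_name: (entry_count, oldest_age_seconds)} for every
    queue whose oldest entry has been waiting longer than max_age.

    Assumption: each queue is a subdirectory of queue_root.
    """
    now = time.time() if now is None else now
    report = {}
    for name in sorted(os.listdir(queue_root)):
        qdir = os.path.join(queue_root, name)
        if not os.path.isdir(qdir):
            continue
        mtimes = []
        for entry in os.listdir(qdir):
            try:
                mtimes.append(os.path.getmtime(os.path.join(qdir, entry)))
            except OSError:
                # A live runner may remove the file between listdir and stat.
                continue
        if mtimes and now - min(mtimes) > max_age:
            report[name] = (len(mtimes), now - min(mtimes))
    return report
```

A queue that shows up in the report with a growing count and age is
a hint that its runner has crashed or stalled. It is not proof, since
a busy but healthy runner can also fall behind.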
...
if on the other hand Mailman had been a single process, or a
significantly smaller number of processes, a single missing process
would have been more apparent),
True, but to me crashes in a monolithic program are less acceptable,
especially threaded, because other concurrent operations may depend on
that program staying alive.  The way exception handling is done in
Mailman 2 with a big "except Exception" around the whole program, you
mostly would not get a crash at all, just a log message with a
traceback, probably unintelligible to a non-developer of Mailman.  Not
clear that's a win over the current situation for you.  Sure, you can
probably arrange for exception handling to be per-thread in some
sense, but that's going to be conceptually harder than the "log
the exception, let it crash, have the master restart it and pray"
approach we use in the multiprocess model.
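
For illustration, the "let it crash, have the master restart it" idea
can be sketched as below. This is not Mailman's master code. The
function name "supervise", the restart limit, and the rule that exit
status 0 means "finished normally" are all assumptions made for the
sketch.

```python
import logging
import subprocess

log = logging.getLogger("master")


def supervise(argv, max_restarts=3):
    """Run argv as a child process and restart it whenever it exits
    with a non-zero status.  Return the number of restarts performed.

    Assumption: exit status 0 means the runner stopped on purpose.
    """
    restarts = 0
    while True:
        status = subprocess.call(argv)
        if status == 0:
            return restarts
        # The child is expected to log its own traceback before dying;
        # the master only records that it went away.
        log.error("runner %r exited with status %s", argv, status)
        if restarts >= max_restarts:
            raise RuntimeError("runner keeps crashing; giving up")
        restarts += 1
```

The point of the design is that a crash in one child cannot corrupt
the state of the others, and the master needs no knowledge of what
went wrong inside the runner.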
...
and when a runner has picked up a mail from a queue, and then
crashes, that mail is lost forever (i.e. runner operations are not
atomic).
Please report such incidents in as much detail as you can.  The whole
point of "store and forward" is to prevent that.  Runners should not
alter the queuefile until they're done.  If they crash in the middle,
they should leave the queuefile they received and maybe a work file.
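
A rough sketch of that discipline, under stated assumptions: the
function name "deliver", the ".work" suffix, and the handler callback
are hypothetical, the source and destination directories are assumed
to be different, and this is not Mailman's actual queue code.

```python
import os


def deliver(path, dest_dir, handler):
    """Process the queuefile at path and publish the result in dest_dir.

    The original queuefile is read but never modified, and it is
    removed only after the result is safely in place.  A crash at any
    earlier point leaves the original queuefile (and possibly a
    leftover .work file) on disk.
    """
    name = os.path.basename(path)
    work = os.path.join(dest_dir, name + ".work")
    final = os.path.join(dest_dir, name)

    with open(path, "rb") as src:
        data = src.read()

    result = handler(data)  # may raise; the queuefile is untouched

    with open(work, "wb") as out:
        out.write(result)
        out.flush()
        os.fsync(out.fileno())

    os.replace(work, final)  # atomic rename within one filesystem
    os.remove(path)          # last step: the message now lives in dest_dir
    return final
```

If the process dies between the rename and the remove, the message
exists in both places, so the failure mode is a possible duplicate
rather than a lost message.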

Stephen J. Turnbull