Split thread #2.
Justus Winter writes:
Here are the things I did so far:
- I have Mailman running with runners in threads instead of processes, but that is in a proof-of-concept stage at this point and needs some cleaning up
After working with Mailman 3 and Postfix, I've become fond of the HUPD (HUPD of Uncontrolled Proliferation of Daemons) model of application design, at least for email.
My prototype lets you choose, for every kind of runner, whether to use the process or thread model
That's not a sales point, as far as I'm concerned. It adds complexity for the installer and the site manager, as well as in the code.
I don't quite buy (or maybe I'm not understanding the whole picture) into the argument that having individual processes improves the robustness of the whole system.
I'm talking about the developer/maintainer experience, not about run time.
From my experience, having individual runners killed can render Mailman unusable [0] (and to my then untrained eye it was impossible to see that a runner was missing,
That's some combination of documentation, logging, and tooling bugs. At the very least "mailman status" should report whether all the runners that were started are still present (it doesn't at present).
It's really not hard to detect a crashed or stalled runner, even in a sliced (multirunner) queue -- queuefiles start to pile up. (By "not hard" I mean you can use "ls" or "du", not that it should be obvious what to do.)
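To make the "queuefiles start to pile up" check concrete, here is a minimal sketch of what a tool could do instead of eyeballing "ls" output. It assumes one subdirectory per queue under a common root; the function name queue_backlog, the max_age_seconds threshold, and the directory layout are illustrative assumptions, not Mailman's actual API or on-disk format.

```python
import os
import time


def queue_backlog(queue_root, max_age_seconds=300, now=None):
    """Return {queue_name: (file_count, stale_count)} per queue directory.

    Hypothetical helper: a file counts as "stale" when its mtime is older
    than max_age_seconds, which hints that no runner is draining the queue.
    """
    now = time.time() if now is None else now
    report = {}
    for entry in sorted(os.scandir(queue_root), key=lambda e: e.name):
        if not entry.is_dir():
            continue
        count = stale = 0
        for item in os.scandir(entry.path):
            if not item.is_file():
                continue
            count += 1
            if now - item.stat().st_mtime > max_age_seconds:
                stale += 1
        report[entry.name] = (count, stale)
    return report
```

A queue whose stale count keeps growing between two calls is a likely sign of a crashed or stalled runner. What threshold is sensible depends on the site's traffic.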
if on the other hand Mailman would have been a single process, or a significantly smaller number of processes, a single missing process would have been more apparent),
True, but to me crashes in a monolithic program are less acceptable, especially a threaded one, because other concurrent operations may depend on that program staying alive. The way exception handling is done in Mailman 2, with a big "except Exception" around the whole program, you mostly would not get a crash at all, just a log message with a traceback, probably unintelligible to a non-developer of Mailman. Not clear that's a win over the current situation for you. Sure, you can probably arrange for exception handling to be per-thread in some sense, but that's going to be conceptually harder than the "log the exception, let it crash, have the master restart it and pray" approach we use in the multiprocess model.
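For illustration, the "let it crash, have the master restart it" idea can be sketched in a few lines. This is not Mailman's master code; the supervise function and its max_restarts limit are assumptions made for the example, and the child is any command that exits non-zero when it crashes.

```python
import subprocess
import sys


def supervise(argv, max_restarts=3):
    """Run argv, restarting it after each non-zero exit.

    Hypothetical sketch: returns the list of exit codes seen. It stops
    when the child exits cleanly (code 0) or after max_restarts restarts.
    """
    exit_codes = []
    while True:
        code = subprocess.call(argv)
        exit_codes.append(code)
        if code == 0:
            return exit_codes
        if len(exit_codes) > max_restarts:
            # Initial run plus max_restarts restarts have all failed.
            return exit_codes
        print(f"runner exited with {code}, restarting", file=sys.stderr)
```

The child needs no special error handling beyond logging: an uncaught exception ends that one process, and nothing else shares its address space.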
and when a runner has picked up a mail from a queue, and then crashes, that mail is lost forever (i.e. runner operations are not atomic).
Please report such incidents in as much detail as you can. The whole point of "store and forward" is to prevent that. Runners should not alter the queuefile until they're done. If they crash in the middle, they should leave the queuefile they received and maybe a work file.
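The rule "don't alter the queuefile until you're done" could look roughly like this. It is a simplified sketch, not how Mailman's runners are actually written; process_queue_file and handler are hypothetical names.

```python
import os


def process_queue_file(path, handler):
    """Handle one queue file, deleting it only after handler succeeds.

    Hypothetical sketch: if handler raises (or the process dies) the
    file is still on disk and can be picked up again after a restart.
    """
    with open(path, "rb") as fh:
        data = fh.read()
    handler(data)    # may raise; the queue file has not been touched
    os.remove(path)  # reached only when handling completed normally
```

With this ordering a crash can cause a message to be processed twice, but it should not cause one to vanish.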