Mailman 3 Introduction, FOSDEM, scaling down, latency, OpenPGP support - Mailman-Developers

Introduction, FOSDEM, scaling down, latency, OpenPGP support


Justus Winter

Jan. 21, 2024

1:08 p.m.

Hello everyone :)

I want to introduce myself: I'm Justus. I have worked on GnuPG in the past and am now working on Sequoia-PGP, and I'm running two Mailman installations in a resource-constrained shared hosting environment.

I'd like to contribute a little to Mailman, and I'd like to better understand how the Mailman project is doing nowadays. I have gotten a bugfix merged in the past, but I now have what I think is a fairly uncontroversial cleanup merge request that has neither been merged nor has it gotten comments.

https://gitlab.com/mailman/mailman/-/merge_requests/1158

As FOSDEM is around the corner, are any of you going to be there and are up for a chat?

Besides cleanups and bugfixes, there are three things I'd like to do:

Improve Mailman to better scale down to small installations
Improve latency of messages
Implement OpenPGP support

Here are the things I did so far:

https://gitlab.com/mailman/mailman/-/merge_requests/1094
https://gitlab.com/mailman/mailman/-/issues/1050
https://gitlab.com/mailman/mailman/-/merge_requests/1166
I have Mailman running with runners in threads instead of processes, but that is in a proof-of-concept stage at this point and needs some cleaning up

(I understand that Mailman is a GNU project that wants copyright assignments, and I have done that in the past for other GNU projects, and would be happy to do that for Mailman, but at the same time I feel like putting up *any* barrier to contributing is unfortunate.)

Best, Justus


Дилян Палаузов

February 2024

4:39 p.m.

Hello Justus,

I find it very good that you are trying to reduce the memory consumption of mailman 3. I cannot help in doing this. I hope you find some way to reduce the memory.

Do you happen to know whether the runner processes use identical memory, and whether, by calling https://man7.org/linux/man-pages/man2/madvise.2.html, the kernel can somehow detect these identical regions of memory and then use a single instance for all of them?

Greetings Дилян

-----Original Message-----
From: Justus Winter <justus@sequoia-pgp.org>
To: mailman-developers@python.org
Subject: [Mailman-Developers] Introduction, FOSDEM, scaling down, latency, OpenPGP support
Date: 01/21/2024 02:08:34 PM

Hello everyone :)

- https://gitlab.com/mailman/mailman/-/merge_requests/1158

As FOSDEM is around the corner, are any of you going to be there and are up for a chat?

Besides cleanups and bugfixes, there are three things I'd like to do:

- Improve Mailman to better scale down to small installations
- Improve latency of messages
- Implement OpenPGP support

Here are the things I did so far:

- https://gitlab.com/mailman/mailman/-/merge_requests/1094
- https://gitlab.com/mailman/mailman/-/issues/1050
- https://gitlab.com/mailman/mailman/-/merge_requests/1166
- I have Mailman running with runners in threads instead of processes, but that is in a proof-of-concept stage at this point and needs some cleaning up

Best, Justus

Stephen J. Turnbull

10:19 a.m.

Дилян Палаузов writes:

...

Do you happen to know whether the runner processes use identical memory, and whether, by calling https://man7.org/linux/man-pages/man2/madvise.2.html, the kernel can somehow detect these identical regions of memory and then use a single instance for all of them?

The memory size is a side effect of the way fork works (formally, it creates a new process by duplicating the old one; what "really" happens is up to the CPU's MMU and the kernel's memory management). I believe that as long as copy-on-write is enabled it only actually copies pages with changes. Use ps or top to check how much of the virtual memory is actually in RAM, and how much is shared.
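As a complement to ps or top, the following is a minimal sketch of how the shared and private portions of a process's resident memory could be read on Linux. It assumes a kernel that provides /proc/<pid>/smaps_rollup; the function names are made up for illustration and are not part of Mailman.

```python
from pathlib import Path


def parse_smaps_rollup(text):
    """Return {field: kilobytes} for lines shaped like 'Rss:   1234 kB'."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        # The first line of the file is an address-range header; skip it.
        if (len(parts) == 3 and parts[0].endswith(":")
                and parts[1].isdigit() and parts[2] == "kB"):
            fields[parts[0][:-1]] = int(parts[1])
    return fields


def shared_and_private_kb(fields):
    """Sum the shared and the private resident pages, in kilobytes."""
    shared = fields.get("Shared_Clean", 0) + fields.get("Shared_Dirty", 0)
    private = fields.get("Private_Clean", 0) + fields.get("Private_Dirty", 0)
    return shared, private


def memory_report(pid):
    """Read the kernel's per-process summary (Linux only)."""
    text = Path(f"/proc/{pid}/smaps_rollup").read_text()
    return shared_and_private_kb(parse_smaps_rollup(text))
```

Calling memory_report() on each runner's PID would show how much of each runner's footprint is actually shared with other processes.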

I'm pretty sure real memory usage was considered in the original design to use processes for each runner rather than threads.

Steve

Justus Winter

March 2024

4:58 p.m.

"Stephen J. Turnbull" <turnbull.stephen.fw@u.tsukuba.ac.jp> writes:

...

Дилян Палаузов writes:

...
Do you happen to know whether the runner processes use identical memory, and whether, by calling https://man7.org/linux/man-pages/man2/madvise.2.html, the kernel can somehow detect these identical regions of memory and then use a single instance for all of them?

The memory size is a side effect of the way fork works (formally, it creates a new process by duplicating the old one, what "really" happens is up to the CPU's MMU and the kernel's memory management). I believe that as long as copy-on-write is enabled it only actually copies pages with changes.

But currently Mailman3 does fork+exec, so it doesn't get to share the parent's pages. I experimented with fork-and-dont-exec [0], but the results were underwhelming, because reference counting can cause pages to diverge. Surprisingly, gc.freeze didn't seem to help much, so there may have been issues beyond the reference counts.

0: https://gitlab.com/mailman/mailman/-/merge_requests/1093

I think Python just doesn't support sharing code across processes well.
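For readers unfamiliar with the technique, here is a minimal sketch of fork-without-exec combined with gc.freeze, following the pattern described in the Python gc module documentation. It is not the code from the merge request above, and the function name is hypothetical.

```python
import gc
import os


def fork_with_frozen_heap(child_main):
    """Fork a child that runs child_main() without exec; return its exit code."""
    # Move every tracked object into the permanent generation so that later
    # garbage collections in the child do not write to their GC headers.
    gc.freeze()
    pid = os.fork()
    if pid == 0:
        # Child: keep running in the inherited address space, no exec.
        status = 1
        try:
            gc.enable()
            child_main()
            status = 0
        finally:
            # Skip interpreter teardown so parent state is not flushed twice.
            os._exit(status)
    # Parent: wait for the child and report how it ended.
    _, wait_status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(wait_status)
```

Note that freezing only keeps the collector away from inherited objects. Ordinary reference-count updates still write to object headers, which is consistent with the observation that pages diverge anyway.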

Best, Justus

Stephen J. Turnbull

4:07 p.m.

Justus Winter writes:

...

But currently Mailman3 does fork+exec, so it doesn't get to share the parent's pages. I experimented with fork-and-dont-exec [0], but the results were underwhelming, because reference counting can cause pages to diverge. Surprisingly, gc.freeze didn't seem to help much, so there may have been issues beyond the reference counts. [...] I think Python just doesn't support sharing code across processes well.

Seems likely. I know that Emacsen have always advised running just one process for this reason (also because users usually want all their recent hacks available in all buffers, but memory hogging is a big reason).

Stephen J. Turnbull

February 2024

11:50 a.m.

Hi Justus!

Justus Winter writes:

...

I'd like to contribute a little to Mailman, and I'd like to better understand how the Mailman project is doing nowadays. I have gotten a bugfix merged in the past, but I now have what I think is a fairly uncontroversial cleanup merge request that has neither been merged nor has it gotten comments.

I think Abhilash has been very busy with day job and some major changes at home (congratulations!), and I certainly have been busy with that. I hope to have more time for Mailman development starting in April, since I will be retiring from $DAYJOB on March 31. Mark has been good about routine requests, especially merging in contributions from translators. He and I have been mostly concentrating on list-based support (most of my work recently), and relatively urgent bug-fixes (Mark and Abhilash). There are a few other core devs, but all are currently inactive.

...

As FOSDEM is around the corner, are any of you going to be there and are up for a chat?

I will not be. Haven't heard anything from the rest of the crew, but we mostly meet at PyCon. This is kinda difficult this year though as my funding for travel from Japan is gone due to preretirement and Abhilash I believe is in India.

...

Besides cleanups and bugfixes, there are three things I'd like to do:

Improve Mailman to better scale down to small installations

Not sure what you can really do about that without rearchitecting. The full suite of daemons is something like 13, including 3 WSGI processes, the master daemon for mailman and about 7 or 8 runners.

But I'm pretty sure people have run Mailman 3 on a Raspberry Pi. How constrained an environment are you aiming for?

...

Improve latency of messages

What latency are you observing? My last project was getting about 100,000 incoming per day across 20K lists, two incoming runners, 8 outgoing, 1 each for the other Mailman runners. Never saw more than about 5 seconds dwell in the Mailman system, except when the Mailman to outgoing Postfix SMTP connection started glitching. We fixed that by reconfiguring the Mailman host (in Dallas) to use an MX in the same datacenter instead of one in Boston. (!!) And the normal case with a process where I'd do "ls queue/*" every 5s was completely empty queues. Stuff just didn't stay around long enough for ls to see it.

I see no reason to suppose you can do much better than that, but again, tell me what you're seeing. I'm not experienced in dealing with Mailman at scale, and that host was quite beefy. Still I have a strong feeling that latency is mostly a communication with MTA issue, not in Mailman 3 itself.

...

Implement OpenPGP support

What does that mean?

...

Here are the things I did so far:

I have Mailman running with runners in threads instead of processes, but that is in a proof-of-concept stage at this point and needs some cleaning up

I guess this is supposed to address the resource consumption (memory footprint?) issue?

After working with Mailman 3 and Postfix, I've become fond of the HUPD (HUPD of Uncontrolled Proliferation of Daemons) model of application design, at least for email. I feel *much* more comfortable messing with individual daemons this way, knowing that I can't affect the others. I'm not going to object to providing the threaded version if people want it, but I would object to wholesale conversion to that model without a lot of production experience based on it.

...

(I understand that Mailman is a GNU project that wants copyright assignments, and I have done that in the past for other GNU projects, and would be happy to do that for Mailman, but at the same time I feel like putting up *any* barrier to contributing is unfortunate.)

My experience has been that about 2/3 of resistance has been to any paperwork as such, only about 1/3 to assignment vs. some sort of formal license ("contributor agreement", as the PSF calls it).

As far as Mailman is concerned, a lot of the core code has been completely rewritten for Mailman 3. However, I know that in implementing Mailman 2 features not yet in Mailman 3 I've been at least heavily influenced by Mailman 2 code. Not sure that anybody else has been particularly careful about "clean implementations", although Barry has said that the core of Mailman 3 core is completely rewritten from scratch. In any case, the last time licensing was discussed, the founder (John Viega) was not on board with a separation from GNU and a permissive license, and Barry and I at least are pretty sentimental about that. For those reasons, I believe that at least this generation of Mailman core devs is unlikely to move in that direction.

I will take a look at the work you mention, but it will be a couple of weeks at least before I have useful comments.

Steve

Justus Winter

March 2024

5:36 p.m.

Hi :)

"Stephen J. Turnbull" <turnbull.stephen.fw@u.tsukuba.ac.jp> writes:

...

Hi Justus!

...
Besides cleanups and bugfixes, there are three things I'd like to do:

Improve Mailman to better scale down to small installations

Not sure what you can really do about that without rearchitecting. The full suite of daemons is something like 13, including 3 WSGI processes, the master daemon for mailman and about 7 or 8 runners.

But I'm pretty sure people have run Mailman 3 on a Raspberry Pi. How constrained an environment are you aiming for?

I had problems on my shared hoster that provided 1 gigabyte of RAM per user (I'm not 100% sure how they measure that). I first noticed the problem because every now and then the OOM killer would kill a Mailman runner process, and because of a bug in the master process [0] it wasn't restarted, resulting in stalled mail processing with no indication, which was quite frustrating.

0: https://gitlab.com/mailman/mailman/-/merge_requests/1094

And, while I fixed the reliability issue, seeing my small installations (I'd be surprised if we see more than 1 message per day on average) consume so much memory was frustrating [1].

1: https://gitlab.com/mailman/mailman/-/issues/1050

...

...

Improve latency of messages

What latency are you observing? My last project was getting about 100,000 incoming per day across 20K lists, two incoming runners, 8 outgoing, 1 each for the other Mailman runners. Never saw more than about 5 seconds dwell in the Mailman system, except when the Mailman to outgoing Postfix SMTP connection started glitching. We fixed that by reconfiguring the Mailman host (in Dallas) to use an MX in the same datacenter instead of one in Boston. (!!) And the normal case with a process where I'd do "ls queue/*" every 5s was completely empty queues. Stuff just didn't stay around long enough for ls to see it.

I see no reason to suppose you can do much better than that, but again, tell me what you're seeing. I'm not experienced in dealing with Mailman at scale, and that host was quite beefy. Still I have a strong feeling that latency is mostly a communication with MTA issue, not in Mailman 3 itself.

The latency may be currently small, in absolute terms, but this comes at a considerable cost: the runners are polling their queues in loops. My installations that hardly see any traffic at all are all doing: do I have work, no, sleep 1, do I have work, no, sleep 1... I can see that this will amortize in big installations, but for small ones this is quite sad.

And even for big installations, or if we say that efficiency is not important, if a mail goes through the hands of three queue runners, the worst-case latency is three seconds in an otherwise idle installation! We can definitely improve upon that.

The key insight here is that emails in queues don't appear out of thin air, another runner is putting them there. If each runner that goes to sleep does so by waiting on a condition variable associated with its queue, and every runner that deposits a mail into the queue signals the sleeping runners, that latency goes away while at the same time improving efficiency by no longer having to poll the queue every second.
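The idea can be sketched with the standard library's threading.Condition. This is an illustration only and not the prototype's code: the class name is made up, an in-memory deque stands in for the on-disk queue directory, and the approach as written assumes the runners are threads in one process (signalling across processes would need a different mechanism).

```python
import threading
from collections import deque


class SignalingQueue:
    """A queue whose consumers sleep until a producer signals them."""

    def __init__(self):
        self._items = deque()
        self._cond = threading.Condition()

    def put(self, item):
        """Deposit an item and wake one sleeping consumer."""
        with self._cond:
            self._items.append(item)
            self._cond.notify()

    def get(self, timeout=None):
        """Block until an item is available; return None on timeout."""
        with self._cond:
            # wait_for re-checks the predicate after every wakeup, so
            # spurious wakeups and races between consumers are harmless.
            if not self._cond.wait_for(lambda: len(self._items) > 0,
                                       timeout=timeout):
                return None
            return self._items.popleft()
```

A consumer calling get() returns as soon as put() is called, rather than after the remainder of a one-second sleep, and consumes no CPU while the queue is empty.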

...

...

Implement OpenPGP support

What does that mean?

OpenPGP can be used to provide confidentiality and integrity for email. What exactly that means in the setting of mailing lists varies by threat model and policy. My prototype [2] simply records associations between addresses and OpenPGP certificates by consuming Autocrypt headers [3] and, when sending an outgoing mail, opportunistically encrypts it if a certificate is known. Details and future work in [2].

2: https://gitlab.com/mailman/mailman/-/merge_requests/1166
3: https://autocrypt.org
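To make the mechanism concrete, here is a minimal sketch of consuming an Autocrypt header and deciding whether a recipient could be sent encrypted mail. It is not the code from [2]: the function names and the plain dict used as a certificate store are assumptions, the certificate bytes are not validated, and the actual encryption step is left out.

```python
import base64
from email import message_from_string
from email.utils import parseaddr


def parse_autocrypt(header_value):
    """Return (addr, keydata_bytes) from an Autocrypt header, or None."""
    attrs = {}
    for part in header_value.split(";"):
        # Split at the first '=' only, so base64 padding stays in the value.
        name, sep, value = part.strip().partition("=")
        if sep:
            attrs[name.strip().lower()] = value.strip()
    if "addr" not in attrs or "keydata" not in attrs:
        return None
    # The keydata may be folded over several lines; drop the whitespace.
    keydata = base64.b64decode("".join(attrs["keydata"].split()))
    return attrs["addr"].lower(), keydata


def record_certificate(store, raw_message):
    """Remember the sender's certificate if the header matches the From."""
    msg = message_from_string(raw_message)
    value = msg.get("Autocrypt")
    if value is None:
        return False
    parsed = parse_autocrypt(str(value))
    if parsed is None:
        return False
    addr, keydata = parsed
    _, sender = parseaddr(msg.get("From", ""))
    if sender.lower() != addr:
        return False  # header does not describe the sender; ignore it
    store[addr] = keydata
    return True


def should_encrypt(store, recipient):
    """Opportunistic policy: encrypt only if a certificate is known."""
    return recipient.lower() in store
```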

...

...
Here are the things I did so far:

I have Mailman running with runners in threads instead of processes, but that is in a proof-of-concept stage at this point and needs some cleaning up

I guess this is supposed to address the resource consumption (memory footprint?) issue?

Yes.

...

After working with Mailman 3 and Postfix, I've become fond of the HUPD (HUPD of Uncontrolled Proliferation of Daemons) model of application design, at least for email. I feel *much* more comfortable messing with individual daemons this way, knowing that I can't affect the others. I'm not going to object to providing the threaded version if people want it, but I would object to wholesale conversion to that model without a lot of production experience based on it.

My prototype lets you choose, for every kind of runner, whether to use the process or thread model, so it is actually a continuum between the current model, and using threads for all runners (with the exception of the REST runner, because gunicorn doesn't like to be run in the non-main thread).

I don't quite buy (or maybe I'm not understanding the whole picture) into the argument that having individual processes improves the robustness of the whole system.

From my experience, having individual runners killed can render Mailman unusable [0] (and to my then untrained eye it was impossible to see that a runner was missing, if on the other hand Mailman would have been a single process, or a significantly smaller number of processes, a single missing process would have been more apparent), and when a runner has picked up a mail from a queue, and then crashes, that mail is lost forever (i.e. runner operations are not atomic).

...

...
(I understand that Mailman is a GNU project that wants copyright assignments, and I have done that in the past for other GNU projects, and would be happy to do that for Mailman, but at the same time I feel like putting up *any* barrier to contributing is unfortunate.)

My experience has been that about 2/3 of resistance has been to any paperwork as such, only about 1/3 to assignment vs. some sort of formal license ("contributor agreement", as the PSF calls it).

As far as Mailman is concerned, a lot of the core code has been completely rewritten for Mailman 3. However, I know that in implementing Mailman 2 features not yet in Mailman 3 I've been at least heavily influenced by Mailman 2 code. Not sure that anybody else has been particularly careful about "clean implementations", although Barry has said that the core of Mailman 3 core is completely rewritten from scratch. In any case, the last time licensing was discussed, the founder (John Viega) was not on board with a separation from GNU and a permissive license, and Barry and I at least are pretty sentimental about that. For those reasons, I believe that at least this generation of Mailman core devs is unlikely to move in that direction.

I have no issue with the license, and I don't want to open a can of worms. I merely observed little activity and was concerned about the project dying, and wanted to mention that reducing barriers to contributions may be a way to attract more developers and drive-by contributions.

...

I will take a look at the work you mention, but it will be a couple of weeks at least before I have useful comments.

Cool, thanks!

Best, Justus

Stephen J. Turnbull

3:19 p.m.

New subject: Optimization for constrained environments and latency

I'm going to split this into three separate threads with appropriate titles. This is #1.

Regarding optimization for constrained environments, I wrote:

...

...
But I'm pretty sure people have run Mailman 3 on a Raspberry Pi. How constrained an environment are you aiming for?

Justus Winter writes:

...

I had problems on my shared hoster that provided 1 gigabyte of RAM per user (I'm not 100% sure how they measure that).

OK, yes, my estimate says that's going to be too tight. I'm seeing about 80MB per runner, and a full complement of processes, without any slicing, is 18. Some of that is shared (IIRC about 5MB/runner according to top), but that only buys back 1 runner's worth. There are several somewhat optional processes (the nntp, archive, command, and 2 WSGI processes for Mailman Web) but that's probably still not quite going to fit into 1GB.

Re polling queues:

...

the runners are polling their queues in loops. My installations that hardly see any traffic at all are all doing: do I have work, no, sleep 1, do I have work, no, sleep 1... I can see that this will amortize in big installations, but for small ones this is quite sad.

I guess, but if it doesn't show in the load average, I'm not sure why one should care. I don't know about your installation, but Mailman consumes less than 1% of CPU when idle as far as I can tell. For me to support a change here, either you'd need to show a non-negligible improvement or it would have to be "free" (see below).

...

...
...

Improve latency of messages

What latency are you observing?

...

And even for big installations, or if we say that efficiency is not important, if a mail goes through the hands of three queue runners, the worst-case latency is three seconds in an otherwise idle installation! We can definitely improve upon that.

Who will notice? Is there anybody who cares about a 3s latency in list email? If there is, that would be a user-visible improvement to set against any increase in code complexity.

...

The key insight here is that emails in queues don't appear out of thin air, another runner is putting them there. If each runner that goes to sleep does so by waiting on a condition variable associated with its queue, and every runner that deposits a mail into the queue signals the sleeping runners, that latency goes away while at the same time improving efficiency by no longer having to poll the queue every second.

Thing is, email (and Mailman specifically) operates on a store-and-forward model. The queue file *must* be present for a runner to do its work, and conversely, if a file is present the runner has work to do. Polling is a little ugly, but it's a perfect fit for the problem semantically, and very simple to explain and implement.

If the condition-variable-based code is equally simple, equally reliable, and identical across our supported platforms, sure, that's worth looking at because we get your developer-visible efficiency enhancements for "free". But if any of those requirements fails, I would want to see an improvement in user-visible performance.

...

...
I will take a look at the work you mention, but it will be a couple of weeks at least before I have useful comments.

Still need some time for this, but I wanted to get some stuff out of my inbox. :-)

Stephen J. Turnbull

4 p.m.

New subject: Threads and robustness against runner crashes

Split thread #2.

Justus Winter writes:

...

...
...
Here are the things I did so far:

I have Mailman running with runners in threads instead of processes, but that is in a proof-of-concept stage at this point and needs some cleaning up

After working with Mailman 3 and Postfix, I've become fond of the HUPD (HUPD of Uncontrolled Proliferation of Daemons) model of application design, at least for email.

My prototype lets you choose, for every kind of runner, whether to use the process or thread model

That's not a sales point, as far as I'm concerned. It adds complexity for the installer and the site manager, as well as in the code.

...

I don't quite buy (or maybe I'm not understanding the whole picture) into the argument that having individual processes improves the robustness of the whole system.

I'm talking about the developer/maintainer experience, not about run time.

...

From my experience, having individual runners killed can render Mailman unusable [0] (and to my then untrained eye it was impossible to see that a runner was missing,

That's some combination of documentation, logging, and tooling bugs. At the very least "mailman status" should report whether all the runners that were started are still present (it doesn't at present).

It's really not hard to detect a crashed or stalled runner, even in a sliced (multirunner) queue -- queuefiles start to pile up. (By "not hard" I mean you can use "ls" or "du", not that it should be obvious what to do.)
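The "ls" check described above could be automated along these lines. This is a sketch under stated assumptions: it presumes one subdirectory per queue under a common root, the function name and the 60-second threshold are made up for illustration, and it is not an existing Mailman command.

```python
import os
import time


def stale_entries(queue_root, max_age_seconds=60.0, now=None):
    """Map queue name -> number of files older than max_age_seconds."""
    now = time.time() if now is None else now
    report = {}
    for queue in sorted(os.listdir(queue_root)):
        path = os.path.join(queue_root, queue)
        if not os.path.isdir(path):
            continue
        count = 0
        for entry in os.scandir(path):
            # A file that has sat in the queue for a long time suggests
            # that the runner responsible for it has crashed or stalled.
            if entry.is_file() and now - entry.stat().st_mtime > max_age_seconds:
                count += 1
        if count:
            report[queue] = count
    return report
```

An empty result means no queue holds old entries; a non-empty one names the queues whose runners deserve a closer look.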

...

if on the other hand Mailman would have been a single process, or a significantly smaller number of processes, a single missing process would have been more apparent),

True, but to me crashes in a monolithic program are less acceptable, especially threaded, because other concurrent operations may depend on that program staying alive. The way exception handling is done in Mailman 2 with a big "except Exception" around the whole program, you mostly would not get a crash at all, just a log message with a traceback, probably unintelligible to a non-developer of Mailman. Not clear that's a win over the current situation for you. Sure, you can probably arrange for exception handling to be per-thread in some sense, but that's going to be conceptually harder than the "log the exception, let it crash, have the master restart it and pray" approach we use in the multiprocess model.
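As an illustration of the "let it crash, have the master restart it" approach, here is a minimal supervisor sketch. It is not Mailman's master process: the function name and the restart limit are assumptions, and a real master would watch several runners at once rather than one.

```python
import subprocess
import sys


def supervise(argv, max_restarts=3):
    """Run argv, restarting it after abnormal exits; return number of starts."""
    starts = 0
    while True:
        starts += 1
        returncode = subprocess.run(argv).returncode
        if returncode == 0:
            return starts  # clean exit, nothing more to do
        # A negative return code means the child was killed by a signal
        # (for example by the OOM killer); treat it like any other crash.
        print(f"runner exited with {returncode}, start #{starts}",
              file=sys.stderr)
        if starts > max_restarts:
            raise RuntimeError("runner keeps crashing, giving up")
```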

...

and when a runner has picked up a mail from a queue, and then crashes, that mail is lost forever (i.e. runner operations are not atomic).

Please report such incidents in as much detail as you can. The whole point of "store and forward" is to prevent that. Runners should not alter the queuefile until they're done. If they crash in the middle, they should leave the queuefile they received and maybe a work file.

Mark Sapiro

5:13 p.m.

New subject: Threads and robustness against runner crashes

On 3/4/24 8:00 AM, Stephen J. Turnbull wrote:

...

Split thread #2.

Justus Winter writes:

...
and when a runner has picked up a mail from a queue, and then crashes, that mail is lost forever (i.e. runner operations are not atomic).

Please report such incidents in as much detail as you can. The whole point of "store and forward" is to prevent that. Runners should not alter the queuefile until they're done. If they crash in the middle, they should leave the queuefile they received and maybe a work file.

The actual process of picking up a queue entry[1] atomically renames the queue file from .pck to .bak so until the runner finishes processing the file and removes the .bak, there is always a .pck or .bak file in the queue. If the runner dies for any reason in processing, whether because of a crash or external event, upon restart it will process the .bak file(s) so messages are never lost for this reason.

[1] https://gitlab.com/mailman/mailman/-/blob/master/src/mailman/core/switchboar...
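The rename protocol described above can be sketched as follows. This is a simplified illustration and not the switchboard code linked in [1]: the function names are hypothetical, and recovery here simply renames leftover .bak files back to .pck so that they are picked up again.

```python
import os


def dequeue(queue_dir):
    """Claim the first .pck file by renaming it to .bak; return the new path."""
    for name in sorted(os.listdir(queue_dir)):
        if name.endswith(".pck"):
            src = os.path.join(queue_dir, name)
            dst = src[:-len(".pck")] + ".bak"
            # rename is atomic on POSIX within one filesystem, so the entry
            # exists as exactly one of .pck or .bak at every moment.
            os.rename(src, dst)
            return dst
    return None


def finish(bak_path):
    """Processing succeeded: only now is the entry removed for good."""
    os.remove(bak_path)


def recover(queue_dir):
    """After a restart, requeue entries whose processing never finished."""
    recovered = 0
    for name in sorted(os.listdir(queue_dir)):
        if name.endswith(".bak"):
            src = os.path.join(queue_dir, name)
            os.rename(src, src[:-len(".bak")] + ".pck")
            recovered += 1
    return recovered
```

If the process dies between dequeue() and finish(), the .bak file is still on disk, so a later recover() finds it and the message is processed again instead of being lost.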

-- 
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan

Stephen J. Turnbull

4:01 p.m.

New subject: OpenPGP support

This is split thread #3.

Justus Winter writes:

...

...
...

Implement OpenPGP support

What does that mean?

OpenPGP can be used to provide confidentiality and integrity for email. What exactly that means in the setting of mailing lists varies by threat model and policy.

I was afraid you'd say that. I mean, it's the right generic answer, but I've yet to see a viable use case with a plausible threat model for any of the implementations proposed.

...

My prototype [2] simply records associations between addresses and OpenPGP certificates by consuming Autocrypt headers [3] and, when sending an outgoing mail, opportunistically encrypts it if a certificate is known.

Except for the Autocrypt part, this has been done. But there are two problems. First, nobody wants it very badly (see this post specifically <https://mail.python.org/archives/list/mailman-users@python.org/message/STX76...>; the surrounding thread is also valuable because you'll see all the reasons why I don't want to do this in Mailman at present, and you're the first person in decades I think has a good shot at convincing me otherwise! :-)). The second problem is I don't see a convincing use case. Note: I don't consider the opportunistic encryption aspect a serious flaw. Obviously this initial proposal is mostly a proof-of-concept and most (all?) serious applications simply wouldn't send unencrypted mail.

Steve
