Hello,
Occasionally my mailman instance (2.1.9) gets into a weird state where one or more of its OutgoingRunner processes appears to hang (usually on a large email with a large number of recipients), causing a backlog of all other mail on that process's "shard" (or whatever the terminology is for how mailman divides up mail between runners based on hash). When it gets into this state, doing a mailman restart doesn't manage to successfully kill the "hung" process - it stays around after the restart (along with the mailmanctl instance that started it). Doing a tcpdump on the process usually shows that it's still sending data, but at a trickle (or sometimes not). Any ideas what could cause this, or how to resolve it?
Kevin Bowen kevin.t.bowen@gmail.com kevin@ucsd.edu
On 11/14/19 4:05 PM, Kevin Bowen wrote:
Hello, Occasionally my mailman instance (2.1.9) gets into a weird state where one or more of its OutgoingRunner processes appears to hang (usually on a large email with a large number of recipients), causing a backlog of all other mail on that process's "shard" (or whatever the terminology is for how mailman divides up mail between runners based on hash).
FYI, "slice" is the term we use.
When it gets into this state, doing a mailman restart doesn't manage to successfully kill the "hung" process - it stays around after the restart (along with the mailmanctl instance that started it). Doing a tcpdump on the process usually shows that it's still sending data, but at a trickle (or sometimes not). Any ideas what could cause this, or how to resolve it?
OutgoingRunner is delivering the message it's working on to the recipient list. If the process is still actually delivering to the outgoing MTA, but slowly, this is an issue between Mailman and the MTA.
One thing you can do is set up a separate port in the MTA for delivery only from Mailman and do little or no checking on that port. For example with Postfix, this is what we have in master.cf on mail.python.org
    # This is where mailman is injecting to (no filtering!)
    127.0.0.1:8027 inet n - - - - smtpd
        -o smtpd_authorized_xforward_hosts=127.0.0.0/8
        -o mynetworks=127.0.0.0/8
        -o smtpd_recipient_restrictions=permit_mynetworks,reject
        -o smtpd_client_restrictions=
        -o smtpd_helo_restrictions=
        -o smtpd_sender_restrictions=
        -o smtpd_data_restrictions=
    #   -o smtpd_milters=inet:127.0.0.1:11332
        -o smtpd_milters=inet:127.0.0.1:8891
    # inet:127.0.0.1:8891 == opendkim
    # inet:127.0.0.1:11332 == rspamd
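On the Mailman side, pointing delivery at such a dedicated port should only take two settings in mm_cfg.py. A minimal sketch, assuming the MTA runs on the same host as Mailman and listens on port 8027 as in the master.cf entry above; SMTPHOST and SMTPPORT are the Mailman 2.1 settings for the outgoing MTA, and the values are illustrative:

```python
# mm_cfg.py fragment (illustrative values).
# Assumes a listener on 127.0.0.1:8027 like the master.cf entry above.
SMTPHOST = '127.0.0.1'
SMTPPORT = 8027
```

Mailman needs a restart for changes in mm_cfg.py to take effect.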
Some other hints can be found by searching the FAQ at https://wiki.list.org/ for 'performance'
--
Mark Sapiro mark@msapiro.net
San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan
One thing you can do is set up a separate port in the MTA for delivery
Unfortunately we nowadays use a hosted MTA solution, so I'm not in control of it.
If the process is still actually delivering to the outgoing MTA, but slowly, this is an issue between Mailman and the MTA.
Sometimes the process appears to still be delivering, but VERY slowly; other times it still has an open TCP connection but with no data appearing to be sent over it; other times it seems the connection has actually died (but the process still lives). I don't doubt that the MTA is to blame somehow, but I'm not sure how to go about recovering from it. When it gets into this state, often the only way I'm able to get mail flowing again is to shut down mailman, remove the .bak file from the out spool, and restart mailman, but this means I'm losing mail, correct?
Kevin Bowen kevin.t.bowen@gmail.com kevin@ucsd.edu
Mailman-Users mailing list Mailman-Users@python.org https://mail.python.org/mailman/listinfo/mailman-users Mailman FAQ: http://wiki.list.org/x/AgA3 Security Policy: http://wiki.list.org/x/QIA9 Searchable Archives: http://www.mail-archive.com/mailman-users%40python.org/ Unsubscribe: https://mail.python.org/mailman/options/mailman-users/kevin.t.bowen%40gmail....
On 11/14/19 5:51 PM, Kevin Bowen wrote:
If the process is still actually delivering to the outgoing MTA, but slowly, this is an issue between Mailman and the MTA. Sometimes the process appears to still be delivering, but VERY slowly, other times it still has an open TCP connection but with no data appearing to be sent over it, other times it seems the connection has actually died (but the process still lives). I don't doubt that the MTA is to blame somehow, but I'm not sure how to go about recovering from it.
Almost always, these delays are due to lack of response from the MTA. I.e., OutgoingRunner is waiting for a reply which has not been sent or has somehow been lost. If the connection to the MTA is actually dropped, OutgoingRunner *should* catch this.
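The "waiting for a reply which has not been sent" case can be illustrated with plain sockets. This is not Mailman's code, only a sketch of the general failure mode: a client that has sent a command and then blocks reading a reply that never arrives. The socket pair stands in for the Mailman-to-MTA connection, and the short timeout is there only so the example terminates.

```python
import socket

# A connected pair of sockets stands in for Mailman (client) and the MTA (server).
client, server = socket.socketpair()

client.sendall(b"RCPT TO:<user@example.com>\r\n")
server.recv(1024)          # the "MTA" reads the command but never answers

client.settimeout(0.2)     # without a timeout, recv() below would block indefinitely
try:
    client.recv(1024)
    outcome = "got a reply"
except socket.timeout:
    outcome = "no reply: still waiting"
finally:
    client.close()
    server.close()

print(outcome)
```

With no timeout set, the process would sit in recv() with the TCP connection still open, which matches the "open connection but no data" symptom described above.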
When it gets into this state often the only way I'm able to get mail flowing again is to shut down mailman, remove the .bak file from the out spool, and restart mailman, but this means I'm losing mail, correct?
Yes. You have two choices. Removing the .bak file means any recipients not already delivered to the MTA will be lost. If you don't remove the .bak file, it will be recovered and reprocessed when the runner is restarted. In this case, any recipients that were delivered previously will get duplicates. Also, if the issue is somehow due to the message, it will probably recur upon reprocessing.
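The trade-off between the two choices can be put in numbers. A simplified sketch, assuming recipients are handed to the MTA in list order and that `delivered_count` of them had gone through before the hang; the function name and addresses are made up for illustration:

```python
def restart_outcomes(recipients, delivered_count):
    """Return (lost_if_bak_removed, duplicated_if_bak_kept)."""
    delivered = recipients[:delivered_count]
    pending = recipients[delivered_count:]
    # Removing the .bak file: the pending recipients never get the message.
    lost = pending
    # Keeping the .bak file: the whole entry is reprocessed, so the
    # recipients that were already delivered to get the message twice.
    duplicated = delivered
    return lost, duplicated

recipients = ["user%d@example.com" % i for i in range(10)]
lost, duplicated = restart_outcomes(recipients, 4)
print(len(lost), len(duplicated))  # prints: 6 4
```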
One thing you might want to try is setting
SMTPLIB_DEBUG_LEVEL = 1
in mm_cfg.py. This requires Python >= 2.4 (I hope by now everyone is using 2.7) and will produce copious logging of all outgoing SMTP transactions in Mailman's error log. This may help to understand the underlying issue.
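For reference, the standard library's smtplib has a per-connection debug switch, and the setting presumably ends up there. A sketch assuming the value is simply passed through to `set_debuglevel`; the variable below is a stand-in for the value read from mm_cfg.py, not Mailman's own code:

```python
import smtplib

# Hypothetical stand-in for the value read from mm_cfg.py.
SMTPLIB_DEBUG_LEVEL = 1

conn = smtplib.SMTP()                      # no host given, so nothing is connected yet
conn.set_debuglevel(SMTPLIB_DEBUG_LEVEL)
# From here on, conn.connect(...) and conn.sendmail(...) would print every
# command sent and every reply received to stderr.
```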
--
Mark Sapiro mark@msapiro.net
San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan
If you don't remove the .bak file, it will be recovered and reprocessed when the runner is restarted. In this case, any recipients that were delivered previously will get duplicates.
Question: say there's a transaction in progress delivering a mail with 10,000 recipients, and you have SMTP_MAX_RCPTS set to say 100. If you restart mailman in the middle of it (leaving the .bak file in place), will it restart the entire transaction, re-sending to all 10,000 recipients, or just the 100-recipient chunk it was working on at the time of the restart?
Also, in the performance tuning doc, it says that smaller settings for SMTP_MAX_RCPTS are more performant (I believe it recommended 10), but if you're sending a mail with a large attachment, doesn't a smaller value here necessitate repeating the data segment of the mail more times?
Kevin Bowen kevin.t.bowen@gmail.com kevin@ucsd.edu
On 11/15/19 11:34 AM, Kevin Bowen wrote:
If you don't remove the .bak file, it will be recovered and reprocessed when the runner is restarted. In this case, any recipients that were delivered previously will get duplicates.
Question: say there's a transaction in progress delivering a mail with 10,000 recipients, and you have SMTP_MAX_RCPTS set to say 100. If you restart mailman in the middle of it (leaving the .bak file in place), will it restart the entire transaction, re-sending to all 10,000 recipients, or just the 100-recipient chunk it was working on at the time of the restart?
All 10,000. The .bak file is the original queue entry, renamed to .bak for recovery at the start of processing by OutgoingRunner. Recovery will increment a count in the metadata so that entries that cause the runner to crash won't be retried forever, and then just requeue the original with all 10,000 recipients.
This process knows nothing about the actual delivery details such as VERP, personalization and chunking (SMTP_MAX_RCPTS).
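The recovery step described above can be modeled roughly as follows. This is a toy model, not Mailman's implementation: the entry is a plain dict, `MAX_RETRIES` and the `retries` key are hypothetical names, and what happens to an entry once the limit is exceeded is not modeled.

```python
MAX_RETRIES = 3  # hypothetical limit; not Mailman's actual setting name or value

def recover(bak_entry):
    """Model recovery of a .bak entry: bump the retry count, then requeue it whole."""
    entry = dict(bak_entry)
    entry["retries"] = entry.get("retries", 0) + 1
    if entry["retries"] > MAX_RETRIES:
        return None  # give up instead of retrying forever
    # The recipient list is untouched: recovery knows nothing about chunking.
    return entry

entry = {"recipients": ["user%d@example.com" % i for i in range(10000)]}
requeued = recover(entry)
print(len(requeued["recipients"]), requeued["retries"])  # prints: 10000 1
```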
Also, in the performance tuning doc, it says that smaller settings for SMTP_MAX_RCPTS are more performant (I believe it recommended 10), but if you're sending a mail with a large attachment, doesn't a smaller value here necessitate repeating the data segment of the mail more times?
Yes. The tuning tips were written some time ago when a typical discussion list post was a relatively small amount of text with no attachments. However, if you consider the entire process including the MTA (I realize in your case the MTA is remote and doesn't contribute to load on your server), even if you send one large chunk with 100 recipients, the MTA is probably going to have to deliver that to many MXs and copy the data to each one.
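To put rough numbers on the data repetition between Mailman and the MTA: each chunk is a separate SMTP transaction, so the message body is sent once per chunk. A back-of-the-envelope sketch, assuming a 5 MiB message and no VERP or personalization; the function name and sizes are illustrative:

```python
import math

def data_bytes_sent(num_recipients, max_rcpts, message_bytes):
    """Return (chunks, total bytes of message data sent to the MTA)."""
    chunks = math.ceil(num_recipients / max_rcpts)
    return chunks, chunks * message_bytes

for max_rcpts in (10, 100, 500):
    chunks, total = data_bytes_sent(10000, max_rcpts, 5 * 1024 * 1024)
    print(max_rcpts, chunks, total // (1024 * 1024), "MiB")
# prints:
# 10 1000 5000 MiB
# 100 100 500 MiB
# 500 20 100 MiB
```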
--
Mark Sapiro mark@msapiro.net
San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan