[Mailman-Developers] BounceRunner optimization and problems with VERY LARGE lists
Hi, I'm new to the list and have been using Mailman 2.1 for about a month on a single list of 138,000 subscribers (it's a legit opt-in, announce-only list for my wife's website that has taken 3 years to grow to this size).
A typical mailing to this list generates over 5,000 bounces, giving BounceRunner A LOT of work. At best, BounceRunner can process these at 20/minute on my 2 GHz P4 Red Hat mail server, taking over 4 hours in the best case to finish (running at 90% CPU utilization). However, while it is doing all this work I have the following problems:
(1) The web GUI Membership Management page times out and is unusable.
(2) If CommandRunner starts processing commands, I get "couldn't get list lock" errors in BounceRunner's log (and, of course, the bounce updates to the list are lost).
After studying the code in the Runner class (and BounceRunner in particular), I believe I have a solution to these problems, but I wanted to get a sanity check from everyone on the list BEFORE I begin rewriting the code.
Here is a greatly simplified overview of how I understand BounceRunner currently processes bounces in the Mailman 2.1 code. I've highlighted the trouble spots in CAPS:
While Forever
    (Process all the emails we find in the bounce queue)
    For Every email in queue
        REREAD list from disk
        Dequeue the message
        Extract addresses to bounce
        LOCK the LIST
        For Every address in message
            Register Bounce
        SAVE the list to disk
        UNLOCK the list
    If we didn't PROCESS ANY EMAILS on last pass Then
        SLEEP for SLEEPTIME
CLEANUP ON EXIT
I believe these are the trouble spots that have been causing my problems:
(1) The list SAVE is executed once for every bounced email. For my big list, that's 13 megabytes of data written to and read back from the disk for EVERY bounce email, which is why it takes 2-3 seconds to process each one.
(2) BounceRunner is VERY greedy about the list lock. The time "window" for other processes to acquire the list lock is VERY short when the bounce queue is filling or full. In this case, the lock is only free for the time it takes to extract the addresses from the next email! In addition, because we ONLY sleep when the QUEUE is empty, this behavior can persist for HOURS on a large list.
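As a rough check on these numbers, the arithmetic can be written out. This is only a back-of-the-envelope sketch: the three constants are the figures quoted in this post, nothing is measured, and it assumes every bounce email triggers exactly one full save of the list.

```python
# Figures quoted in the post; none of these are measurements made by this script.
BOUNCES_PER_MAILING = 5000   # "over 5,000 bounces" per mailing
RATE_PER_MINUTE = 20         # best-case BounceRunner throughput
LIST_SIZE_MB = 13            # size of the list data saved on each bounce

# Time to drain the bounce queue at the best-case rate.
minutes = BOUNCES_PER_MAILING / RATE_PER_MINUTE
# Average wall-clock time spent on one bounce email.
seconds_per_bounce = 60 / RATE_PER_MINUTE
# Total data written if every bounce email causes one full list save.
mb_written = BOUNCES_PER_MAILING * LIST_SIZE_MB

print(f"{minutes / 60:.1f} hours to drain the queue")
print(f"{seconds_per_bounce:.0f} s per bounce email")
print(f"{mb_written / 1000:.0f} GB written, and about as much read back")
```

The 3 seconds per bounce email that falls out of the 20/minute rate is consistent with the 2-3 seconds observed per email.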
Here's my version of the new, improved BounceRunner:
Initialize x to number of bounces to process on each pass
While Forever
    Initialize Python list structure to hold bounces
    (Process x emails in the bounce queue)
    For x emails in queue
        Dequeue the message
        Extract addresses to bounce
        SAVE address and Listname in Python list structure
    If Python List structure contains emails
        For all mailing lists in Python structure
            REREAD list from disk
            LOCK the LIST
            For all addresses that bounced for this list
                Register Bounce
            SAVE the list to disk
            UNLOCK the list
    SLEEP for SLEEPTIME
CLEANUP on exit
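The proposed loop could look roughly like the sketch below. It is written in modern Python rather than against the Mailman 2.1 source, and every name in it (FakeList, run_one_pass, run_forever, the queue of (listname, addresses) tuples) is a hypothetical stand-in invented for illustration, not Mailman's actual API. It assumes the addresses have already been extracted from each queued message.

```python
import time
from collections import defaultdict, deque


class FakeList:
    """Hypothetical stand-in for a mailing list object (NOT Mailman's MailList)."""

    def __init__(self, name):
        self.name = name
        self.bounces = defaultdict(int)  # address -> number of registered bounces
        self.saves = 0                   # counts how often the list was "written"
        self.locked = False

    def load(self):
        pass  # a real implementation would reread the list from disk here

    def lock(self):
        self.locked = True

    def unlock(self):
        self.locked = False

    def register_bounce(self, address):
        if not self.locked:
            raise RuntimeError("list must be locked before registering bounces")
        self.bounces[address] += 1

    def save(self):
        self.saves += 1  # a real implementation would write the list to disk


def run_one_pass(queue, lists, batch_size):
    """Dequeue up to batch_size messages, then update each affected list once.

    queue is a deque of (listname, [addresses]) tuples.
    Returns the number of messages taken off the queue.
    """
    pending = defaultdict(list)  # listname -> addresses collected this pass
    taken = 0
    while queue and taken < batch_size:
        listname, addresses = queue.popleft()
        pending[listname].extend(addresses)
        taken += 1

    for listname, addresses in pending.items():
        mlist = lists[listname]
        mlist.load()
        mlist.lock()
        try:
            for address in addresses:
                mlist.register_bounce(address)
            mlist.save()  # one save per list per pass, not one per message
        finally:
            mlist.unlock()
    return taken


def run_forever(queue, lists, batch_size, sleep_time, should_stop):
    """Process one batch, then always sleep so other processes can get the lock."""
    while not should_stop():
        run_one_pass(queue, lists, batch_size)
        time.sleep(sleep_time)
```

Grouping the collected addresses by list name before locking is what lets a pass lock and save each list once, however many bounce messages for that list were in the batch.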
Advantages to this method:
(1) We process a number of bounces before writing out the list, reducing I/O (the real bottleneck) by a factor of x. When x is one, the algorithm almost degenerates to the current method.
(2) Since we always sleep on each pass, other processes (like the web GUI) get a chance to read the list.
(3) By changing x we control the number of bounces that get processed on each pass. The time it takes to extract the addresses gives other processes time to acquire the list lock and avoid "lockout".
(4) Since "in memory" bounce registration is very fast, we can do a lot of registrations while the list is locked without adding significantly to the already long lock time on a big list (I believe the I/O is the limiting factor).
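To put a number on "a factor of x", here is a small estimate of how many list saves one mailing's worth of bounces would need at different batch sizes. It assumes all 5,000 bounce emails are for the same list and that every pass finds a full batch waiting; list_saves is a name made up for this sketch.

```python
import math


def list_saves(bounce_messages, batch_size):
    """Number of list saves when every message targets the same list."""
    return math.ceil(bounce_messages / batch_size)


# 13 MB per save is the list size quoted earlier in the post.
for x in (1, 10, 100):
    saves = list_saves(5000, x)
    print(f"x={x}: {saves} saves, about {saves * 13} MB written")
```

With several lists sharing the queue the saving is smaller, since each list that appears in a batch still needs its own save.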
Disadvantages:
(1) A larger number of bounces could be lost if we can't acquire the list lock to update the list. If desired, we could write the extracted addresses to a file to allow easier recovery in this situation. However, since they are just bounces, it's not a huge loss anyway.
(2) The processing time for the larger batch of bounces WILL be greater than for the single bounce processed now. How much greater, I don't know. This means that the list will be locked for a longer period on each pass. However, it will be locked LESS frequently, since the bounces can be cleared from the queue faster.
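The recovery file mentioned under disadvantage (1) could be as simple as the sketch below. The function names, the JSON format and the file location are all assumptions made for illustration; nothing here is taken from Mailman. The idea is to write the collected addresses out before trying for the lock, and to delete the file only after the list has been saved.

```python
import json
import os


def spool_pending(pending, path):
    """Write {listname: [addresses]} to path, replacing any older file atomically."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fp:
        json.dump(pending, fp)
        fp.flush()
        os.fsync(fp.fileno())  # make sure the data is on disk before the rename
    os.replace(tmp, path)


def load_pending(path):
    """Return the spooled bounces, or an empty dict if nothing was spooled."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as fp:
        return json.load(fp)


def clear_pending(path):
    """Remove the spool file once the bounces are safely registered."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass
```

On startup, or on the next pass after a failed lock attempt, the runner would call load_pending first and merge whatever it finds into the new batch.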
I'm thinking of a similar strategy for CommandRunner, since that is my other resource hog, taking 2-5 seconds per subscribe or unsubscribe.
Thoughts? Comments? Suggestions? I'm interested in any and all responses.
participants (1)
-
John