[Mailman-Users] Bounce Processing Problems
Brad Knowles
brad at stop.mail-abuse.org
Thu Sep 1 18:28:17 CEST 2005
At 11:35 AM -0400 2005-09-01, Paul Rubin wrote:
> * Contacting the bug forum on Source forge and being told that bounce
> processing is a problem that they hope to have time to fix in 2.1.6 (see
> 1077587)
Note that 2.1.6 has already been shipped, and it did make some
notable improvements in the bounce handling process.
However, I know that it did not completely resolve bounce
handling problems, because we recently ran into some issues on
python.org (running Python 2.4 and Mailman 2.1.6), where we racked up
almost 100,000 bounces in the space of a very few days, in many cases
getting multiple bounces in the span of a single second. I ended up
shutting down Mailman, moving the "bounce" subdirectory to
"bounce.old", and creating a new "bounce" directory (with the same
ownership and permissions); after that, I haven't seen these
problems recur.
Unfortunately, we don't really know yet what caused our problems
on python.org, but we are still investigating.
> What top often looks like:
>
> 09:39:33 up 36 days, 15:58, 9 users, load average: 5.23, 5.75, 4.98
> 497 processes: 485 sleeping, 8 running, 4 zombie, 0 stopped
> CPU states: 67.8% user 16.6% system 0.0% nice 0.0% iowait 15.4% idle
> Mem: 1029884k av, 973668k used, 56216k free, 0k shrd, 149788k buff
> 739156k actv, 58428k in_d, 17080k in_c
> Swap: 2040244k av, 188092k used, 1852152k free 184036k cached
>
> PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
> 14978 mailman 16 0 324M 318M 2328 R 6.5 31.6 5:46 0
> /usr/local/bin/python /var/mailman/bin/qrunner --runner=BounceRunner:0:1
Well, BounceRunner is taking up a lot of memory, but I don't see
it using a lot of CPU. Looking at the system with tools like
"iostat" and "vmstat" is usually useful when you're having
problems like this.
Especially useful is support for the "-x" option to
iostat, because that lets you look at things like the disk I/O wait
queue, and gives you a very good idea of whether you are being
limited by synchronous metadata updates, such as are common with
excessive lock contention, or with the creation/use/deletion of lots
of temporary files in a short period of time.
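As a concrete sketch, assuming the Linux sysstat flavor of iostat
(where the utilization figure is the last column of each device line;
the `util_cols` helper name and the device-name pattern are my own
rough heuristics, not part of any tool):

```shell
# Filter "iostat -x" output down to device name and the last column
# (utilization on sysstat's iostat); devices pinned near 100% are the
# ones whose I/O wait queue is backing up.
util_cols() {
    awk '/^(sd|hd|vd|xvd|nvme|dm-)/ { print $1, $NF }'
}

# Typical use during an incident (sample every 5 seconds):
#   iostat -x 5 | util_cols
```

Exact column names and positions vary between sysstat versions and
platforms, so check your local iostat(1) output before trusting any
one field.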
The "vmstats" tool comes in handy when trying to determine if
you're seeing serious memory pressure, maybe excessive disk cache
thrashing, etc... during periods of high activity. If you see high
rates of page-ins and page-outs, that's a very bad sign. High rates
of page-ins, in combination with relatively little free memory, is
not necessarily a bad sign -- the system may just be using memory
that would otherwise be sitting idle as an expanded disk cache. But
high rates of page-outs is bad news.
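A minimal sketch of that check, assuming the common Linux procps
vmstat layout where si and so are fields 7 and 8 (the `swap_cols`
helper name is my own):

```shell
# Strip vmstat output down to the page-in (si) and page-out (so)
# columns; sustained non-zero "so" values are the bad sign described
# above. The first two lines of vmstat output are headers.
swap_cols() {
    awk 'NR > 2 { print "si=" $7 " so=" $8 }'
}

# Typical use during an incident (three 5-second samples):
#   vmstat 5 3 | swap_cols
```

As with iostat, column positions differ across vmstat versions, so
verify them against your own header line first.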
You might also want to consider turning on lock debugging,
pending lockdb debugging, etc... in your mm_cfg.py.
> And immediately after a shutdown restart cycle:
>
> 09:42:12 up 36 days, 16:01, 9 users, load average: 4.21, 5.23, 4.91
> 532 processes: 521 sleeping, 7 running, 4 zombie, 0 stopped
> CPU states: 82.3% user 6.3% system 0.0% nice 0.0% iowait 11.3% idle
> Mem: 1029884k av, 643420k used, 386464k free, 0k shrd, 149556k buff
> 435668k actv, 33664k in_d, 16752k in_c
> Swap: 2040244k av, 111832k used, 1928412k free 161048k cached
>
> PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
> 32382 mailman 16 0 24312 23M 2660 R 27.2 2.3 0:03 0
> /usr/local/bin/python /var/mailman/bin/qrunner --runner=ArchRunner:0:1 -s
> 32385 mailman 15 0 22036 21M 2616 R 23.0 2.1 0:02 0
> /usr/local/bin/python /var/mailman/bin/qrunner --runner=IncomingRunner:0:
> 32388 mailman 15 0 21816 21M 2584 S 14.1 2.1 0:01 0
> /usr/local/bin/python /var/mailman/bin/qrunner --runner=VirginRunner:0:1
> 32383 mailman 16 0 20604 20M 2568 R 6.1 2.0 0:03 0
> /usr/local/bin/python /var/mailman/bin/qrunner --runner=BounceRunner:0:1
Well, you've certainly got a lot more memory showing as free --
memory that I believe was previously allocated to BounceRunner.
This is one potential indication that you are actually suffering
from memory shortages during the periods you're describing.
However, you'd have to do further investigation with tools like
iostat and vmstat during times of trouble, before you'd be likely to
have a good idea of what was really going on.
> 32387 mailman 15 0 15960 15M 2712 S 0.5 1.5 0:02 0
> /usr/local/bin/python /var/mailman/bin/qrunner --runner=OutgoingRunner:0:
> 32379 mailman 25 0 5900 5900 2596 S 0.0 0.5 0:00 0
> /usr/bin/python /var/mailman/bin/mailmanctl -s -q start
> 32386 mailman 15 0 5840 5840 2512 S 0.0 0.5 0:00 0
> /usr/local/bin/python /var/mailman/bin/qrunner --runner=NewsRunner:0:1 -s
> 32384 mailman 15 0 5804 5804 2512 S 0.0 0.5 0:00 0
> /usr/local/bin/python /var/mailman/bin/qrunner --runner=CommandRunner:0:1
> 32389 mailman 25 0 5792 5792 2512 S 0.0 0.5 0:00 0
> /usr/local/bin/python /var/mailman/bin/qrunner --runner=RetryRunner:0:1 -
>
> Notice how the other tasks suddenly get busy....
That's perfectly normal when you restart after a period of having
problems, as the other parts of the system try to catch up to where
they should be.
> Log files were completely flushed at 4PM yesterday.
>
> [mailman at tbnonline ~]$ ls -al logs
> total 16144
> drwxrwsr-x 2 mailman mailman 4096 Aug 31 19:19 .
> drwxrwsr-x 34 mailman mailman 4096 Sep 1 08:40 ..
> -rw-rw-r-- 1 mailman mailman 15793138 Sep 1 09:02 bounce
> -rw-rw-r-- 1 smmsp mailman 335919 Sep 1 09:01 error
> -rw-rw-r-- 1 mailman mailman 5181 Sep 1 06:25 locks
> -rw-rw-r-- 1 mailman mailman 28700 Sep 1 09:01 post
> -rw-rw-r-- 1 mailman mailman 35858 Sep 1 08:32 qrunner
> -rw-rw-r-- 1 mailman mailman 263590 Sep 1 09:02 smtp
> -rw-rw-r-- 1 mailman mailman 1519 Sep 1 06:31 smtp-failure
> -rw-rw-r-- 1 mailman mailman 6833 Sep 1 08:59 subscribe
> -rw-rw-r-- 1 mailman mailman 1069 Sep 1 08:32 vette
That's a *huge* bounce log. One problem you can run
into is that once you've got more than ~1000-10000 files in a single
directory, just accessing that directory can take a very long time,
because of the linear table scan. Creating a new file in that
directory can take orders of magnitude longer than it used to: the
system has to lock the entire directory and then scan it to confirm
that no other file in it has the same name, before it can release
the lock on the directory and return to the caller.
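One quick way to check whether a queue directory has crossed into
that danger zone (the path shown assumes the stock Mailman 2.1
qfiles layout under /var/mailman; the `count_entries` helper name is
my own):

```shell
# Count directory entries without the sorting/stat overhead of a
# plain "ls", which itself gets slow on huge directories.
count_entries() {
    find "$1" -mindepth 1 -maxdepth 1 | wc -l | tr -d ' '
}

# Typical use:
#   count_entries /var/mailman/qfiles/bounce
```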
Try moving the current "bounce" directory to something like
"bounce.old", creating a new one with the same ownership and
permissions, and then stopping and restarting Mailman. I'd be
willing to bet that directory lock contention has been a *huge* part
of your problem -- I certainly believe that it was for us on
python.org.
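The steps above can be sketched as shell commands. The paths follow
the poster's /var/mailman layout and Mailman 2.1's qfiles tree, the
`rotate_dir` helper name is my own, and the ownership fix-up needs
root -- adjust all of it for your install:

```shell
# Move a hot queue directory aside and recreate it empty with the
# same permissions (GNU stat syntax); chown separately if needed.
rotate_dir() {
    mv "$1" "$1.old"
    mkdir -m "$(stat -c '%a' "$1.old")" "$1"
}

# Typical use (with Mailman stopped, as in the procedure above):
#   /var/mailman/bin/mailmanctl stop
#   rotate_dir /var/mailman/qfiles/bounce
#   chown mailman:mailman /var/mailman/qfiles/bounce
#   /var/mailman/bin/mailmanctl start
```

Keeping the old directory around (rather than deleting it) means the
queued bounce files are still there if you want to inspect or replay
them later.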
--
Brad Knowles, <brad at stop.mail-abuse.org>
"Those who would give up essential Liberty, to purchase a little
temporary Safety, deserve neither Liberty nor Safety."
-- Benjamin Franklin (1706-1790), reply of the Pennsylvania
Assembly to the Governor, November 11, 1755
SAGE member since 1995. See <http://www.sage.org/> for more info.