[Mailman-Users] Bounce Processing Problems
Brad Knowles
brad at stop.mail-abuse.org
Thu Sep 1 18:28:17 CEST 2005
At 11:35 AM -0400 2005-09-01, Paul Rubin wrote:
> * Contacting the bug forum on Source forge and being told that bounce
> processing is a problem that they hope to have time to fix in 2.1.6 (see
> 1077587)
Note that 2.1.6 has already been shipped, and it did make some
notable improvements in the bounce handling process.
However, I know that it did not completely resolve bounce
handling problems, because we recently ran into some issues on
python.org (running Python 2.4 and Mailman 2.1.6), where we racked up
almost 100,000 bounces in the space of a very few days, in many cases
getting multiple bounces in the span of a single second. I ended up
shutting down Mailman, moving the "bounce" subdirectory to
"bounce.old", and creating a new "bounce" directory (with the same
ownership and permissions); after that, I haven't seen these
problems recur.
Unfortunately, we don't really know yet what caused our problems
on python.org, but we are still investigating.
> What top often looks like:
>
> 09:39:33 up 36 days, 15:58, 9 users, load average: 5.23, 5.75, 4.98
> 497 processes: 485 sleeping, 8 running, 4 zombie, 0 stopped
> CPU states: 67.8% user 16.6% system 0.0% nice 0.0% iowait 15.4% idle
> Mem: 1029884k av, 973668k used, 56216k free, 0k shrd, 149788k buff
> 739156k actv, 58428k in_d, 17080k in_c
> Swap: 2040244k av, 188092k used, 1852152k free 184036k cached
>
> PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
> 14978 mailman 16 0 324M 318M 2328 R 6.5 31.6 5:46 0
> /usr/local/bin/python /var/mailman/bin/qrunner --runner=BounceRunner:0:1
Well, BounceRunner is taking up a lot of memory, but I don't see
it using a lot of CPU. Looking at the system with tools like
"iostat" and "vmstat" is usually useful when you're having
problems like this.
Especially useful is support for the "-x" option to
iostat, because that lets you look at things like the disk I/O wait
queue, and gives you a very good idea of whether you are being
limited by synchronous metadata updates, such as are common with
excessive lock contention, or with the creation/use/deletion of lots
of temporary files in a short period of time.
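As a concrete sketch, assuming the Linux sysstat flavor of iostat
(where the utilization figure is the last column of each device line;
the `util_cols` helper name and the device-name pattern are my own
rough heuristics, not part of any tool):

```shell
# Filter "iostat -x" output down to device name and the last column
# (utilization on sysstat's iostat); devices pinned near 100% are the
# ones whose I/O wait queue is backing up.
util_cols() {
    awk '/^(sd|hd|vd|xvd|nvme|dm-)/ { print $1, $NF }'
}

# Typical use during an incident (sample every 5 seconds):
#   iostat -x 5 | util_cols
```

Exact column names and positions vary between sysstat versions and
platforms, so check your local iostat(1) output before trusting any
one field.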
The "vmstats" tool comes in handy when trying to determine if
you're seeing serious memory pressure, maybe excessive disk cache
thrashing, etc... during periods of high activity. If you see high
rates of page-ins and page-outs, that's a very bad sign. High rates
of page-ins, in combination with relatively little free memory, is
not necessarily a bad sign -- the system may just be using memory
that would otherwise be sitting idle as an expanded disk cache. But
high rates of page-outs is bad news.
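A minimal sketch of that check, assuming the common Linux procps
vmstat layout where si and so are fields 7 and 8 (the `swap_cols`
helper name is my own):

```shell
# Strip vmstat output down to the page-in (si) and page-out (so)
# columns; sustained non-zero "so" values are the bad sign described
# above. The first two lines of vmstat output are headers.
swap_cols() {
    awk 'NR > 2 { print "si=" $7 " so=" $8 }'
}

# Typical use during an incident (three 5-second samples):
#   vmstat 5 3 | swap_cols
```

As with iostat, column positions differ across vmstat versions, so
verify them against your own header line first.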
You might also want to consider turning on lock debugging,
pending lockdb debugging, etc... in your mm_cfg.py.
> And immediately after a shutdown restart cycle:
>
> 09:42:12 up 36 days, 16:01, 9 users, load average: 4.21, 5.23, 4.91
> 532 processes: 521 sleeping, 7 running, 4 zombie, 0 stopped
> CPU states: 82.3% user 6.3% system 0.0% nice 0.0% iowait 11.3% idle
> Mem: 1029884k av, 643420k used, 386464k free, 0k shrd, 149556k buff
> 435668k actv, 33664k in_d, 16752k in_c
> Swap: 2040244k av, 111832k used, 1928412k free 161048k cached
>
> PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
> 32382 mailman 16 0 24312 23M 2660 R 27.2 2.3 0:03 0
> /usr/local/bin/python /var/mailman/bin/qrunner --runner=ArchRunner:0:1 -s
> 32385 mailman 15 0 22036 21M 2616 R 23.0 2.1 0:02 0
> /usr/local/bin/python /var/mailman/bin/qrunner --runner=IncomingRunner:0:
> 32388 mailman 15 0 21816 21M 2584 S 14.1 2.1 0:01 0
> /usr/local/bin/python /var/mailman/bin/qrunner --runner=VirginRunner:0:1
> 32383 mailman 16 0 20604 20M 2568 R 6.1 2.0 0:03 0
> /usr/local/bin/python /var/mailman/bin/qrunner --runner=BounceRunner:0:1
Well, you've certainly got a lot more memory showing as free --
memory that I believe was previously allocated to BounceRunner.
This is one potential indication that you are actually suffering
from memory shortages during the periods you're describing.
However, you'd have to do further investigation with tools like
iostat and vmstat during times of trouble, before you'd be likely to
have a good idea of what was really going on.
> 32387 mailman 15 0 15960 15M 2712 S 0.5 1.5 0:02 0
> /usr/local/bin/python /var/mailman/bin/qrunner --runner=OutgoingRunner:0:
> 32379 mailman 25 0 5900 5900 2596 S 0.0 0.5 0:00 0
> /usr/bin/python /var/mailman/bin/mailmanctl -s -q start
> 32386 mailman 15 0 5840 5840 2512 S 0.0 0.5 0:00 0
> /usr/local/bin/python /var/mailman/bin/qrunner --runner=NewsRunner:0:1 -s
> 32384 mailman 15 0 5804 5804 2512 S 0.0 0.5 0:00 0
> /usr/local/bin/python /var/mailman/bin/qrunner --runner=CommandRunner:0:1
> 32389 mailman 25 0 5792 5792 2512 S 0.0 0.5 0:00 0
> /usr/local/bin/python /var/mailman/bin/qrunner --runner=RetryRunner:0:1 -
>
> Notice how the other tasks suddenly get busy....
That's perfectly normal when you restart after a period of having
problems, as the other parts of the system try to catch up to where
they should be.
> Log files were completely flushed at 4PM yesterday.
>
> [mailman at tbnonline ~]$ ls -al logs
> total 16144
> drwxrwsr-x 2 mailman mailman 4096 Aug 31 19:19 .
> drwxrwsr-x 34 mailman mailman 4096 Sep 1 08:40 ..
> -rw-rw-r-- 1 mailman mailman 15793138 Sep 1 09:02 bounce
> -rw-rw-r-- 1 smmsp mailman 335919 Sep 1 09:01 error
> -rw-rw-r-- 1 mailman mailman 5181 Sep 1 06:25 locks
> -rw-rw-r-- 1 mailman mailman 28700 Sep 1 09:01 post
> -rw-rw-r-- 1 mailman mailman 35858 Sep 1 08:32 qrunner
> -rw-rw-r-- 1 mailman mailman 263590 Sep 1 09:02 smtp
> -rw-rw-r-- 1 mailman mailman 1519 Sep 1 06:31 smtp-failure
> -rw-rw-r-- 1 mailman mailman 6833 Sep 1 08:59 subscribe
> -rw-rw-r-- 1 mailman mailman 1069 Sep 1 08:32 vette
That's a *huge* bounce log. One problem you can run
into is that once you've got more than ~1000-10000 files in a single
directory, just accessing that directory can take a very long time,
because of the linear table scan. Creating a new file in that
directory can take orders of magnitude longer than it used to: the
system has to lock the entire directory and then scan it to confirm
that no other file in it has the same name, before it can release
the lock on the directory and return to the caller.
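One quick way to check whether a queue directory has crossed into
that danger zone (the path shown assumes the stock Mailman 2.1
qfiles layout under /var/mailman; the `count_entries` helper name is
my own):

```shell
# Count directory entries without the sorting/stat overhead of a
# plain "ls", which itself gets slow on huge directories.
count_entries() {
    find "$1" -mindepth 1 -maxdepth 1 | wc -l | tr -d ' '
}

# Typical use:
#   count_entries /var/mailman/qfiles/bounce
```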
Try moving the current "bounce" directory to something like
"bounce.old", creating a new one with the same ownership and
permissions, and then stopping and restarting Mailman. I'd be
willing to bet that directory lock contention has been a *huge* part
of your problem -- I certainly believe that it was for us on
python.org.
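The steps above can be sketched as shell commands. The paths follow
the poster's /var/mailman layout and Mailman 2.1's qfiles tree, the
`rotate_dir` helper name is my own, and the ownership fix-up needs
root -- adjust all of it for your install:

```shell
# Move a hot queue directory aside and recreate it empty with the
# same permissions (GNU stat syntax); chown separately if needed.
rotate_dir() {
    mv "$1" "$1.old"
    mkdir -m "$(stat -c '%a' "$1.old")" "$1"
}

# Typical use (with Mailman stopped, as in the procedure above):
#   /var/mailman/bin/mailmanctl stop
#   rotate_dir /var/mailman/qfiles/bounce
#   chown mailman:mailman /var/mailman/qfiles/bounce
#   /var/mailman/bin/mailmanctl start
```

Keeping the old directory around (rather than deleting it) means the
queued bounce files are still there if you want to inspect or replay
them later.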
--
Brad Knowles, <brad at stop.mail-abuse.org>
"Those who would give up essential Liberty, to purchase a little
temporary Safety, deserve neither Liberty nor Safety."
-- Benjamin Franklin (1706-1790), reply of the Pennsylvania
Assembly to the Governor, November 11, 1755
SAGE member since 1995. See <http://www.sage.org/> for more info.