Mailman 2.1.6
Python 2.4.1
27 Lists
Average 8000 people per list
Average message size 100k
Largest list gets around 200 messages per day
Total load average 2million messages per day give or take.
We have been using mailman successfully for about 4 years.
Basic Problem:
Since about 12/2004 The bounce processor consumes most of the memory in
the system and sometimes all of the CPU.
What we have tried:
* Change REGISTER_BOUNCES_EVERY = minutes(1) which change us from
complete crashing to just consuming all memory and cpu.
* Temporarily shut off bounce processing on all lists, no effect.
* Temporarily reroute bounce messages to a mailbox for human
processing, We received about 5,000 per hour most of which were temporary
(mailbox full, mail server not answering, etc)
* When it gets really bad Kill and restart mailman every 2 minutes.
Mostly works but kills all request in progress when the kill comes through,
however those request we already dead because of the locks on the lists
* Contacting the bug forum on Source forge and being told that bounce
processing is a problem that they hope to have time to fix in 2.1.6 (see
1077587)
What top often looks like:
09:39:33 up 36 days, 15:58, 9 users, load average: 5.23, 5.75, 4.98
497 processes: 485 sleeping, 8 running, 4 zombie, 0 stopped
CPU states: 67.8% user 16.6% system 0.0% nice 0.0% iowait 15.4% idle
Mem: 1029884k av, 973668k used, 56216k free, 0k shrd, 149788k
buff
739156k actv, 58428k in_d, 17080k in_c
Swap: 2040244k av, 188092k used, 1852152k free 184036k
cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
14978 mailman 16 0 324M 318M 2328 R 6.5 31.6 5:46 0
/usr/local/bin/python /var/mailman/bin/qrunner --runner=BounceRunner:0:1
14977 mailman 15 0 28916 27M 2420 S 0.0 2.7 0:31 0
/usr/local/bin/python /var/mailman/bin/qrunner --runner=ArchRunner:0:1 -s
14983 mailman 15 0 27608 25M 2320 S 0.2 2.5 0:23 0
/usr/local/bin/python /var/mailman/bin/qrunner --runner=VirginRunner:0:1
14982 mailman 15 0 23832 22M 2460 S 0.7 2.2 0:58 0
/usr/local/bin/python /var/mailman/bin/qrunner --runner=OutgoingRunner:0:
14980 mailman 15 0 22904 21M 2364 S 0.0 2.1 0:27 0
/usr/local/bin/python /var/mailman/bin/qrunner --runner=IncomingRunner:0:
14981 mailman 15 0 5508 2568 2180 S 0.0 0.2 0:00 0
/usr/local/bin/python /var/mailman/bin/qrunner --runner=NewsRunner:0:1 -s
14984 mailman 15 0 5460 2548 2180 S 0.0 0.2 0:00 0
/usr/local/bin/python /var/mailman/bin/qrunner --runner=RetryRunner:0:1 -
14979 mailman 15 0 5476 2540 2180 S 0.0 0.2 0:00 0
/usr/local/bin/python /var/mailman/bin/qrunner --runner=CommandRunner:0:1
14976 mailman 25 0 5524 2220 2216 S 0.0 0.2 0:00 0
/usr/bin/python /var/mailman/bin/mailmanctl -s -q start
And immediately after a shutdown restart cycle:
09:42:12 up 36 days, 16:01, 9 users, load average: 4.21, 5.23, 4.91
532 processes: 521 sleeping, 7 running, 4 zombie, 0 stopped
CPU states: 82.3% user 6.3% system 0.0% nice 0.0% iowait 11.3% idle
Mem: 1029884k av, 643420k used, 386464k free, 0k shrd, 149556k
buff
435668k actv, 33664k in_d, 16752k in_c
Swap: 2040244k av, 111832k used, 1928412k free 161048k
cached
PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME CPU COMMAND
32382 mailman 16 0 24312 23M 2660 R 27.2 2.3 0:03 0
/usr/local/bin/python /var/mailman/bin/qrunner --runner=ArchRunner:0:1 -s
32385 mailman 15 0 22036 21M 2616 R 23.0 2.1 0:02 0
/usr/local/bin/python /var/mailman/bin/qrunner --runner=IncomingRunner:0:
32388 mailman 15 0 21816 21M 2584 S 14.1 2.1 0:01 0
/usr/local/bin/python /var/mailman/bin/qrunner --runner=VirginRunner:0:1
32383 mailman 16 0 20604 20M 2568 R 6.1 2.0 0:03 0
/usr/local/bin/python /var/mailman/bin/qrunner --runner=BounceRunner:0:1
32387 mailman 15 0 15960 15M 2712 S 0.5 1.5 0:02 0
/usr/local/bin/python /var/mailman/bin/qrunner --runner=OutgoingRunner:0:
32379 mailman 25 0 5900 5900 2596 S 0.0 0.5 0:00 0
/usr/bin/python /var/mailman/bin/mailmanctl -s -q start
32386 mailman 15 0 5840 5840 2512 S 0.0 0.5 0:00 0
/usr/local/bin/python /var/mailman/bin/qrunner --runner=NewsRunner:0:1 -s
32384 mailman 15 0 5804 5804 2512 S 0.0 0.5 0:00 0
/usr/local/bin/python /var/mailman/bin/qrunner --runner=CommandRunner:0:1
32389 mailman 25 0 5792 5792 2512 S 0.0 0.5 0:00 0
/usr/local/bin/python /var/mailman/bin/qrunner --runner=RetryRunner:0:1 -
Notice how the other tasks suddenly get busy....
Log files were completely flushed at 4PM yesterday.
[mailman@tbnonline ~]$ ls -al logs
total 16144
drwxrwsr-x 2 mailman mailman 4096 Aug 31 19:19 .
drwxrwsr-x 34 mailman mailman 4096 Sep 1 08:40 ..
-rw-rw-r-- 1 mailman mailman 15793138 Sep 1 09:02 bounce
-rw-rw-r-- 1 smmsp mailman 335919 Sep 1 09:01 error
-rw-rw-r-- 1 mailman mailman 5181 Sep 1 06:25 locks
-rw-rw-r-- 1 mailman mailman 28700 Sep 1 09:01 post
-rw-rw-r-- 1 mailman mailman 35858 Sep 1 08:32 qrunner
-rw-rw-r-- 1 mailman mailman 263590 Sep 1 09:02 smtp
-rw-rw-r-- 1 mailman mailman 1519 Sep 1 06:31 smtp-failure
-rw-rw-r-- 1 mailman mailman 6833 Sep 1 08:59 subscribe
-rw-rw-r-- 1 mailman mailman 1069 Sep 1 08:32 vette
Sample from bounce log:
Sep 01 08:58:30 2005 (14978) bounce message w/no discernable addresses: <
<mailto:20050901125626.19066.qmail@smtp02-01.prod.mesa1.secureserver.net>
20050901125626.19066.qmail(a)smtp02-01.prod.mesa1.secureserver.net>
Sep 01 08:58:30 2005 (14978) discarding unrecognized, message-id: <
<mailto:20050901125626.19066.qmail@smtp02-01.prod.mesa1.secureserver.net>
20050901125626.19066.qmail(a)smtp02-01.prod.mesa1.secureserver.net>
Sep 01 08:58:32 2005 (14978) bounce message w/no discernable addresses: <
<mailto:Terrace26313811125579171@tbnonline.com>
Terrace26313811125579171(a)tbnonline.com>
Sep 01 08:58:32 2005 (14978) discarding unrecognized, message-id: <
<mailto:Terrace26313811125579171@tbnonline.com>
Terrace26313811125579171(a)tbnonline.com>
Sep 01 08:58:34 2005 (14978) bounce message w/no discernable addresses: <
<mailto:0MKvAI-1EAocE2a2R-0002jN@mx.perfora.net>
0MKvAI-1EAocE2a2R-0002jN(a)mx.perfora.net>
Sep 01 08:58:34 2005 (14978) discarding unrecognized, message-id: <
<mailto:0MKvAI-1EAocE2a2R-0002jN@mx.perfora.net>
0MKvAI-1EAocE2a2R-0002jN(a)mx.perfora.net>
Sep 01 08:58:35 2005 (14978) bounce message w/no discernable addresses: <
<mailto:20050901125631.ABF4E38E49@tbnonline.org>
20050901125631.ABF4E38E49(a)tbnonline.org>
Sep 01 08:58:35 2005 (14978) discarding unrecognized, message-id: <
<mailto:20050901125631.ABF4E38E49@tbnonline.org>
20050901125631.ABF4E38E49(a)tbnonline.org>
Sep 01 08:58:39 2005 (14978) bounce message w/no discernable addresses: <
<mailto:20050901125636.19173.qmail@smtp03-01.prod.mesa1.secureserver.net>
20050901125636.19173.qmail(a)smtp03-01.prod.mesa1.secureserver.net>
Sep 01 08:58:39 2005 (14978) discarding unrecognized, message-id: <
<mailto:20050901125636.19173.qmail@smtp03-01.prod.mesa1.secureserver.net>
20050901125636.19173.qmail(a)smtp03-01.prod.mesa1.secureserver.net>
Sample from error log:
Sep 01 08:57:32 2005 (14978) Ignoring unparseable message:
1125579452.21312+848e219740e9c93ca6f761384c120d97593ee364
Sep 01 08:57:58 2005 (14978) Uncaught runner exception: No terminating
boundary and no trailing empty line
Sep 01 08:57:58 2005 (14978) Traceback (most recent call last):
File "/var/mailman/Mailman/Queue/Runner.py", line 99, in _oneloop
msg, msgdata = self._switchboard.dequeue(filebase)
File "/var/mailman/Mailman/Queue/Switchboard.py", line 151, in dequeue
msg = email.message_from_string(msg, Message.Message)
File "/var/mailman/pythonlib/email/__init__.py", line 52, in
message_from_string
return Parser(_class, strict=strict).parsestr(s)
File "/var/mailman/pythonlib/email/Parser.py", line 75, in parsestr
return self.parse(StringIO(text), headersonly=headersonly)
File "/var/mailman/pythonlib/email/Parser.py", line 64, in parse
self._parsebody(root, fp, firstbodyline)
File "/var/mailman/pythonlib/email/Parser.py", line 240, in _parsebody
msgobj = self.parsestr(part)
File "/var/mailman/pythonlib/email/Parser.py", line 75, in parsestr
return self.parse(StringIO(text), headersonly=headersonly)
File "/var/mailman/pythonlib/email/Parser.py", line 64, in parse
self._parsebody(root, fp, firstbodyline)
File "/var/mailman/pythonlib/email/Parser.py", line 265, in _parsebody
msg = self.parse(fp)
File "/var/mailman/pythonlib/email/Parser.py", line 64, in parse
self._parsebody(root, fp, firstbodyline)
File "/var/mailman/pythonlib/email/Parser.py", line 206, in _parsebody
raise Errors.BoundaryError(
BoundaryError: No terminating boundary and no trailing empty line
Sep 01 08:57:58 2005 (14978) Ignoring unparseable message:
1125579478.065345+8ca74677621e4e9262be5503f2e081184abc853e
Sample from smtp log:
Sep 01 09:02:32 2005 (14982) <
<mailto:mailman.68.1125579745.14978.brdcst-compaq@tbnonline.com>
mailman.68.1125579745.14978.brdcst-compaq(a)tbnonline.com> smtp to
brdcst-compaq for 1 recips, completed in 0.253 seconds
Sep 01 09:02:32 2005 (14982) <
<mailto:mailman.69.1125579745.14978.brdcst-compaq@tbnonline.com>
mailman.69.1125579745.14978.brdcst-compaq(a)tbnonline.com> smtp to
brdcst-compaq for 1 recips, completed in 0.432 seconds
Sep 01 09:02:32 2005 (14982) <
<mailto:mailman.70.1125579745.14978.brdcst-compaq@tbnonline.com>
mailman.70.1125579745.14978.brdcst-compaq(a)tbnonline.com> smtp to
brdcst-compaq for 1 recips, completed in 0.140 seconds
Sep 01 09:02:33 2005 (14982) <
<mailto:mailman.71.1125579745.14978.brdcst-compaq@tbnonline.com>
mailman.71.1125579745.14978.brdcst-compaq(a)tbnonline.com> smtp to
brdcst-compaq for 1 recips, completed in 0.315 seconds
Sep 01 09:02:42 2005 (14982) <
<mailto:mailman.61.1125579744.14978.brdcst-printer@tbnonline.com>
mailman.61.1125579744.14978.brdcst-printer(a)tbnonline.com> smtp to
brdcst-printer for 2 recips, completed in 4.212 seconds
Sep 01 09:02:44 2005 (14982) <
<mailto:mailman.65.1125579744.14978.brdcst-dell@tbnonline.com>
mailman.65.1125579744.14978.brdcst-dell(a)tbnonline.com> smtp to brdcst-dell
for 2 recips, completed in 1.629 seconds
Sep 01 09:02:44 2005 (14982) <
<mailto:mailman.71.1125579745.14978.brdcst-compaq@tbnonline.com>
mailman.71.1125579745.14978.brdcst-compaq(a)tbnonline.com> smtp to
brdcst-compaq for 2 recips, completed in 0.842 seconds
Sep 01 09:02:45 2005 (14982) <
<mailto:mailman.63.1125579744.14978.brdcst-printer@tbnonline.com>
mailman.63.1125579744.14978.brdcst-printer(a)tbnonline.com> smtp to
brdcst-printer for 2 recips, completed in 0.427 seconds
Sep 01 09:02:45 2005 (14982) <
<mailto:mailman.69.1125579745.14978.brdcst-compaq@tbnonline.com>
mailman.69.1125579745.14978.brdcst-compaq(a)tbnonline.com> smtp to
brdcst-compaq for 2 recips, completed in 0.391 seconds
Sep 01 09:02:46 2005 (14982) <
<mailto:mailman.67.1125579744.14978.brdcst-compaq@tbnonline.com>
mailman.67.1125579744.14978.brdcst-compaq(a)tbnonline.com> smtp to
brdcst-compaq for 2 recips, completed in 0.374 seconds