[Mailman-Users] help wanted: debian woody mailman suddenly stopped with (seemingly) qrunner lock file problem

Ziegler Gábor ziegler at alpha.tmit.bme.hu
Thu Dec 16 23:25:54 CET 2004


Dear gurus,

I run a fairly low-traffic mailman on a stock debian woody server, which 
suddenly stopped working. I am clueless and looking for help. Details below.

My system:
---------
Debian stable (stock debian woody, regularly updated from 
security.debian.org)
Debianized stock mailman package v.2.0.11
Debianized stock Exim package: version 3.35 #1 built 07-May-2004 08:25:17

Symptoms:
---------
A few days ago the server suddenly stopped processing incoming messages; 
they just accumulate in the qfiles subdir. Admin access via the web is 
working, I can add users, etc. No pending mails are reported by the web 
admin GUI. Mails are accepted by the MTA w/o complaints, but no mail goes 
out to the lists. Nothing. The non-mailman-related SMTP traffic 
flows as normal.

The server has been running for years w/o any real problem. It has run 
out of disk space before, but freeing up some space has always solved 
the problem.

Below comes the summary of my investigations. I am totally clueless 
about the problem; any help is highly appreciated.

I repeat: the server has worked for years, and no (intentional) config 
changes have happened. There was, however, a report from a list-admin of 
the server running out of disk space, but that has already been taken 
care of.

Zeroth examination: disk space check:
-------------------
df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/hdb1             1.9G  1.8G  116M  94% /
/dev/hdb5             3.9G  3.6G  163M  96% /home
/dev/hdb3             1.9G  1.2G  738M  61% /var
/dev/hdb6             3.9G  3.1G  724M  81% /usr/local
/dev/hda1             7.6M  5.6M  1.6M  78% /boot
/dev/hda2             4.7G  3.2G  1.2G  72% /archives-hda2

Note, there is plenty of disk space in /var.
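
For reference: "df -h" only shows free blocks, while a filesystem can also refuse new files when it runs out of inodes ("df -i" shows those). Below is a minimal sketch that reports both for a given path. It assumes a Unix system and a current Python 3 (not the Python 2.1 on this server), and the function name is made up for illustration.

```python
import os

def free_space_report(path):
    """Return free bytes and free inodes for the filesystem holding path."""
    st = os.statvfs(path)
    return {
        # space available to non-root users, in bytes
        "free_bytes": st.f_bavail * st.f_frsize,
        # inodes available to non-root users
        "free_inodes": st.f_favail,
    }

if __name__ == "__main__":
    for mount in ("/", "/var"):
        if os.path.isdir(mount):
            print(mount, free_space_report(mount))
```

If free_inodes were at or near zero on /var, new queue files could fail to be written even with hundreds of megabytes free.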

First examination: SMTP works
-----------------------------
According to the logs, exim delivers. Just an example from Exim's 
mainlog, showing a successful delivery to the mailman list "nsht":

2004-12-15 08:34:11 1CeTfn-0002e9-00 <= XXXXXX at tmit.bme.hu H=david.tmit.bme.hu [152.66.246.102] P=esmtp S=1865 id=Pine.GSO.3.96.1041215083328.21437A-100000 at david.tmit.bme.hu
2004-12-15 08:34:12 1CeTfn-0002e9-00 => nsht <nsht at leda.tmit.bme.hu> D=list_director T=list_transport
2004-12-15 08:34:12 1CeTfn-0002e9-00 Completed

Furthermore, I actively use this Exim as my everyday default SMTP MTA, 
and it works just fine.

Second examination: the messages seem to reach the qfiles directory.
--------------------------------------------------------------------
There are various entries like this:
f0fb10de9b998a5a185~aa29819f1395b9.db    size:115  date:Dec 15 23:03
f0fb10de9b998a5a185~a29819f1395b9.msg    size:825  date:Dec 15 23:03
The content of a .db file:
leda:/var/lib/mailman/qfiles# cat -vte f0fb10de9b998a5a1858842d62aa29819f1395b9.db
{s^F^@^@^@tolisti^A^@^@^@s^G^@^@^@versioni^B^@^@^@s^H^@^@^@listnames^D^@^@^@nshts^H^@^@^@filebases(^@^@^@f0fb10de9b998a5a18
The content of the .msg file seems to be a normal SMTP envelope and body.

The biggest .msg file in this directory is 6656 bytes, therefore 
disk-free-space cannot be the issue.
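
The "cat -vte" output above looks like a Python marshal dump of a small dictionary (keys such as tolist, version, listname, filebase). Assuming that is really what the .db files are, a sketch like the following would print one in readable form. It is written for a current Python 3, where the old string values may load as bytes; the function name is illustrative only.

```python
import marshal

def read_queue_metadata(path):
    """Load the marshalled dictionary assumed to be stored in a qfiles .db file."""
    with open(path, "rb") as fp:
        data = marshal.load(fp)
    if not isinstance(data, dict):
        raise ValueError("%s does not contain a dictionary" % path)

    def text(value):
        # Decode byte strings for display; leave ints and other values untouched.
        return value.decode("latin-1") if isinstance(value, bytes) else value

    return {text(k): text(v) for k, v in data.items()}
```

A .db file that was cut short (for example while the disk was full) would be expected to make marshal.load raise an exception instead of returning a dictionary, so running this over every .db file in qfiles is one way to look for damaged entries.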

Third examination: perms seem to be O.K.
----------------------------------------
leda:/var/lib/mailman/qfiles# check_perms
No problems found

Fourth examination: checking the config database of the reporting 
list-admin's list, "nsht"
--------------------------------------------------------
leda:/var/lib/mailman/qfiles# check_db nsht
/var/lib/mailman/lists/nsht/config.db is fine
/var/lib/mailman/lists/nsht/config.db.last is fine

Note that no list seems to work on the server (there are some tens of 
lists), neither "nsht" nor the others.


Fifth examination: checking crontab for mailman
-----------------------------------------------
leda:/var/lib/mailman/qfiles# cat /etc/cron.d/mailman
12,42 * * * *   list    [ -x /usr/bin/python -a -f /usr/lib/mailman/cron/run_queue ] && /usr/bin/python /usr/lib/mailman/cron/run_queue
# */5 * * * *   list    [ -x /usr/bin/python -a -f /usr/lib/mailman/cron/gate_news ] && /usr/bin/python /usr/lib/mailman/cron/gate_news
* * * * *       list    [ -x /usr/bin/python -a -f /usr/lib/mailman/cron/qrunner ] && /usr/bin/python /usr/lib/mailman/cron/qrunner

Cron daemon is up and running. The qrunner script runs every minute. See 
the next examination.

Sixth examination: checking mailman logs
--------------------------------------------
Everything seems normal, except that qrunner continually emits 
errors at each run to /var/lib/mailman/logs/error, such as these:

Dec 16 00:06:02 2004 qrunner(18367): Traceback (most recent call last):
Dec 16 00:06:02 2004 qrunner(18367):   File "/usr/lib/mailman/cron/qrunner", line 283, in ?
Dec 16 00:06:02 2004 qrunner(18367):      kids = main(lock)
Dec 16 00:06:02 2004 qrunner(18367):   File "/usr/lib/mailman/cron/qrunner", line 253, in main
Dec 16 00:06:02 2004 qrunner(18367):      keepqueued = dispose_message(mlist, msg, msgdata)
Dec 16 00:06:02 2004 qrunner(18367):   File "/usr/lib/mailman/cron/qrunner", line 121, in dispose_message
Dec 16 00:06:02 2004 qrunner(18367):      if BouncerAPI.ScanMessages(mlist, mimemsg):
Dec 16 00:06:02 2004 qrunner(18367):   File "/usr/lib/mailman/Mailman/Bouncers/BouncerAPI.py", line 59, in ScanMessages
Dec 16 00:06:02 2004 qrunner(18367):      addrs = func(msg)
Dec 16 00:06:02 2004 qrunner(18367):   File "/usr/lib/mailman/Mailman/Bouncers/Postfix.py", line 39, in process
Dec 16 00:06:02 2004 qrunner(18367):      more = mfile.next()
Dec 16 00:06:02 2004 qrunner(18367):   File "/usr/lib/python2.1/multifile.py", line 123, in next
Dec 16 00:06:02 2004 qrunner(18367):      while self.readline(): pass
Dec 16 00:06:02 2004 qrunner(18367):   File "/usr/lib/python2.1/multifile.py", line 95, in readline
Dec 16 00:06:02 2004 qrunner(18367):      if marker == self.section_divider(sep):
Dec 16 00:06:02 2004 qrunner(18367):   File "/usr/lib/python2.1/multifile.py", line 159, in section_divider
Dec 16 00:06:02 2004 qrunner(18367):      return "--" + str
Dec 16 00:06:02 2004 qrunner(18367): TypeError :  cannot add type "None" to string
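
The last frame fails on "--" + str with str being None. That suggests (this is only a guess from the traceback, nothing confirmed) that one of the queued messages declares a multipart content type but carries no boundary parameter, and that the bounce scanner chokes on it at every run. Here is a sketch for locating such a message among the .msg files, using the email package of a current Python 3 rather than the server's Python 2.1. The function name is hypothetical.

```python
import email
import glob
import os

def find_multipart_without_boundary(qdir):
    """Return .msg files in qdir that claim to be multipart but have no boundary."""
    suspects = []
    for path in sorted(glob.glob(os.path.join(qdir, "*.msg"))):
        with open(path, "rb") as fp:
            msg = email.message_from_binary_file(fp)
        # get_boundary() returns None when the Content-Type has no boundary parameter.
        if msg.get_content_maintype() == "multipart" and msg.get_boundary() is None:
            suspects.append(path)
    return suspects
```

If this turned up a file, moving that .msg and its matching .db out of qfiles (rather than deleting them) would be a cautious way to test whether the rest of the queue then drains.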

My attempts to fix what seems to be a lock file problem:
--------------------------------------------------------
1. Since the reporting list-admin claimed a temporary out-of-disk-space 
situation, I double-checked the available free space.

2. I have stopped crond and inetd. I have checked that no python process is 
lurking around, then I have checked with "lsof" that none of the 
lock files in /var/lib/mailman/locks/ is held open by anyone. All 
lock files were older than several months(!). I have deleted all 
lock files and restarted crond and inetd. Qrunner still fails with the 
above error log.

3. As a last attempt I have sacrificed my 135 days of uptime :-( and 
rebooted the system, hoping that the Microsoft approach might help.
The system rebooted just fine, but mailman (qrunner) still does not work.
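
The lock-file check from step 2 can be sketched as a small script that lists files in the locks directory older than a chosen age. This is only an illustration for a current Python 3; the function name and the age threshold are assumptions, and it does not check whether a process still holds a file open (that is what "lsof" was used for).

```python
import os
import time

def stale_locks(lock_dir, max_age_seconds, now=None):
    """Return (name, age_in_seconds) for files in lock_dir older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for name in sorted(os.listdir(lock_dir)):
        path = os.path.join(lock_dir, name)
        if not os.path.isfile(path):
            continue
        # Age is measured from the file's last modification time.
        age = now - os.stat(path).st_mtime
        if age > max_age_seconds:
            stale.append((name, age))
    return stale
```

For example, stale_locks("/var/lib/mailman/locks", 24 * 3600) would list everything untouched for more than a day.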

Now I am out of ideas.
Any advice?

Thanks:
Gábor



