[Mailman-Users] Fix: OutgoingRunner qrunner crash with multiple slices enabled.

Brian Greenberg grnbrg at gmail.com
Mon Aug 16 16:16:37 CEST 2004


Mailman 2.1.5 on Solaris 8, with Python 2.3.3.

I was getting the following errors in logs/error and logs/qrunner:

error:

Aug 13 15:16:53 2004 qrunner(7657): Traceback (most recent call last):
Aug 13 15:16:53 2004 qrunner(7657):   File "/usr/local/mailman/bin/qrunner", line 270, in ?
Aug 13 15:16:53 2004 qrunner(7657):      main()
Aug 13 15:16:53 2004 qrunner(7657):   File "/usr/local/mailman/bin/qrunner", line 230, in main
Aug 13 15:16:53 2004 qrunner(7657):      qrunner.run()
Aug 13 15:16:53 2004 qrunner(7657):   File "/usr/local/mailman/Mailman/Queue/Runner.py", line 70, in run
Aug 13 15:16:53 2004 qrunner(7657):      filecnt = self._oneloop()
Aug 13 15:16:53 2004 qrunner(7657):   File "/usr/local/mailman/Mailman/Queue/Runner.py", line 99, in _oneloop
Aug 13 15:16:53 2004 qrunner(7657):      msg, msgdata = self._switchboard.dequeue(filebase)
Aug 13 15:16:53 2004 qrunner(7657):   File "/usr/local/mailman/Mailman/Queue/Switchboard.py", line 144, in dequeue
Aug 13 15:16:53 2004 qrunner(7657):      os.unlink(filename)
Aug 13 15:16:53 2004 qrunner(7657): OSError :  [Errno 2] No such file or directory: '/var/priv/mail/mailman/qfiles/out/1092428211.4786341+bad1265375ae36cc455fc7e521e9c39c09a29558.pck'

qrunner:

Aug 13 15:16:53 2004 (29188) Master qrunner detected subprocess exit (pid: 7657, sig: None, sts: 1, class: OutgoingRunner, slice: 3/4) [restarting]
Aug 13 15:16:54 2004 (7005) OutgoingRunner qrunner started.

Eventually the master qrunner gave up, logging:

Aug 08 05:35:34 2004 (716) Qrunner OutgoingRunner reached maximum restart limit of 10, not restarting.

After that, mail built up in the outgoing queue.

I was running four OutgoingRunner instances, set in mm_cfg.py with:

QRUNNERS = [
    ('ArchRunner',     1), # messages for the archiver
    ('BounceRunner',   1), # for processing the qfile/bounces directory
    ('CommandRunner',  1), # commands and bounces from the outside world
    ('IncomingRunner', 1), # posts from the outside world
    ('NewsRunner',     1), # outgoing messages to the nntpd
    ('OutgoingRunner', 4), # outgoing messages to the smtpd
    ('VirginRunner',   1), # internally crafted (virgin birth) messages
    ('RetryRunner',    1), # retry temporarily failed deliveries
    ]

The cause is a logic error in mailman/Mailman/Queue/Switchboard.py,
and it is fixed by the one-line patch at the end of this message.

In detail:

In Switchboard.py:__init__, the upper and lower bounds (self.__upper
and self.__lower) are both set to "None" if there is only a single
instance of the qrunner class in question, and to the correct upper
and lower bounds of each subslice if there is more than one.
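
To make the bounds concrete, here is a minimal sketch of how such
per-slice bounds could be computed. The function name slice_bounds, the
constant SHA_SPACE, and the even half-open split are assumptions for
illustration (written for Python 3); the exact arithmetic in
Switchboard.py may differ. The 160-bit size matches the 40-hex-digit
digest in the queue file name above.

```python
SHA_SPACE = 1 << 160  # number of possible 160-bit digests (assumed)

def slice_bounds(slice_index, numslices):
    """Return (lower, upper) for one slice, or (None, None) if unsliced.

    Hypothetical helper: splits the digest space evenly into
    half-open ranges [lower, upper).
    """
    if numslices == 1:
        return None, None
    lower = (SHA_SPACE * slice_index) // numslices
    upper = (SHA_SPACE * (slice_index + 1)) // numslices
    return lower, upper

# Show the bounds for a four-slice qrunner.
for i in range(4):
    lower, upper = slice_bounds(i, 4)
    print(i, hex(lower), hex(upper))
```

The point to notice is that the first slice always gets a lower bound
of 0, while an unsliced qrunner gets None.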

In Switchboard.py:files (which returns a list of all files in the
queue directory that this qrunner instance is to process) the
statement that rejects files that are not within the bounds of this
qrunner instance has a logic error.  The line in question:

     if not lower or (lower <= long(digest, 16) < upper ) :

can be read as "If this is a single instance qrunner (because lower is
set to "None", and therefore false) or if the file is within the upper
and lower bounds of this instance, add it to the list of files."  The
problem is that the first slice of any multi-slice qrunner has a lower
bound of 0, and (not 0) evaluates as true.
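
The truthiness problem can be reproduced in isolation. This is only a
sketch: accepts_buggy is a hypothetical stand-in for the test inside
files(), and it uses Python 3 integers where the original calls
long(digest, 16).

```python
def accepts_buggy(lower, upper, digest_value):
    """Same shape as the original test: 'not lower' is meant to detect None."""
    return (not lower) or (lower <= digest_value < upper)

quarter = (1 << 160) // 4        # width of one slice out of four (assumed split)
outside = 3 * quarter            # a digest value that belongs to the last slice

print(not None)                                        # unsliced qrunner
print(not 0)                                           # first slice of a sliced qrunner
print(accepts_buggy(0, quarter, outside))              # slice 0, digest out of range
print(accepts_buggy(quarter, 2 * quarter, outside))    # slice 1, digest out of range
```

The first three lines print True and only the last prints False, so
slice 0 accepts a digest that lies outside its range.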

This results in slice 0 of any multi-slice qrunner trying to grab
files from the entire queue, rather than its assigned portion. That
creates a race condition: when slice 0 and slice n try to process the
same file at the same time, one of the two qrunners crashes (its
os.unlink() fails with the OSError shown above, because the other one
has already removed the file).
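
The overlap can be shown with a small simulation. Everything here is
illustrative: bounds() assumes an even half-open split of a 160-bit
digest space, and the sample digests are made up so that one falls just
inside each slice's range.

```python
SPACE = 1 << 160   # size of a 160-bit digest space (assumed)
NUMSLICES = 4

def bounds(i):
    """Hypothetical even split of the digest space into half-open ranges."""
    return (SPACE * i) // NUMSLICES, (SPACE * (i + 1)) // NUMSLICES

def claims_buggy(i, digest_hex):
    """The original test, as applied by slice i."""
    lower, upper = bounds(i)
    return (not lower) or (lower <= int(digest_hex, 16) < upper)

# One made-up digest just inside each slice's range.
sample = ["%040x" % (bounds(i)[0] + 1) for i in range(NUMSLICES)]

# For each digest, list every slice that would try to process it.
claimants = {d: [i for i in range(NUMSLICES) if claims_buggy(i, d)]
             for d in sample}
for d in sample:
    print(d[:8], claimants[d])
```

Every digest that belongs to slices 1 through 3 is also claimed by
slice 0, which is the pairing seen in the logs above (the crashed
process was slice 3/4).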

Patch:

*** Switchboard.py      Fri Aug 13 16:43:12 2004
--- Switchboard.py_new  Fri Aug 13 16:43:48 2004
***************
*** 164,170 ****
              when, digest = filebase.split('+')
              # Throw out any files which don't match our bitrange.  BAW: test
              # performance and end-cases of this algorithm.
!             if not lower or (lower <= long(digest, 16) < upper):
                  times[float(when)] = filebase
          # FIFO sort
          keys = times.keys()
--- 164,170 ----
              when, digest = filebase.split('+')
              # Throw out any files which don't match our bitrange.  BAW: test
              # performance and end-cases of this algorithm.
!             if (lower == upper) or (lower <= long(digest, 16) < upper):
                  times[float(when)] = filebase
          # FIFO sort
          keys = times.keys()
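
As a sanity check on the patched condition, the sketch below applies it
to one probe value per slice. The helper names and the even split in
bounds() are assumptions for illustration, and Python 3 integers stand
in for long(digest, 16).

```python
SPACE = 1 << 160   # size of a 160-bit digest space (assumed)

def bounds(i, numslices):
    """Hypothetical bounds: (None, None) if unsliced, else an even split."""
    if numslices == 1:
        return None, None
    return (SPACE * i) // numslices, (SPACE * (i + 1)) // numslices

def claims_patched(lower, upper, digest_value):
    """The patched test: lower == upper only when both are None."""
    return (lower == upper) or (lower <= digest_value < upper)

numslices = 4
# One probe value just inside each slice's range.
probe = [bounds(i, numslices)[0] + 1 for i in range(numslices)]

# For each probe, list the slices that claim it under the patched test.
owners = [[i for i in range(numslices)
           if claims_patched(*bounds(i, numslices), v)]
          for v in probe]
print(owners)

# An unsliced qrunner still accepts everything.
print(claims_patched(None, None, probe[2]))
```

Under these assumptions each probe is claimed by exactly one slice, and
the single-instance case still takes every file.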


Thanks!

Brian.
-- 
Brian Greenberg
grnbrg at gmail.com


