[Mailman-Users] Problems with multi-machine slicing

Sun May 25 03:53:19 CEST 2014

It's odd, I could have sworn the slicing used to be done per recipient, 
not per message.  I've had to check logs for a client to confirm her 
messages went out, and generally had to check all three machines to verify 
every user received the message.

The rest of the process works as I expected it to.  I think where I 
flubbed up initially (last time I thought it wasn't working right and 
went through my process again) was that I had applied the patch to 
Switchboard.py to all four machines.  It's just weird that it took so 
many restarts (and some reboots) before it started slicing properly 
again.  I haven't made any config changes since I sent my initial plea 
for help to the list last night.

At least now it looks like I have my notes in order and I can finish 
getting rid of the other ubuntu machines.  Thanks for the help!

On 05/24/2014 04:56 PM, Mark Sapiro wrote:
> On 05/24/2014 03:05 PM, Jeff Taylor wrote:
>> After stopping mailman, machine #1 shows:
>> May 24 15:23:34 2014 (11512) Master qrunner detected subprocess exit
>> (pid: 11516, sig: None, sts: 15, class: IncomingRunner, slice: 1/3)
>>
>> Machine #2:
>> May 24 15:21:56 2014 (12767) Master qrunner detected subprocess exit
>> (pid: 12769, sig: None, sts: 15, class: BounceRunner, slice: 2/3)
>>
>> Machine #3:
>> May 24 15:22:16 2014 (31849) Master qrunner detected subprocess exit
>> (pid: 31858, sig: None, sts: 15, class: VirginRunner, slice: 3/3)
>
> OK, that looks good.
>
>
>> Now for even more strangeness...  After restarting mailman I sent
>> another test message.  Just so you know, my test list has three email
>> addresses in it, so I would expect the messages to get split up
>> generally between the three machines (and please confirm my
>> understanding... if the list has three users on it, each one of the
>> three machines should forward one message to one user from the list?).
>
> No. That's not the way it works. See below.
>
>
>> However after restarting and sending 7 more tests, it seems to bounce
>> between machine #1 and #2 sending the messages.  In each case, one
>> machine sends the message to ALL users.  After waiting about 15 minutes
>> I sent several more test messages.  Now it seems to be randomly picking
>> one of the three machines to send from, but again the copy to all users
>> is sent from that one machine.  I suppose that is better than it was --
>> at least now all three machines are being used.  Is this the way it's
>> supposed to be working?
>
> I think so.
>
> Here's the detail. First the general flow.
>
> 1) A post arrives and is queued in the in/ queue.
> 2) It is picked up by IncomingRunner and processed through the handler
> pipeline.
> 3) Assuming it is not held for any reason, it will get queued in the
> archive/ queue for ArchRunner and in the out/ queue for OutgoingRunner.
> It will also be added to the list's digest.mbox, to be sent later to
> digest members as part of a digest. That digest will be created and
> queued in the virgin/ queue for VirginRunner, which will ultimately
> queue it in out/ for delivery.
> 4) ArchRunner will pick up the message from the archive/ queue and
> archive it.
> 5) OutgoingRunner will pick up the message from the out/ queue and
> deliver it to the recipients.
>
> Before we look at slicing, we see that once OutgoingRunner has a
> message, it will deliver it to all its recipients, so a single post
> will always be delivered from the one machine whose OutgoingRunner
> picked it up from the out/ queue.
>
> Now for slicing. Whenever a message is queued, whether for the in/ queue
> by mail delivery or some other queue by some handler or other process,
> it gets a file name of the form tttt+hhhhhhhh.pck. The tttt part is a
> time stamp so we can ensure fifo processing. The hhhhhhhh part is a hex
> digest of a sha1 hash of the message, the listname and the current time.
> Slicing works by dividing that hash space into n equal slices (in your
> case 3 with slice 0 being the first third, slice 1 the middle third and
> slice 2 the last third).
>
> So when a runner that is processing slice 0, say, looks at its queue, it
> will only process those messages in the first third of the hash space.
>
> So bottom line, an incoming message will be queued in the in/ queue and
> it has an equal chance of being in any slice and will be picked up by
> the machine processing that slice. Then the message will be later
> requeued in out/ probably with a different hash. The time is in seconds
> and may or may not have changed, but the message has likely changed due
> to subject prefixing, content filtering and/or header refolding. So it
> will be picked up by the OutgoingRunner processing its slice, and that
> one runner will deliver to all recipients.
>
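
The slicing described above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions, not Mailman's actual code: the
names slice_bounds, queue_filebase and slice_for are hypothetical, the hash
input is assumed to be the message text, list name and time simply
concatenated, and slices are numbered from 0 as in the explanation.

```python
import hashlib

SHA1_MAX = (1 << 160) - 1  # largest value a 160-bit SHA-1 digest can take


def slice_bounds(slice_index, num_slices):
    """Return the inclusive (lower, upper) hash range for one slice."""
    lower = ((SHA1_MAX + 1) * slice_index) // num_slices
    upper = ((SHA1_MAX + 1) * (slice_index + 1)) // num_slices - 1
    return lower, upper


def queue_filebase(message_text, listname, now):
    """Build a 'timestamp+hexdigest' file base name (assumed hash input)."""
    hashfood = (message_text + listname + repr(now)).encode("utf-8")
    return repr(now) + "+" + hashlib.sha1(hashfood).hexdigest()


def slice_for(filebase, num_slices):
    """Return the index of the slice whose range contains the file's hash."""
    digest = filebase.rsplit("+", 1)[1]
    value = int(digest, 16)
    for index in range(num_slices):
        lower, upper = slice_bounds(index, num_slices)
        if lower <= value <= upper:
            return index
    raise ValueError("digest outside the hash space")


if __name__ == "__main__":
    # The same post before and after a subject prefix is added: the hash
    # changes, so the in/ and out/ entries may land in different slices.
    incoming = queue_filebase("Subject: test\n\nhello", "testlist", 1400975599.0)
    outgoing = queue_filebase("Subject: [Test] test\n\nhello", "testlist", 1400975599.0)
    print(incoming, "-> slice", slice_for(incoming, 3))
    print(outgoing, "-> slice", slice_for(outgoing, 3))
```

Each message lands in exactly one slice, and the slice is chosen per queue
entry, not per recipient, which matches one machine delivering a given post
to every recipient.
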
>
>> Regarding the upgrade version, it's been too long, I'm afraid I don't
>> know what the old version was.  The old machines are running ubuntu
>> oneiric and now have mailman 2.1.14.  The newer machines have debian
>> wheezy and mailman 2.1.15.  The upgrades happened a few months back, but
>> I only noticed the issue yesterday because I am trying to get rid of the
>> ubuntu machines and replace them with the debian machines.  The messages
>> have been getting delivered, but apparently one machine was handling
>> everything.
>
> I was curious because it would help me know if there had been relevant
> changes, but I think it's working as it's supposed to and probably as it
> was before.
>