Manipulate mailman in / out queue

Is it safe to move files in and out of Mailman's qfiles/in and qfiles/out directories while the qrunners are running?
We have an empty 'in' queue but a huge 'out' queue. There might be bad messages stuck somewhere. I saw some past posts saying that you can move files to another place, move them back in batches, and try to identify which one is bad.
Can you do that (drop in files, or move the files out) while the service is running, without crashing the service or losing data?
Thanks!
Xueshan
-- Xueshan Feng Infrastructure Delivery Group, IT Services Stanford University

Xueshan Feng <sfeng@stanford.edu> wrote:
Can you do that (drop in files, or move the files out), while the service is running, without crashing service or losing data.
Probably yes, but don't.
Either OutgoingRunner has died (check Mailman's qrunner log) or your out queue is backlogged.
If there is a bad message causing issues, it is in the one out queue entry with a .bak extension. If the queue is backlogged, messages will be processing. Check Mailman's smtp log and the archives of this list.
-- Mark Sapiro <mark@msapiro.net> Sent from my Android phone with K-9 Mail. Please excuse my brevity.

On Mon, Oct 15, 2012 at 2:13 PM, Mark Sapiro <mark@msapiro.net> wrote:
Xueshan Feng <sfeng@stanford.edu> wrote:
Can you do that (drop in files, or move the files out), while the service is running, without crashing service or losing data.
Probably yes, but don't.
Either OutgoingRunner has died (check Mailman's qrunner log) or your out queue is backlogged.
Yes, the queue was backlogged because the outgoing SMTP server it uses had a service outage. Once the queue size had climbed to a few thousand, processing of the out queue was really, really slow, even after the SMTP service had recovered (judging by tailing the smtp log and post log).
If there is a bad message causing issues, it is in the one out queue entry with a .bak extension. If the queue is backlogged, messages will be processing. Check Mailman's smtp log and the archives of this list.
If I want to move quite a few *.bak files aside (using the timestamp as an indicator of how long they've been in that state), is it necessary to stop the service, move the files, then restart the service? We have about 37,000 lists. Sometimes when I try to restart (/etc/init.d/mailman restart), OutgoingRunner won't go away and has to be killed with -9.
So I was wondering whether moving files out of the queue without first stopping Mailman caused the OutgoingRunner to suffer.
Thank you for your quick reply!
Xueshan
-- Xueshan Feng Infrastructure Delivery Group, IT Services Stanford University

Xueshan Feng <sfeng@stanford.edu> wrote:
On Mon, Oct 15, 2012 at 2:13 PM, Mark Sapiro <mark@msapiro.net> wrote:
Xueshan Feng <sfeng@stanford.edu> wrote:
if I want to move quite a few *.bak aside (use timestamp as an indicator of how long they've been in that state), Is it necessary to stop the service, move files, then restart service? We have about 37,000 lists. Sometimes when I try to restart (/etc/init.d/mailman restart), OutgoingRunner won't go away, and had to be killed with -9.
This is really more involved than I can explain without a keyboard which I won't have before Tues eve, but there should be only one .bak file or one per slice if the runner is sliced. This is the message currently being processed. All others are ignored by the current runner (they will be "recovered" if the runner is restarted).
So I was wondering by moving files out of the queue without first stopping mailman, caused the OutgoingRunner to suffer.
Probably not, but it is possible. More likely, it couldn't be SIGTERMed because it was waiting for a SMTP response.
Note that part of the slowness at this point is due to the size of the out directory. You can address this by stopping Mailman, moving qfiles/out aside, starting Mailman (which should recreate qfiles/out at the first message if not before) and then moving old entries back a few at a time.
-- Mark Sapiro <mark@msapiro.net> Sent from my Android phone with K-9 Mail. Please excuse my brevity.

On Mon, Oct 15, 2012 at 9:35 PM, Mark Sapiro <mark@msapiro.net> wrote:
Xueshan Feng <sfeng@stanford.edu> wrote:
On Mon, Oct 15, 2012 at 2:13 PM, Mark Sapiro <mark@msapiro.net> wrote:
Xueshan Feng <sfeng@stanford.edu> wrote:
if I want to move quite a few *.bak aside (use timestamp as an indicator of how long they've been in that state), Is it necessary to stop the service, move files, then restart service? We have about 37,000 lists. Sometimes when I try to restart (/etc/init.d/mailman restart), OutgoingRunner won't go away, and had to be killed with -9.
This is really more involved than I can explain without a keyboard which I won't have before Tues eve, but there should be only one .bak file or one per slice if the runner is sliced. This is the message currently being processed. All others are ignored by the current runner (they will be "recovered" if the runner is restarted).
This helps a lot already. We do have multiple runners.
So I was wondering by moving files out of the queue without first stopping mailman, caused the OutgoingRunner to suffer.
Probably not, but it is possible. More likely, it couldn't be SIGTERMed because it was waiting for a SMTP response.
Makes sense.
Note that part of the slowness at this point is due to the size of the out directory.
I was able to flush the queue today by moving long-lasting *.bak files out of the way, and at the same time stopping Postfix to allow Mailman to process its queue. It took about half an hour to process 8000+ messages. Without manual intervention, it might have taken a few hours.
You can address this by stopping Mailman, moving qfiles/out aside, starting Mailman (which should recreate qfiles/out at the first message if not before) and then moving old entries back a few at a time.
I think I've done that before. So Mailman doesn't have to be stopped when moving files back into the queue in batches?
The real operational question here is this: if we have to stop and start Mailman each time we move files, then for large-volume queues it would take a lot of manual work. The procedure I have used is:
- stop mailman
- move queue files or .bak file aside
- start mailman
- move some files back, or .bak back into the queue (note files are moved back while mailman is running)
Does that sound right? Thank you so much for your help!
Xueshan
-- Xueshan Feng Infrastructure Delivery Group, IT Services Stanford University

Xueshan Feng wrote:
On Mon, Oct 15, 2012 at 9:35 PM, Mark Sapiro <mark@msapiro.net> wrote:
This is really more involved than I can explain without a keyboard which I won't have before Tues eve, but there should be only one .bak file or one per slice if the runner is sliced. This is the message currently being processed. All others are ignored by the current runner (they will be "recovered" if the runner is restarted).
This helps a lot already. We do have multiple runners.
Here are the gory details. All the heavy lifting is done by methods of the Switchboard class defined in Mailman/Queue/Switchboard.py.
Any particular runner is specific to a particular queue or slice of a queue. The out/ queue is processed by OutgoingRunner. If it isn't sliced, it processes the whole queue. If it is sliced, there are N slices.
Note: The filename of a queue entry consists of a time stamp, a '+', a 40 hex digit hash and the extension (.pck or .bak). A slice consists of (1/N)th of the hash space. E.g., if N = 4, slice 0 is all hashes with first hex digit = 0, 1, 2 or 3; slice 1 is all hashes with first hex digit = 4, 5, 6 or 7; slice 2 is all hashes with first hex digit = 8, 9, A or B, and slice 3 is all hashes with first hex digit = C, D, E or F.
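The slice arithmetic in the note above can be sketched roughly as follows. This is an illustration only: the function name `slice_for` is hypothetical, and the integer arithmetic is an assumption chosen to reproduce the N = 4 example, not a copy of Mailman's own code.

```python
import os

HASH_SPACE = 16 ** 40  # 40 hex digits of hash space


def slice_for(filename, num_slices):
    """Return the 0-based slice number a queue entry would fall in.

    Assumes the name looks like <timestamp>+<40 hex digits>.<ext>,
    as described in the note above.
    """
    base = os.path.basename(filename)
    stem, _ext = os.path.splitext(base)      # drops .pck or .bak
    _timestamp, digest = stem.split('+', 1)  # hash is after the '+'
    if len(digest) != 40:
        raise ValueError('unexpected hash length in %r' % filename)
    # Each slice covers an equal (1/N)th share of the hash space.
    return int(digest, 16) * num_slices // HASH_SPACE
```

With `num_slices=4`, a hash beginning with 0 through 3 maps to slice 0 and one beginning with C through F maps to slice 3, matching the example.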
A particular slice of OutgoingRunner initializes its Switchboard instance once at startup or restart. This creates the queue directory (qfiles/out/, or whatever queue this runner processes) if necessary, sets the upper and lower hash bounds for its slice if sliced and, normally, recovers all the .bak files in its slice. Recovery consists of incrementing a recovery count in the entry's metadata and renaming it from *.bak to *.pck. Thus, immediately after (re)starting a runner, there will be no *.bak files in its slice. The counter is there to stop loops where messages crash the runner. A .bak file will be recovered at most 3 times and then moved to qfiles/bad/*.psv.
After initialization, a runner first obtains a list of all the .pck files in its slice, sorted by timestamp so the list is FIFO. It then processes the list until the list is exhausted, sleeps for a second and gets a new list and repeats the process. If the new list is empty, it just sleeps a second and tries again until it gets one or more entries to process.
Processing consists of renaming the file from *.pck to *.bak, unpickling it and processing it. If it crashes in processing, it will recover the .bak file upon restart. Thus, there should never be more than one .bak file per slice.
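To check a live queue against this description, a small read-only scan along these lines may help. It is a sketch, not Mailman code: `scan_queue` is a hypothetical helper, and it assumes the part of each file name before the '+' parses as a floating-point timestamp.

```python
import os


def _timestamp(name):
    # Assumption: the text before '+' is a float-like time stamp.
    return float(name.split('+', 1)[0])


def scan_queue(qdir):
    """Return (pending, claimed) for one queue directory.

    pending -- .pck names sorted oldest first (the FIFO order above)
    claimed -- .bak names, i.e. entries a runner has renamed for processing
    """
    pending = []
    claimed = []
    for name in os.listdir(qdir):
        stem, ext = os.path.splitext(name)
        if '+' not in stem:
            continue  # not shaped like a queue entry; ignore it
        if ext == '.pck':
            pending.append(name)
        elif ext == '.bak':
            claimed.append(name)
    pending.sort(key=_timestamp)
    claimed.sort(key=_timestamp)
    return pending, claimed
```

If `claimed` holds more entries than there are slices, the extras are presumably stale .bak files that the running runner is ignoring.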
Note that part of the slowness at this point is due to the size of the out directory.
I was able to flush the queue today by moving long lasting *.bak out of the way, and at the same time stopped Postfix to allow mailman to process its queue. It took about half an hour to process 8000+ messages. If no manual intervene, it may take a few hours.
You can address this by stopping Mailman, moving qfiles/out aside, starting Mailman (which should recreate qfiles/out at the first message if not before) and then moving old entries back a few at a time.
I think I've done that before. So moving back files into the queue in batches, doesn't have to stop mailman?
First of all, the actual physical size of the queue directory impacts processing. Every time an entry is added to the queue, and every time a .pck file is renamed to .bak, the entire physical directory must be searched to ensure this isn't a duplicate name. Depending on OS settings, cache sizes and the physical directory size, this may actually involve multiple disk reads each time. Thus, if the qfiles/out/ directory has grown large because 8000+ messages were added to the queue when the runner couldn't handle them (and there may have been more in the retry/ queue because of SMTP failures), it would benefit from shrinking. This is accomplished by moving (mv) or renaming the queue directory itself aside, not just its contents, and then letting the runner recreate it when it starts. Then, if necessary, move messages back a few at a time so the directory doesn't grow large again.
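Setting the directory itself aside could look like the sketch below. The helper name `set_queue_aside` and the `.aside-<timestamp>` naming are made up for illustration, and it assumes Mailman has already been stopped before it is called.

```python
import os
import time


def set_queue_aside(queue_dir):
    """Rename the whole queue directory aside and return the new path.

    Assumption: Mailman is stopped, so no runner is using the directory.
    The directory is not recreated here; per the explanation above, the
    runner does that when it starts.
    """
    queue_dir = queue_dir.rstrip(os.sep)
    aside = '%s.aside-%s' % (queue_dir, time.strftime('%Y%m%d%H%M%S'))
    os.rename(queue_dir, aside)  # moves the directory, not its contents
    return aside
```

Because the rename stays in the same parent directory, the set-aside entries remain on the same file system as the new queue directory, which matters when moving them back.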
The real operational question here is each time if we have to stop / start mailman to move files, than for large volume queues, it would take a lot of manual process. The procedure I have used is:
- stop mailman
- move queue files or .bak file aside
Move the whole directory, not the contents.
- start mailman
- move some files back, or .bak back into the queue (note files are moved back while mailman is running)
Moving (mv or rename) files back from the same file system while Mailman is running is fine. When the entry appears in the directory in this case, the file contents are complete. This is essentially what Mailman does when it makes a queue entry. Copying (cp) is not good because there can be a directory entry for the file before its contents are complete, and a runner could read an incomplete file.
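Moving entries back in small batches can be sketched like this. The helper `move_back` and its `batch_size` parameter are hypothetical; the only thing taken from the explanation above is the use of rename on the same file system, which `os.rename` provides.

```python
import os


def move_back(aside_dir, queue_dir, batch_size=100):
    """Move up to batch_size .pck entries from aside_dir into queue_dir.

    Uses os.rename so each entry appears in the queue with its contents
    already complete. Returns the list of names moved.
    """
    # A rename across file systems is not possible; a copy would be
    # needed instead, which is exactly what should be avoided here.
    if os.stat(aside_dir).st_dev != os.stat(queue_dir).st_dev:
        raise RuntimeError('directories are on different file systems')
    # Lexical sort is roughly oldest-first while time stamps have the
    # same number of digits.
    names = sorted(n for n in os.listdir(aside_dir) if n.endswith('.pck'))
    moved = []
    for name in names[:batch_size]:
        os.rename(os.path.join(aside_dir, name),
                  os.path.join(queue_dir, name))
        moved.append(name)
    return moved
```

Calling it repeatedly, and waiting for the queue to drain between calls, keeps the live directory from growing large again.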
-- Mark Sapiro <mark@msapiro.net>, San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan

Mark Sapiro wrote:
Xueshan Feng wrote:
The real operational question here is each time if we have to stop / start mailman to move files, than for large volume queues, it would take a lot of manual process. The procedure I have used is:
- stop mailman
- move queue files or .bak file aside
Move the whole directory, not the contents.
- start mailman
- move some files back, or .bak back into the queue (note files are moved back while mailman is running)
It's implied by the rest of my reply, but moving a .bak file into the queue while the runner for that slice is running does nothing until that runner is stopped or crashes and is restarted. If you want to actually process a .bak file you've moved aside, rename it .pck before moving it back.
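Renaming a set-aside .bak entry to .pck while moving it back can be done in a single rename, as in this sketch. The helper name `requeue_bak` is hypothetical, and it assumes the set-aside file is on the same file system as the queue directory.

```python
import os


def requeue_bak(bak_path, queue_dir):
    """Move one set-aside .bak entry into queue_dir under a .pck name.

    Returns the new path. The entry's contents and metadata are left
    untouched; only the name and location change.
    """
    name = os.path.basename(bak_path)
    stem, ext = os.path.splitext(name)
    if ext != '.bak':
        raise ValueError('not a .bak entry: %r' % bak_path)
    target = os.path.join(queue_dir, stem + '.pck')
    if os.path.exists(target):
        raise RuntimeError('refusing to overwrite %r' % target)
    os.rename(bak_path, target)  # one step: new extension, new directory
    return target
```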
Note that you can examine the messages in queue entries with Mailman's bin/show_qfiles and see the messages and metadata with Mailman's bin/dumpdb. This may help in deciding whether to reprocess a particular entry. But, in your case, where the backlog in processing was due to an MTA outage, all the entries should be good.
-- Mark Sapiro <mark@msapiro.net>, San Francisco Bay Area, California
The highway is for gamblers, better use your sense - B. Dylan

Hi Mark,
Thank you so much for taking the time to explain the details of queue processing, the meaning of the file names in the queue, and the right way to handle files in the queue (like mv vs. cp).
I have been managing the campus mailing service for a long time. Once in a while we would get mailbombed, and messages would get stuck in the queue and take a long time to drain. I have tried moving files and moving directories, with and without restarting the service. Sometimes the queue would get processed quickly after my intervention, sometimes it did not. I also worried about losing messages by manually messing with the queue files.
It helps a lot if one understands how things work. No more trial and error next time if I have to handle backlogs! Thank you!
Xueshan
-- Xueshan Feng Infrastructure Delivery Group, IT Services Stanford University
participants (2): Mark Sapiro, Xueshan Feng