
Xueshan Feng wrote:
On Mon, Oct 15, 2012 at 9:35 PM, Mark Sapiro <mark@msapiro.net> wrote:
This is really more involved than I can explain without a keyboard which I won't have before Tues eve, but there should be only one .bak file or one per slice if the runner is sliced. This is the message currently being processed. All others are ignored by the current runner (they will be "recovered" if the runner is restarted).
This helps a lot already. We do have multiple runners.
Here are the gory details. All the heavy lifting is done by methods of the Switchboard class defined in Mailman/Queue/Switchboard.py.
Any particular runner is specific to a particular queue or slice of a queue. The out/ queue is processed by OutgoingRunner. If it isn't sliced, it processes the whole queue. If it is sliced, there are N slices.
Note: The filename of a queue entry consists of a time stamp, a '+', a 40 hex digit hash and the extension (.pck or .bak). A slice consists of (1/N)th of the hash space. E.g., if N = 4, slice 0 is all hashes with first hex digit = 0, 1, 2 or 3; slice 1 is all hashes with first hex digit = 4, 5, 6 or 7; slice 2 is all hashes with first hex digit = 8, 9, A or B, and slice 3 is all hashes with first hex digit = C, D, E or F.
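To make the slice arithmetic concrete, here is a minimal sketch in Python. It is not the Switchboard code itself; the function names (slice_bounds, parse_entry, slice_of) are made up for illustration, and it only assumes the filename layout described above (time stamp, '+', 40 hex digit hash, extension).

```python
import os

# Number of distinct values a 40 hex digit hash can take.
HASH_SPACE = 16 ** 40


def slice_bounds(slice_index, num_slices):
    """Return the inclusive (lower, upper) hash bounds of one slice."""
    lower = HASH_SPACE * slice_index // num_slices
    upper = HASH_SPACE * (slice_index + 1) // num_slices - 1
    return lower, upper


def parse_entry(filename):
    """Split 'timestamp+hash.ext' into (timestamp, hash, ext)."""
    base, ext = os.path.splitext(filename)
    timestamp, digest = base.split('+', 1)
    return float(timestamp), digest, ext


def slice_of(filename, num_slices):
    """Return the index of the slice that owns this queue entry."""
    _, digest, _ = parse_entry(filename)
    return int(digest, 16) * num_slices // HASH_SPACE
```

With N = 4, slice_of() on a name whose hash starts with 0 through 3 gives 0, 4 through 7 gives 1, and so on, matching the example above.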
A particular slice of OutgoingRunner initializes its Switchboard instance once at startup or restart. This creates the queue directory (qfiles/out/, or whatever queue this runner processes) if necessary, sets the upper and lower hash bounds for its slice if sliced and, normally, recovers all the .bak files in its slice. Recovery consists of incrementing a recovery count in the entry's metadata and renaming it from *.bak to *.pck. Thus, immediately after (re)starting a runner, there will be no *.bak files in its slice. The counter is there to stop loops where a message crashes the runner repeatedly. A .bak file will be recovered at most 3 times and then moved to qfiles/bad/*.psv.
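The recovery step can be sketched roughly as follows. This is an illustration only, not the real implementation: it assumes each entry is a pickled message followed by a pickled metadata dict, and the key name '_bak_count', the function name and the '.tmp' suffix are assumptions made for the example. Slice bounds are ignored to keep it short.

```python
import os
import pickle

MAX_RECOVERIES = 3  # "recovered at most 3 times", per the text above


def recover_bak_files(queue_dir, bad_dir):
    """Rename *.bak to *.pck, counting recoveries; return how many."""
    recovered = 0
    for name in sorted(os.listdir(queue_dir)):
        base, ext = os.path.splitext(name)
        if ext != '.bak':
            continue
        bak_path = os.path.join(queue_dir, name)
        with open(bak_path, 'rb') as fp:
            msg = pickle.load(fp)
            meta = pickle.load(fp)
        count = meta.get('_bak_count', 0) + 1
        if count > MAX_RECOVERIES:
            # Already recovered 3 times: set it aside instead of looping.
            os.makedirs(bad_dir, exist_ok=True)
            os.rename(bak_path, os.path.join(bad_dir, base + '.psv'))
            continue
        meta['_bak_count'] = count
        # Write the updated entry to a temp file, then swap it in whole.
        tmp_path = bak_path + '.tmp'
        with open(tmp_path, 'wb') as fp:
            pickle.dump(msg, fp)
            pickle.dump(meta, fp)
        os.replace(tmp_path, bak_path)
        os.rename(bak_path, os.path.join(queue_dir, base + '.pck'))
        recovered += 1
    return recovered
```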
After initialization, a runner first obtains a list of all the .pck files in its slice, sorted by timestamp so the list is FIFO. It then processes the list until the list is exhausted, sleeps for a second and gets a new list and repeats the process. If the new list is empty, it just sleeps a second and tries again until it gets one or more entries to process.
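Building the FIFO list amounts to something like the sketch below. The function name fifo_entries is hypothetical, and slice filtering is left out; it only relies on the filename starting with the time stamp.

```python
import os


def fifo_entries(queue_dir):
    """Return the .pck filenames in queue_dir, oldest time stamp first."""
    entries = []
    for name in os.listdir(queue_dir):
        base, ext = os.path.splitext(name)
        if ext != '.pck' or '+' not in base:
            continue  # skip .bak files and anything that isn't an entry
        timestamp = float(base.split('+', 1)[0])
        entries.append((timestamp, name))
    entries.sort()
    return [name for _, name in entries]
```

A runner loop would call this, work through the result, sleep a second and call it again.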
Processing consists of renaming the file from *.pck to *.bak, unpickling it and processing it. If it crashes in processing, it will recover the .bak file upon restart. Thus, there should never be more than one .bak file per slice.
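In outline, handling one entry looks something like this. Again a sketch under assumptions, not the actual code: the two-pickle file layout, the function names, and the removal of the .bak file after successful processing are my assumptions for the example (the removal is implied by "never more than one .bak file per slice").

```python
import os
import pickle


def dequeue(queue_dir, name):
    """Rename name from .pck to .bak and load it; return (msg, meta, bak)."""
    base, _ = os.path.splitext(name)
    pck_path = os.path.join(queue_dir, base + '.pck')
    bak_path = os.path.join(queue_dir, base + '.bak')
    # Rename first, so a crash from here on leaves a .bak to recover.
    os.rename(pck_path, bak_path)
    with open(bak_path, 'rb') as fp:
        msg = pickle.load(fp)
        meta = pickle.load(fp)
    return msg, meta, bak_path


def process_entry(queue_dir, name, handler):
    """Dequeue one entry, hand it to handler, then drop the .bak file."""
    msg, meta, bak_path = dequeue(queue_dir, name)
    handler(msg, meta)    # if this raises, the .bak file stays behind
    os.remove(bak_path)   # assumed: the backup is removed on success
```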
Note that part of the slowness at this point is due to the size of the out directory.
I was able to flush the queue today by moving the long-lasting *.bak files out of the way, and at the same time stopping Postfix to allow mailman to process its queue. It took about half an hour to process 8000+ messages. Without manual intervention, it might have taken a few hours.
You can address this by stopping Mailman, moving qfiles/out aside, starting Mailman (which should recreate qfiles/out at the first message if not before) and then moving old entries back a few at a time.
I think I've done that before. So moving files back into the queue in batches doesn't require stopping mailman?
First of all, the actual physical size of the queue directory impacts processing. Every time an entry is added to the queue, and every time a .pck file is renamed to .bak, the entire physical directory must be searched to ensure this isn't a duplicate name. Depending on OS settings, cache sizes and the physical directory size, this may actually involve multiple disk reads each time. Thus, if the qfiles/out/ directory has grown large because 8000+ messages were added to the queue when the runner couldn't handle them (and there may have been more in the retry/ queue because of SMTP failures), it would benefit from shrinking. This is accomplished by moving (mv) or renaming the queue directory itself aside, not just its contents, and then letting the runner recreate it when it starts. Then, if necessary, move messages back a few at a time so the directory doesn't grow large again.
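If you want to script the "move the directory aside" step, a sketch could look like this. The function name and the '.aside-<time>' suffix are made up for the example, and it must only be run while Mailman is stopped.

```python
import os
import time


def move_queue_aside(queue_dir):
    """Rename the whole queue directory aside; return the new path.

    Run this only while Mailman is stopped. The runner is expected to
    recreate queue_dir itself when it starts.
    """
    queue_dir = queue_dir.rstrip(os.sep)
    aside = '%s.aside-%d' % (queue_dir, int(time.time()))
    # Renaming the directory itself (not its contents) is what lets the
    # recreated directory start out physically small.
    os.rename(queue_dir, aside)
    return aside
```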
The real operational question here is: if we have to stop and start mailman each time we move files, then for large-volume queues it would take a lot of manual work. The procedure I have used is:
- stop mailman
- move queue files or .bak file aside
Move the whole directory, not the contents.
- start mailman
- move some files back, or .bak back into the queue (note files are moved back while mailman is running)
Moving (mv or rename) files back from the same file system while Mailman is running is fine. When the entry appears in the directory in this case, the file contents are complete. This is essentially what Mailman does when it makes a queue entry. Copying (cp) is not good because there can be a directory entry for the file before its contents are complete, and a runner could read an incomplete file.
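For moving entries back in batches while Mailman runs, something along these lines would do. It is a sketch: the function name and the default batch size are arbitrary choices for the example. It uses rename rather than copy and refuses to work across file systems, for the reason given above.

```python
import os


def move_back(aside_dir, queue_dir, batch_size=100):
    """Rename up to batch_size .pck files from aside_dir into queue_dir."""
    # A rename only makes a complete file appear in one step when both
    # directories are on the same file system, so check that first.
    if os.stat(aside_dir).st_dev != os.stat(queue_dir).st_dev:
        raise RuntimeError('directories are on different file systems')
    names = sorted(n for n in os.listdir(aside_dir) if n.endswith('.pck'))
    moved = []
    for name in names[:batch_size]:
        os.rename(os.path.join(aside_dir, name),
                  os.path.join(queue_dir, name))
        moved.append(name)
    return moved
```

Call it repeatedly, waiting between calls for the runner to drain the queue, until it returns an empty list.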
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan