Problems if shunting fails
data:image/s3,"s3://crabby-images/92078/920789fca9c5f85bcff835faa6ab7bec03f2f165" alt=""
There is a bug report at <https://sourceforge.net/tracker/index.php?func=detail&aid=1656289&group_id=103&atid=100103>.
The basic issue here is there is an extremely large message in the 'in' queue. In followup emails with the submitter, he said "The file was nearly half a terabyte in size".
Not surprisingly, IncomingRunner threw a MemoryError when SpamDetect tried to flatten() the message, but then Runner tried to shunt the message and Switchboard.enqueue() threw another MemoryError in the attempt to pickle the new shunt queue entry.
The second MemoryError is uncaught in Runner._oneloop, so it causes the main loop in Runner.run to exit.
Pre-Mailman 2.1.9, the master just restarts IncomingRunner - the message is lost, but everything else is OK.
Because of the changes in 2.1.9 to prevent message loss in case of disaster, there is now a .bak file left in the 'in' queue. When IncomingRunner restarts, it recovers the .bak file and the whole scenario repeats until the master reaches MAX_RESTARTS on IncomingRunner and we are left with no IncomingRunner and the .bak file still in the 'in' queue.
In order to fix this, I suggest we protect the shunt enqueue in a try. I have worked up a patch for this which is attached. This patch also adds a 'preserve' argument to Switchboard.finish such that if it is called with preserve=True, instead of removing the .bak file, it just renames it with a .psv extension. These changes ensure that IncomingRunner doesn't exit, and no .bak file is left to cause a subsequent problem while still preserving the original queue entry for further analysis if possible.
I would like some feedback on whether or not this is the right approach.
-- Mark Sapiro <msapiro@value.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
data:image/s3,"s3://crabby-images/92078/920789fca9c5f85bcff835faa6ab7bec03f2f165" alt=""
Mark Sapiro wrote:
Since posting the above a couple of weeks ago, Another situation has come up on mailman-users with a different scenario. The thread is at <http://mail.python.org/pipermail/mailman-users/2007-February/055809.html>. In this case, the problem was an unparseable message, and part of the issue was due to the user not having email 2.5.8 installed in pythonlib. Correcting this avoided the problem in her case, but there is still an underlying issue here in that the only exception that we catch is email.Errors.MessageParseError, and at least in this case, email.message_from_string threw a ValueError.
The other part of this, is we used to just give up on this message when the exception occurred as we didn't have anything else we could do. Starting with Mailman 2.1.9, we still have the .bak queue entry which can be 'preserved' as above. Thus, I have modified the previous suggested patch, and a new patch is attached which catches any exception from switchboard.dequeue() and logs and preserves the queue entry.
Thus, with this patch, we have two different ways in which a queue entry can be 'preserved' for analysis. One is when dequeue() throws an exception and the other is when the attempt to enqueue() in the shunt queue throws an exception.
The question I would like to discuss, is what is the best way to preserve the queue entry for analysis. The patch just renames the .bak file to .psv and leaves it in the original queue. This could potentially over time accumulate a lot of .psv files in the 'in' or other queues and impact processing.
We can't shunt the entry in the normal way because in some cases at least, shunting has already thrown an exception. I can think of three things to do.
- Just rename the entry .psv and leave it in the original queue.
- Rename it .psv and attempt to move it to the shunt queue.
- Rename it .psv and attempt to move it to the bad queue.
I would like to get other people's thoughts on this.
-- Mark Sapiro <msapiro@value.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
data:image/s3,"s3://crabby-images/50535/5053512c679a1bec3b1143c853c1feacdabaee83" alt=""
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Hi Mark,
I'm sorry I haven't been able to respond to this thread before now,
but I've been traveling and PyCon'ing and either haven't had time or
have had spotty access to email.
On Feb 22, 2007, at 5:24 PM, Mark Sapiro wrote:
I'm pretty sure we don't want to leave the .psv file in the original
queue. I agree that over time, we'll just end up filling up the
queue directory with files we'll never process, wasting time listing
them, or worse. A major reason for having the shunt queue in the
first place was to not clutter up our processing queues with messages
we couldn't do anything about. Of course, culling our shunt queue is
another issues. :/
OTOH, if we're going to move the offending message to the shunt
queue, then there's not much point in keeping the .psv extension. We
pretty much always know that if the file is in shunt, it's bad, so
maybe we just rename the offender to shunt/blah.txt (assuming it's an
unparsed text file).
The other thing to consider is adding a configuration variable that
let's us limit the size of the files we'll handle. I can't imagine
any scenario under which we'd want to (let alone be able to) handle a
message of a half terabyte. Heck, you have to have a pretty
misconfigured mail system to even allow such a message to get to
Mailman, IMO. I think Postfix for example has a 10MB default size
limit, and even cranking that up by a factor of 10 should allow most
legitimate mail to go through.
So, if we added a size limit configuration variable, we'd have to
stat the file (os.path.getsize()) and just os.rename() the file to
shunt (with a log message) when we see a file over that size limit.
What do you think?
- -Barry
-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (Darwin)
iQCVAwUBReJo43EjvBPtnXfVAQKw8QP/Qm2twCuIkyQV4zXt1+Ei9PKoBpdBSiK9 G/LgwlUz8OCBRkonb+m4NTyMmZJTQUjqcLICvgFzZ9d+dxf5ekVA8KoXM1iJjjnE TQqS88CAgUYcWixkIfS5H1l0lVSKSGsUn8eUAXwmoVZVn3NaqCzhTo1wGV56KATc VesBGYhpGkQ= =juvu -----END PGP SIGNATURE-----
data:image/s3,"s3://crabby-images/92078/920789fca9c5f85bcff835faa6ab7bec03f2f165" alt=""
Barry Warsaw wrote:
That's OK. I almost went to PyCon myself this year, but didn't make it. Maybe next time.
I agree that it is not a good thing to leave these in the original queue. We should somehow move them elsewhere. The only reason for even thinking about leaving them in the original queue is that a rename, changing the extension only is almost guaranteed not to fail whereas a rename that changes the path could run into a permissions problem or a non-existant directory. I don't think these are significant though as the non-existant directory can be created, and permissions aren't likely to be an issue if Mailman is actually running.
I'm not so sure about this one. The file may not be unparsed text, and we may not have the metadata to find out. In any case, we don't want to put a .bak file in the shunt queue because unshunt will 'recover' it. I don't think we want that.
I think the size limit is a good idea. I think it would have avoided the MemoryError that started me off on this in the first place.
I do think, we also need to consider the possible exceptions from dequeue(). Now that we have a .bak file in the case of a MessageParseError exception, I think it is good to try to put it somewhere where it can be examined if desired. Also, we had a case of ValueError being thrown by message_from_string. Of course, the ValueError was the result of an email bug that's been fixed, but it may still be possible for message_from_string to throw exceptions other than MessageParseError, and I think we need to protect against that.
-- Mark Sapiro <msapiro@value.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
data:image/s3,"s3://crabby-images/92078/920789fca9c5f85bcff835faa6ab7bec03f2f165" alt=""
Mark Sapiro wrote:
Since posting the above a couple of weeks ago, Another situation has come up on mailman-users with a different scenario. The thread is at <http://mail.python.org/pipermail/mailman-users/2007-February/055809.html>. In this case, the problem was an unparseable message, and part of the issue was due to the user not having email 2.5.8 installed in pythonlib. Correcting this avoided the problem in her case, but there is still an underlying issue here in that the only exception that we catch is email.Errors.MessageParseError, and at least in this case, email.message_from_string threw a ValueError.
The other part of this, is we used to just give up on this message when the exception occurred as we didn't have anything else we could do. Starting with Mailman 2.1.9, we still have the .bak queue entry which can be 'preserved' as above. Thus, I have modified the previous suggested patch, and a new patch is attached which catches any exception from switchboard.dequeue() and logs and preserves the queue entry.
Thus, with this patch, we have two different ways in which a queue entry can be 'preserved' for analysis. One is when dequeue() throws an exception and the other is when the attempt to enqueue() in the shunt queue throws an exception.
The question I would like to discuss, is what is the best way to preserve the queue entry for analysis. The patch just renames the .bak file to .psv and leaves it in the original queue. This could potentially over time accumulate a lot of .psv files in the 'in' or other queues and impact processing.
We can't shunt the entry in the normal way because in some cases at least, shunting has already thrown an exception. I can think of three things to do.
- Just rename the entry .psv and leave it in the original queue.
- Rename it .psv and attempt to move it to the shunt queue.
- Rename it .psv and attempt to move it to the bad queue.
I would like to get other people's thoughts on this.
-- Mark Sapiro <msapiro@value.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
data:image/s3,"s3://crabby-images/50535/5053512c679a1bec3b1143c853c1feacdabaee83" alt=""
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Hi Mark,
I'm sorry I haven't been able to respond to this thread before now,
but I've been traveling and PyCon'ing and either haven't had time or
have had spotty access to email.
On Feb 22, 2007, at 5:24 PM, Mark Sapiro wrote:
I'm pretty sure we don't want to leave the .psv file in the original
queue. I agree that over time, we'll just end up filling up the
queue directory with files we'll never process, wasting time listing
them, or worse. A major reason for having the shunt queue in the
first place was to not clutter up our processing queues with messages
we couldn't do anything about. Of course, culling our shunt queue is
another issues. :/
OTOH, if we're going to move the offending message to the shunt
queue, then there's not much point in keeping the .psv extension. We
pretty much always know that if the file is in shunt, it's bad, so
maybe we just rename the offender to shunt/blah.txt (assuming it's an
unparsed text file).
The other thing to consider is adding a configuration variable that
let's us limit the size of the files we'll handle. I can't imagine
any scenario under which we'd want to (let alone be able to) handle a
message of a half terabyte. Heck, you have to have a pretty
misconfigured mail system to even allow such a message to get to
Mailman, IMO. I think Postfix for example has a 10MB default size
limit, and even cranking that up by a factor of 10 should allow most
legitimate mail to go through.
So, if we added a size limit configuration variable, we'd have to
stat the file (os.path.getsize()) and just os.rename() the file to
shunt (with a log message) when we see a file over that size limit.
What do you think?
- -Barry
-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.5 (Darwin)
iQCVAwUBReJo43EjvBPtnXfVAQKw8QP/Qm2twCuIkyQV4zXt1+Ei9PKoBpdBSiK9 G/LgwlUz8OCBRkonb+m4NTyMmZJTQUjqcLICvgFzZ9d+dxf5ekVA8KoXM1iJjjnE TQqS88CAgUYcWixkIfS5H1l0lVSKSGsUn8eUAXwmoVZVn3NaqCzhTo1wGV56KATc VesBGYhpGkQ= =juvu -----END PGP SIGNATURE-----
data:image/s3,"s3://crabby-images/92078/920789fca9c5f85bcff835faa6ab7bec03f2f165" alt=""
Barry Warsaw wrote:
That's OK. I almost went to PyCon myself this year, but didn't make it. Maybe next time.
I agree that it is not a good thing to leave these in the original queue. We should somehow move them elsewhere. The only reason for even thinking about leaving them in the original queue is that a rename, changing the extension only is almost guaranteed not to fail whereas a rename that changes the path could run into a permissions problem or a non-existant directory. I don't think these are significant though as the non-existant directory can be created, and permissions aren't likely to be an issue if Mailman is actually running.
I'm not so sure about this one. The file may not be unparsed text, and we may not have the metadata to find out. In any case, we don't want to put a .bak file in the shunt queue because unshunt will 'recover' it. I don't think we want that.
I think the size limit is a good idea. I think it would have avoided the MemoryError that started me off on this in the first place.
I do think, we also need to consider the possible exceptions from dequeue(). Now that we have a .bak file in the case of a MessageParseError exception, I think it is good to try to put it somewhere where it can be examined if desired. Also, we had a case of ValueError being thrown by message_from_string. Of course, the ValueError was the result of an email bug that's been fixed, but it may still be possible for message_from_string to throw exceptions other than MessageParseError, and I think we need to protect against that.
-- Mark Sapiro <msapiro@value.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
participants (2)
-
Barry Warsaw
-
Mark Sapiro