
My configuration:

  Mailman: 2.1.14
  OS: Solaris 10
  Python: 2.4.5
  PREFIX = '/usr/mailman'

Server setup: 1 server for web management, 1 server for MTA/qrunner.
/usr/mailman is NFS mounted on both servers.
I've been having the following issue with my Mailman lists:
A user is either subscribed or unsubscribed according to the logs, but then if I look at the member list, the action has not been done (or has been undone). For example, here is where I remove a subscriber and then look at the list members and they are still in the list:
[mailman@myhost] ~/logs |> grep testlist subscribe | grep acase
Sep 28 17:15:14 2011 (4401) testlist: new acase@example.com, admin mass sub
Sep 28 17:19:36 2011 (5821) testlist: deleted acase@example.com; member mgt page
[mailman@myhost] ~/logs |> ../bin/list_members testlist | grep acase
acase@example.com
[mailman@myhost] ~/logs |>
The same also happens when subscribing. When I mass subscribe users (or when users confirm subscription via email or the web), the logs indicate that they have been subscribed successfully, but when I go to look them up, they are not listed on the members list.
This happens sporadically, but I am generally able to reproduce the error if I do it a couple of times in a row.
I suspect there may be a locking issue and that config.pck is reverting to config.pck.last. I found this thread rather helpful in analyzing potential problems, but I have yet to figure anything out: http://web.archiveorange.com/archive/v/IezAOgEQf7xEYCSEJTbD
In addition, if I just run the following commands over and over, the bug never seems to come up. This is part of why I am worried about locking:

  bin/add_members ...
  bin/remove_members ...
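Roughly the kind of loop I mean, as a sketch only (the paths, list name, and option flags are placeholders from memory; check bin/add_members --help and bin/remove_members --help on your install):

    #!/usr/bin/env python
    # Sketch of the sequential add/remove loop described above -- placeholder
    # paths, list name and flags, not my actual commands.
    import subprocess

    PREFIX = '/usr/mailman'
    LISTNAME = 'testlist'
    MEMBERS_FILE = '/tmp/members.txt'   # one address per line

    for i in range(20):
        # Add, then remove, the same members -- strictly one process at a
        # time, so each command loads, locks, updates and saves the list
        # before the next one starts.
        subprocess.call([PREFIX + '/bin/add_members',
                         '-r', MEMBERS_FILE, '-w', 'n', '-a', 'n', LISTNAME])
        subprocess.call([PREFIX + '/bin/remove_members',
                         '-f', MEMBERS_FILE, '-n', '-N', LISTNAME])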
Is there a good way to test locking between servers? I've run tests/test_lockfile.py, but it reports everything is OK.
Any and all help would be GREATLY appreciated. We've been trying to triage this bug for weeks and it is terribly disruptive for our users.
Thanks,
-- Drew

On 9/28/2011 3:08 PM, Andrew Case wrote:
There is a bug in the Mailman 2.1 branch, but the above is not it. The above log shows that acase@example.com was added by admin mass subscribe at 17:15:14 and then a bit more than 4 minutes later, was removed by checking the unsub box on the admin Membership List and submitting.
If you check your web server logs, you will find POST transactions to the admin page for both these events.
This is possibly a manifestation of the bug, but I'm surprised it is happening that frequently.
The thread you point to above is relevant, but it is not a locking issue. The problem is due to list caching in Mailman/Queue/Runner.py and/or nearly concurrent processes which first load the list unlocked and later lock it. The issue is that the resolution of the config.pck timestamp is 1 second, and if a process has a list object and that list object is updated by another process within the same second as the timestamp on the first process's object, the first process won't load the updated list when it locks it. This can result in things like a subscribe being done and logged and then silently reversed.
List locking is working as it should. The issue is that the first process doesn't reload the updated list when it acquires the lock because it thinks it already has the latest version.
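As a rough illustration of that failure mode (this is not Mailman's actual code; the function and the numbers below are made up):

    import os

    # Made-up sketch of the whole-second timestamp race -- not the real
    # MailList loading code.
    def needs_reload(pckfile, cached_timestamp):
        # The file mtime and the cached timestamp are compared in whole
        # seconds; the effective rule was "skip the reload unless the file
        # is strictly newer than the copy we already hold".
        mtime = int(os.path.getmtime(pckfile))
        return mtime > cached_timestamp

    # Process A loads the list at t = 1000.4  -> cached_timestamp == 1000
    # Process B locks, updates and saves it at t = 1000.9 -> file mtime == 1000
    # When A later acquires the lock, needs_reload() returns False, so A keeps
    # its stale copy, and A's subsequent save silently reverses B's change.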
I thought I had fixed this on the 2.1 branch, but it seems I only fixed it for the now defunct 2.2 branch.
A relevant thread starts at <http://mail.python.org/pipermail/mailman-users/2008-August/062862.html> and continues at <http://mail.python.org/pipermail/mailman-developers/2008-August/020329.html>
The patch in the attached cache.patch file should fix it.
That won't do it. bin/add_members alone will do it, but only if there is a nearly concurrent process updating the same list.
The post at <http://mail.python.org/pipermail/mailman-users/2008-August/062862.html> contains a "stress test" that will probably reproduce the problem.
I suspect your Mailman server must be very busy for you to see this bug that frequently. However, it looks like I need to install the fix for Mailman 2.1.15.
It is also curious that the only reports of this that I can recall both come from Solaris users. There may be complications in your case due to NFS, but locking shouldn't be the issue. Run the stress test and see if it fails. If it does, try the patch.
Let us know what happens.
--
Mark Sapiro <mark@msapiro.net>          The highway is for gamblers,
San Francisco Bay Area, California      better use your sense - B. Dylan

Thanks Mark, see inline comments.
I was trying to show that even after the user was removed, they're still listed as a member.
If you check your web server logs, you will find POST transactions to the admin page for both these events.
Agreed.
The easiest way for me to replicate the problem is:
- check the unsubscribe box for user A then hit submit
- after reload check the unsubscribe box for user B then hit submit
- reload the "membership list" page and user B is back on the list
This happens even after I wait a couple seconds in between each step.
The result sounds the same, but would this happen even if I'm loading the page with more than a second in between each step outlined above?
I applied the patch but it doesn't seem to have made a difference.
Correct. Only one subscriber was subscribed to each test list. Keep in mind that in the stress test given, if you use a sleep counter of 5 with 6 lists, that means you're waiting _30 seconds_ before the next add_members command is run for that list (I assume the timing issue is per-list, not per run of add_members). Even if you set the timer down to 1, that's a 6-second sleep. This shouldn't affect a cache that we're comparing for the given second. Anyway, my script ran fine with the 5-second sleep (30 seconds per list add), but showed discrepancies with a 3-second sleep.
We run about 600 different mailing lists for our department and this has been a continual headache. I appreciate all the hard work you guys do.
Patch didn't seem to help. Is there an easy way to omit the caching in this?
Thanks,
Drew

Mark,
I realized I hadn't restarted my QRunners after this patch. It looks like it's working perfectly now! Even with a sleep of 0. Thanks so much!
-- Drew
On Thu, September 29, 2011 2:52 am, Andrew Case wrote:
Andrew Case
Systems Administrator
Courant Institute of Mathematical Sciences
New York University
251 Mercer St., Room 1023
New York, NY 10012-1110
Phone: 212-998-3147

Mark,
I don't know if you'll find this worth trying to fix, but I revised the stress test this morning to be a bit more stressful. The biggest change I made (besides upping the number of subscribers) was that instead of adding each member to all the lists before adding the second member, it now adds all the members to one list successively before moving on to the next list. This leaves much less time between subscriptions per list, so the config.pck files are updated within a much smaller timeframe.
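The loop-order change, roughly (a sketch only, not the attached script; the list and user names are placeholders, and the '-r -' read-from-stdin usage is from memory):

    # Sketch of the revised loop order -- outer loop over lists, inner loop
    # over users, so successive subscribes hit the same config.pck back to back.
    import subprocess

    PREFIX = '/usr/mailman'
    LISTS = ['testlist%d' % i for i in range(1, 7)]
    USERS = ['foo%d@example.com' % i for i in range(1, 11)]

    for listname in LISTS:
        for user in USERS:
            # add_members reads the addresses from stdin when given '-r -'.
            p = subprocess.Popen([PREFIX + '/bin/add_members', '-r', '-',
                                  '-w', 'n', '-a', 'n', listname],
                                 stdin=subprocess.PIPE)
            p.communicate(user + '\n')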
When I did this I saw that maybe 1-5% of the time a user was still omitted from the list (the user was silently removed). I think that because these are processed by the queue runner on a different host, and because the timestamp check is being done on an NFS-stored file, there is potential that the qrunner doesn't yet have an updated mtime for that file (or even a small NTP time drift could cause this). When I commented out the caching part of the code in MailList.py, this bug never seems to show up:

    #if mtime < self.__timestamp:
    #    # File is not newer
    #    return None, None
So I think there may still be a race condition here, but the chances are low that human interaction would trigger it. If, however, you have a script that subscribes users one after another, this could still come up. I actually happen to have such a script, but I run it on the same host as the qrunners, so I haven't experienced this before.
In my case, I think the performance gain that the caching adds is probably not worth keeping, for the sake of consistency.
Attached is the modified stress test I'm using.
Thanks again,
-- Drew
On Thu, September 29, 2011 4:13 am, Andrew Case wrote:
Andrew Case
Systems Administrator
Courant Institute of Mathematical Sciences
New York University
251 Mercer St., Room 1023
New York, NY 10012-1110
Phone: 212-998-3147

On 9/29/2011 8:58 AM, Andrew Case wrote:
Actually, that is not the cache. It is just the test for whether the current list object, cached or whatever, needs to be reloaded from disk.
I think that your configuration with NFS and possible time jitter between servers makes the bug more likely.
It can happen even where everything is on a single host, but as I said, I think your configuration makes it more likely.
In my case, I think the performance gain that the caching adds is probably not worth keeping, for the sake of consistency.
Attached is a patch to remove list caching from the qrunners. This patch has the additional advantage of limiting the growth of the qrunners over time. Old entries were supposed to be freed from the cache, but a self reference in the default MemberAdaptor prevented this from occurring.
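The idea, roughly (a sketch of the before/after shape only, not the actual Runner.py diff; the class and method names here are invented for illustration):

    # Sketch only -- not the real qrunner code.
    from Mailman import MailList

    class RunnerSketch:
        def __init__(self):
            self._listcache = {}

        # Before: keep MailList objects in a per-process cache and reuse them
        # between messages.
        def _open_list_cached(self, listname):
            mlist = self._listcache.get(listname)
            if mlist is None:
                mlist = MailList.MailList(listname, lock=False)
                self._listcache[listname] = mlist
            return mlist

        # After: construct a fresh, unlocked MailList every time, so this
        # process's view of the list can never be older than the last save.
        def _open_list_uncached(self, listname):
            return MailList.MailList(listname, lock=False)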
For reasons of trying not to be disruptive, this patch and the bug fix I sent earlier were never applied to the 2.1 branch. I think this was a mistake, and I will apply them for Mailman 2.1.15.
Attached is the modified stress test I'm using.
Thanks.
--
Mark Sapiro <mark@msapiro.net>          The highway is for gamblers,
San Francisco Bay Area, California      better use your sense - B. Dylan

Hey Mark,
Take a look at these results, which are with the 2 patches you've sent me (and qrunners restarted). I'm adding 10 users to 6 lists:
[... creating list output cut ...]
- Subscribing to testlist1
Subscribed: foo1@www.cs.nyu.edu
Subscribed: foo2@www.cs.nyu.edu
Subscribed: foo3@www.cs.nyu.edu
Subscribed: foo4@www.cs.nyu.edu
Subscribed: foo5@www.cs.nyu.edu
Subscribed: foo6@www.cs.nyu.edu
Subscribed: foo7@www.cs.nyu.edu
Subscribed: foo8@www.cs.nyu.edu
Subscribed: foo9@www.cs.nyu.edu
Subscribed: foo10@www.cs.nyu.edu
- Subscribing to testlist2
Subscribed: foo1@www.cs.nyu.edu
Subscribed: foo2@www.cs.nyu.edu
Subscribed: foo3@www.cs.nyu.edu
Traceback (most recent call last):
  File "/usr/mailman/bin/add_members", line 258, in <module>
    main()
  File "/usr/mailman/bin/add_members", line 238, in main
    addall(mlist, nmembers, 0, send_welcome_msg, s)
  File "/usr/mailman/bin/add_members", line 135, in addall
    mlist.ApprovedAddMember(userdesc, ack, 0)
  File "/usr/mailman/Mailman/MailList.py", line 948, in ApprovedAddMember
    assert self.Locked()
AssertionError
Subscribed: foo5@www.cs.nyu.edu
Subscribed: foo6@www.cs.nyu.edu
Subscribed: foo7@www.cs.nyu.edu
Subscribed: foo8@www.cs.nyu.edu
Subscribed: foo9@www.cs.nyu.edu
Subscribed: foo10@www.cs.nyu.edu
- Subscribing to testlist3
Subscribed: foo1@www.cs.nyu.edu
[... subscribing users to all other lists went fine ...]
- Subscribers for testlist1:
foo10@www.cs.nyu.edu
foo1@www.cs.nyu.edu
foo2@www.cs.nyu.edu
foo3@www.cs.nyu.edu
foo4@www.cs.nyu.edu
foo5@www.cs.nyu.edu
foo6@www.cs.nyu.edu
foo7@www.cs.nyu.edu
foo8@www.cs.nyu.edu
foo9@www.cs.nyu.edu
- Removing list testlist1
- Subscribers for testlist2:
foo1@www.cs.nyu.edu
foo2@www.cs.nyu.edu
foo3@www.cs.nyu.edu
foo5@www.cs.nyu.edu
foo6@www.cs.nyu.edu
foo7@www.cs.nyu.edu
foo8@www.cs.nyu.edu
foo9@www.cs.nyu.edu
- Removing list testlist2
[... the rest were all fine ...]
There was a locking issue with testlist2 foo4, which is fine since it doesn't report back as successful. But you'll also notice that foo10 wasn't listed as a subscriber even though it appeared as though that subscribe was successful.
Here's some errors on the very next run where I'm subscribing 10 people to each list as well:
[... cut expected results ...]
- Subscribers for testlist4:
foo10@www.cs.nyu.edu
foo1@www.cs.nyu.edu
foo2@www.cs.nyu.edu
foo3@www.cs.nyu.edu
foo4@www.cs.nyu.edu
foo5@www.cs.nyu.edu
foo6@www.cs.nyu.edu
[** no foo8 **]
foo7@www.cs.nyu.edu
foo9@www.cs.nyu.edu
- Removing list testlist4
- Subscribers for testlist5:
foo10@www.cs.nyu.edu
foo1@www.cs.nyu.edu
foo2@www.cs.nyu.edu
foo3@www.cs.nyu.edu
foo4@www.cs.nyu.edu
foo5@www.cs.nyu.edu
foo6@www.cs.nyu.edu
foo7@www.cs.nyu.edu
[** no foo8 or foo9 **]
- Removing list testlist5
[... cut expected results ...]
That's my (1-5%) failure. But when I also comment out the following:
It seems to work each time (I ran 3 tests in a row, all with expected results).
Let me know what you think.
Thanks,
-- Drew
On Thu, September 29, 2011 1:30 pm, Mark Sapiro wrote:
Andrew Case
Systems Administrator
Courant Institute of Mathematical Sciences
New York University
251 Mercer St., Room 1023
New York, NY 10012-1110
Phone: 212-998-3147

Andrew Case wrote:
The above is definitely a problem, but I can't see how it can occur unless there is some race condition at the level of the file system.
[...]
Yes, I see that and both are cause for concern. I am more concerned about the AssertionError with testlist2 foo4, because I can't see how that can happen without some file system anomaly.
The code above is a decision about whether we need to reload the list object from the file system based on the file system time stamp vs. our internal time stamp.
The original code said "if mtime <= self.__timestamp:". Since these time stamps are in whole seconds, that test (to skip loading) could succeed even if the file time stamp was a fraction of a second newer than the internal time stamp. Thus, the bug.
The fix is to make the test "if mtime < self.__timestamp:", meaning we only skip loading if the file time stamp is strictly less than the internal time stamp. But in your case, if the clock on the MTA/qrunner machine is a bit faster than the clock on the machine running the script, the internal time stamp of the qrunner process could be a second ahead of the file time stamp set by the add_members process, even though on an absolute time scale the qrunner's copy is the older one.
So, I think we're concerned here about clock skew between the machines, and in that case, commenting out the code completely as you have done makes sense.
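To make the two cases concrete (invented numbers, and skip_reload is just an illustrative stand-in for the real test, not Mailman code):

    def skip_reload(file_mtime, internal_timestamp):
        # The patched test: skip reloading only when the file on disk is
        # strictly older than the copy we already hold (whole seconds).
        return file_mtime < internal_timestamp

    # Single host, no skew: the other process saves within the same second we
    # loaded; mtime == timestamp, so the reload happens and its change survives.
    assert not skip_reload(1000, 1000)

    # Two hosts, qrunner clock about a second fast: the file really is newer,
    # but its mtime (stamped by the slower host) still compares as older, so
    # the qrunner keeps its stale copy.
    assert skip_reload(1000, 1001)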
I have filed bug reports at <https://bugs.launchpad.net/mailman/+bug/862675> and <https://bugs.launchpad.net/mailman/+bug/862683> about these issues in preparation to fixing them.
Now I don't know whether to stick with the "if mtime < self.__timestamp:" test, which will work on a single server, or to reload unconditionally as you have done, which seems to be necessary in a shared file system situation with possible time skew.
As far as the AssertionError is concerned, I don't know what to make of it. It appears to be a file system issue outside of Mailman, so I don't know how to deal with it. I think the code in Mailman's LockFile module is correct. If you are willing to try debugging this further, you could set
LIST_LOCK_DEBUGGING = True
in mm_cfg.py and restart your qrunners and try to reproduce the exception. This will log copious information to Mailman's 'locks' log which may help to understand what happened.
--
Mark Sapiro <mark@msapiro.net>          The highway is for gamblers,
San Francisco Bay Area, California      better use your sense - B. Dylan

On 9/28/2011 11:52 PM, Andrew Case wrote:
Sorry, I missed that. You are correct and this does appear to be a manifestation of the same issue.
[...]
It is tricky. Each add_members, remove_members and web CGI POST is a separate process. If these processes run sequentially, there should not be any problem, because each process will load the list, lock it, update it and save it before the next process loads it.
The problem occurs when processes run concurrently. The scenario is process A loads the list unlocked; process B locks the list and updates it; process A tries to lock the list and gets the lock after process B relinquishes it; if the timestamp on the config.pck from process B's update is in the same second as the timestamp of process A's initial load, process A thinks the list hasn't been updated and doesn't reload it after obtaining the lock. Thus, when process A saves the list, process B's changes are reversed.
This is complicated by list caching in the qrunners because each qrunner may have a cached copy of the list, so it can act as process A in the above scenario with its cached copy playing the role of the initially loaded list. To complicate this further, the qrunners get involved even in the simple scenario with sequential commands because add_members, remove_members and CGIs result in notices being sent, and the qrunner processes that send the notices are running concurrently. This is why the stress test will fail even though commands are run sequentially.
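A toy re-enactment of that scenario (made-up helpers and numbers, nothing to do with Mailman's real classes):

    # Toy re-enactment of the lost-update scenario -- made-up helpers, not
    # Mailman code.  Timestamps are whole seconds, like the config.pck mtime.
    on_disk = {'members': ['old@example.com'], 'saved_at': 1000}

    def load(disk):
        # Return a private copy plus the timestamp it was loaded against.
        return dict(disk), disk['saved_at']

    def save(disk, copy, now):
        disk.update(copy)
        disk['saved_at'] = now

    # t=1000.2: process A (e.g. a qrunner) loads the list unlocked.
    copy_a, stamp_a = load(on_disk)

    # t=1000.6: process B (e.g. add_members) locks, subscribes and saves.
    copy_b, _ = load(on_disk)
    copy_b['members'] = copy_b['members'] + ['new@example.com']
    save(on_disk, copy_b, 1000)

    # t=1000.9: process A gets the lock.  The file's timestamp (1000) is not
    # newer than A's own stamp (1000), so A skips the reload ...
    if on_disk['saved_at'] > stamp_a:
        copy_a, stamp_a = load(on_disk)     # not taken in this run

    # ... and A's save silently reverses B's subscribe.
    save(on_disk, copy_a, 1000)
    assert 'new@example.com' not in on_disk['members']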
[...]
I applied the patch but it doesn't seem to have made a difference.
As you later report, restarting the qrunners did seem to fix it.
[...]
So you are adding 'sleep' commands after each add_members? I'm not sure what you're doing. Is there a different test elsewhere in the thread?
I have used a couple of tests as attached. They are the same except for list order and are very similar to the one in the original thread. Note that they contain only one sleep after all the add_members just to allow things to settle before running list_members.
Actually, I don't think the issue is the busy server. I think it is more likely that NFS causes timing issues between add_members and VirginRunner and OutgoingRunner that just make the bug more likely to trigger.
--
Mark Sapiro <mark@msapiro.net>          The highway is for gamblers,
San Francisco Bay Area, California      better use your sense - B. Dylan

[...]
Thank you for that explanation. I was confused about when the qrunners cache and/or update these config.pck files and when the add/remove_members commands do, and there seemed to be some sort of conflict between the two.
[...]
Yes I was. Without a sleep in between add_members calls, it was failing for ~50% of the calls to add_members. With a 5-second sleep it would tend to work most of the time.
I'm not sure what you're doing. Is there a different test elsewhere in the thread?
See my updated stress test that I sent you in my last email.
That makes sense.
I think you hit the nail on the head here. It explains a lot.
Thanks,
-- Drew

On 9/29/2011 12:06 AM, Andrew Case wrote:
As long as Mailman's locks directory is a single NFS shared directory, there should be no problem. The problem you have is due to the bug, and the patch should fix it.
--
Mark Sapiro <mark@msapiro.net>          The highway is for gamblers,
San Francisco Bay Area, California      better use your sense - B. Dylan
