[Mailman-Developers] Faulty Member Subscribe/Unsubscribes
acase at cims.nyu.edu
Thu Sep 29 08:52:59 CEST 2011
Thanks Mark, see inline comments.
>> [mailman at myhost] ~/logs |> grep testlist subscribe | grep acase
>> Sep 28 17:15:14 2011 (4401) testlist: new acase at example.com, admin mass
>> Sep 28 17:19:36 2011 (5821) testlist: deleted acase at example.com; member
>> mgt page
>> [mailman at myhost] ~/logs |> ../bin/list_members testlist | grep acase
>> acase at example.com
>> [mailman at myhost] ~/logs |>
> There is a bug in the Mailman 2.1 branch, but the above is not it. The
> above log shows that acase at example.com was added by admin mass subscribe
> at 17:15:14 and then a bit more than 4 minutes later, was removed by
> checking the unsub box on the admin Membership List and submitting.
I was trying to show that even after the user was removed, they're still
listed as a member.
> If you check your web server logs, you will find POST transactions to
> the admin page for both these events.
>> The same also happens when subscribing. I will mass subscribe users (or
>> when users confirm subscription via email/web), the logs indicated that
>> they have been subscribed successfully, but then when I go look them up,
>> they are not listed on the members list.
>> This happens sporadically, but I am generally able to reproduce the
>> if I do it a couple times in a row.
> This is possibly a manifestation of the bug, but I'm surprised it is
> happening that frequently.
Easiest way for me to replicated the problem is:
1) check the unsubscribe box for user A then hit submit
2) after reload check the unsubscribe box for user B then hit submit
3) reaload the "membership list" page and user B is back on the list
This happens even after I wait a couple seconds in between each step.
>> I'm suspicious there may be a locking issue and config.pck is reverting
>> config.pck.last. I found this thread rather helpful in analyzing
>> potential problems, but I have yet to figure anything out:
> The thread you point to above is relevant, but it is not a locking
> issue. The problem is due to list caching in Mailman/Queue/Runner.py
> and/or nearly concurrent processes which first load the list unlocked
> and later lock it. The issue is that the resolution of the config.pck
> timestamp is 1 second, and if a process has a list object and that list
> object is updated by another process within the same second as the
> timestamp on the first process's object, the first process won't load
> the updated list when it locks it. This can result in things like a
> subscribe being done and logged and then silently reversed.
The result sounds the same, but would this happen even if I'm loading the
page with more than a second in between each step outlined above?
> List locking is working as it should. The issue is that the first
> process doesn't reload the updated list when it acquires the lock
> because it thinks it already has the latest version.
> I thought I had fixed this on the 2.1 branch, but it seems I only fixed
> it for the now defunct 2.2 branch.
> A relevant thread starts at
> and continues at
> The patch in the attached cache.patch file should fix it.
I applied the patch but it doesn't seem to have made a difference.
>> In addition if I just run the following commands over and over, then the
>> bug never seems to come up. This is part of why I am worrying about
>> bin/add_members ...
>> bin/remove_members ...
> That won't do it. bin/add_members alone will do it, but only if there is
> a nearly concurrent process updating the same list.
>> Is there a good way to test locking between servers? I've run the
>> tests/test_lockfile.py, but it reports it is OK.
>> Any and all help would be GREATLY appreciated. We've been trying to
>> triage this bug for weeks and it is terribly disruptive for our users.
> The post at
> contains a "stress test" that will probably reproduce the problem.
Correct. Only one subscriber was subscribed to each test list. Keep in
mind that in the stress test given if you use a sleep counter of 5 with 6
lists, that means you're waiting _30 seconds_ before the next add_member
command is run for that list (I'm assume the timing issue is per-list, not
per run of add_members). Even if you set the timer down to 1 that's a 6
second sleep. This shouldn't effect a cache that we're comparing for the
given second. Anyway, my script ran fine with the 5 second sleep (30
seconds per list add), but showed discrepancies with a 3 second sleep.
> I suspect your Mailman server must be very busy for you to see this bug
> that frequently. However, it looks like I need to install the fix for
> Mailman 2.1.15.
We run about 600 different mailing lists for our department and this has
been a continues headache. I appreciate all the hard work you guys do.
> It is also curious that the only reports of this that I can recall both
> come from solaris users. There may be complications in your case due to
> NFS, but locking shouldn't be the issue. Run the stress test and see if
> it fails. If it does, try the patch.
Patch didn't seem to help. Is there an easy way to omit the caching in this?
> Let us know what happens.
> Mark Sapiro <mark at msapiro.net> The highway is for gamblers,
> San Francisco Bay Area, California better use your sense - B. Dylan
More information about the Mailman-Developers