[Mailman-Developers] Faulty Member Subscribe/Unsubscribes

Andrew Case acase at cims.nyu.edu
Thu Sep 29 17:58:54 CEST 2011


Mark,

I don't know if you'll find this worth trying to fix, but I revised the
stress test this morning to be a bit more stressful.  The biggest change I
made (besides upping the number of subscribers) was that instead of adding
each member to all the lists before adding the second member, it now adds
all the members to one list successively before moving on the the next
list.  This causes a much lower amount of time between subscriptions per
list, causing the config.pck files to be updated in a much smaller
timeframe.

When I did this I saw that maybe 1-5% of the time a user was still omitted
from the list (user was silently removed).  I think that because these are
processed by the queue runner on a different host and because the
timestamp check is being done on an NFS stored file, there is potential
that the qrunner for this doesn't yet have an updated mtime for that file
(or even a small ntp time drift could cause this).  When I commented out
the caching part of the code in MailList.py this bug never seems to show
up:
            #if mtime < self.__timestamp:
            #    # File is not newer
            #    return None, None

So I think there may still be a race condition here, but the chances of it
are unlikely that human interaction would trigger this.  If however, you
have a script that is subscribing users (one after another), this could
still come up.  I actually happen to have such a script, but I run it on
the same host as the qrunners, so I haven't experienced this before.

In my case I think it's probably not worth keeping the performance gain
that the caching adds for sake of consistency.

Attached is the modified stress test I'm using.

Thanks again,

--
Drew


On Thu, September 29, 2011 4:13 am, Andrew Case wrote:
> Mark,
>
> I realized I hadn't restarted my QRunners after this patch.  It looks like
> its working perfectly now!  Even with a sleep of 0.  Thanks so much!
>
> --
> Drew
>
> On Thu, September 29, 2011 2:52 am, Andrew Case wrote:
>> Thanks Mark, see inline comments.
>>
>>>> [mailman at myhost] ~/logs |> grep testlist subscribe | grep acase
>>>> Sep 28 17:15:14 2011 (4401) testlist: new acase at example.com, admin
>>>> mass
>>>> sub
>>>> Sep 28 17:19:36 2011 (5821) testlist: deleted acase at example.com;
>>>> member
>>>> mgt page
>>>> [mailman at myhost] ~/logs |> ../bin/list_members testlist | grep acase
>>>> acase at example.com
>>>> [mailman at myhost] ~/logs |>
>>>
>>>
>>> There is a bug in the Mailman 2.1 branch, but the above is not it. The
>>> above log shows that acase at example.com was added by admin mass
>>> subscribe
>>> at 17:15:14 and then a bit more than 4 minutes later, was removed by
>>> checking the unsub box on the admin Membership List and submitting.
>>
>> I was trying to show that even after the user was removed, they're still
>> listed as a member.
>>
>>> If you check your web server logs, you will find POST transactions to
>>> the admin page for both these events.
>>
>> Agreed.
>>
>>>> The same also happens when subscribing.  I will mass subscribe users
>>>> (or
>>>> when users confirm subscription via email/web), the logs indicated
>>>> that
>>>> they have been subscribed successfully, but then when I go look them
>>>> up,
>>>> they are not listed on the members list.
>>>>
>>>> This happens sporadically, but I am generally able to reproduce the
>>>> error
>>>> if I do it a couple times in a row.
>>>
>>>
>>> This is possibly a manifestation of the bug, but I'm surprised it is
>>> happening that frequently.
>>
>> Easiest way for me to replicated the problem is:
>> 1) check the unsubscribe box for user A then hit submit
>> 2) after reload check the unsubscribe box for user B then hit submit
>> 3) reaload the "membership list" page and user B is back on the list
>>
>> This happens even after I wait a couple seconds in between each step.
>>
>>>> I'm suspicious there may be a locking issue and config.pck is
>>>> reverting
>>>> to
>>>> config.pck.last.  I found this thread rather helpful in analyzing
>>>> potential problems, but I have yet to figure anything out:
>>>>   http://web.archiveorange.com/archive/v/IezAOgEQf7xEYCSEJTbD
>>>
>>>
>>> The thread you point to above is relevant, but it is not a locking
>>> issue. The problem is due to list caching in Mailman/Queue/Runner.py
>>> and/or nearly concurrent processes which first load the list unlocked
>>> and later lock it. The issue is that the resolution of the config.pck
>>> timestamp is 1 second, and if a process has a list object and that list
>>> object is updated by another process within the same second as the
>>> timestamp on the first process's object, the first process won't load
>>> the updated list when it locks it. This can result in things like a
>>> subscribe being done and logged and then silently reversed.
>>
>> The result sounds the same, but would this happen even if I'm loading
>> the
>> page with more than a second in between each step outlined above?
>>
>>> List locking is working as it should. The issue is that the first
>>> process doesn't reload the updated list when it acquires the lock
>>> because it thinks it already has the latest version.
>>>
>>> I thought I had fixed this on the 2.1 branch, but it seems I only fixed
>>> it for the now defunct 2.2 branch.
>>>
>>> A relevant thread starts at
>>> <http://mail.python.org/pipermail/mailman-users/2008-August/062862.html>
>>> and continues at
>>> <http://mail.python.org/pipermail/mailman-developers/2008-August/020329.html>
>>>
>>> The patch in the attached cache.patch file should fix it.
>>
>> I applied the patch but it doesn't seem to have made a difference.
>>
>>
>>>> In addition if I just run the following commands over and over, then
>>>> the
>>>> bug never seems to come up.  This is part of why I am worrying about
>>>> locking:
>>>>   bin/add_members ...
>>>>   bin/remove_members ...
>>>
>>>
>>> That won't do it. bin/add_members alone will do it, but only if there
>>> is
>>> a nearly concurrent process updating the same list.
>>>
>>>
>>>> Is there a good way to test locking between servers?  I've run the
>>>> tests/test_lockfile.py, but it reports it is OK.
>>>>
>>>> Any and all help would be GREATLY appreciated.  We've been trying to
>>>> triage this bug for weeks and it is terribly disruptive for our users.
>>>
>>>
>>> The post at
>>> <http://mail.python.org/pipermail/mailman-users/2008-August/062862.html>
>>> contains a "stress test" that will probably reproduce the problem.
>>
>> Correct.  Only one subscriber was subscribed to each test list.  Keep in
>> mind that in the stress test given if you use a sleep counter of 5 with
>> 6
>> lists, that means you're waiting _30 seconds_ before the next add_member
>> command is run for that list (I'm assume the timing issue is per-list,
>> not
>> per run of add_members).  Even if you set the timer down to 1 that's a 6
>> second sleep.  This shouldn't effect a cache that we're comparing for
>> the
>> given second.  Anyway, my script ran fine with the 5 second sleep (30
>> seconds per list add), but showed discrepancies with a 3 second sleep.
>>
>>> I suspect your Mailman server must be very busy for you to see this bug
>>> that frequently. However, it looks like I need to install the fix for
>>> Mailman 2.1.15.
>>
>> We run about 600 different mailing lists for our department and this has
>> been a continues headache.  I appreciate all the hard work you guys do.
>>
>>> It is also curious that the only reports of this that I can recall both
>>> come from solaris users. There may be complications in your case due to
>>> NFS, but locking shouldn't be the issue. Run the stress test and see if
>>> it fails. If it does, try the patch.
>>
>> Patch didn't seem to help.  Is there an easy way to omit the caching in
>> this?
>>
>> Thanks,
>> --
>> Drew
>>
>>>
>>> Let us know what happens.
>>>
>>> --
>>> Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
>>> San Francisco Bay Area, California    better use your sense - B. Dylan
>>>
>>>
>>
>>
>
>
> Andrew Case
> Systems Administrator
> Courant Institute of Mathematical Sciences
> New York University
> 251 Mercer St., Room 1023
> New York, NY 10012-1110
> Phone: 212-998-3147
>
>


Andrew Case
Systems Administrator
Courant Institute of Mathematical Sciences
New York University
251 Mercer St., Room 1023
New York, NY 10012-1110
Phone: 212-998-3147
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test_subscribe
Type: application/octet-stream
Size: 759 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/mailman-developers/attachments/20110929/1b233f3d/attachment.obj>


More information about the Mailman-Developers mailing list