[Mailman-Developers] Faulty Member Subscribe/Unsubscribes

Andrew Case acase at cims.nyu.edu
Thu Sep 29 20:36:26 CEST 2011


Hey Mark,

Take a look at these results that are with the 2 patches you've sent me
(and qrunners restarted).  I'm adding 10 users to 6 lists:

[... creating list output cut ...]

 + Subscribing to testlist1
Subscribed: foo1 at www.cs.nyu.edu
Subscribed: foo2 at www.cs.nyu.edu
Subscribed: foo3 at www.cs.nyu.edu
Subscribed: foo4 at www.cs.nyu.edu
Subscribed: foo5 at www.cs.nyu.edu
Subscribed: foo6 at www.cs.nyu.edu
Subscribed: foo7 at www.cs.nyu.edu
Subscribed: foo8 at www.cs.nyu.edu
Subscribed: foo9 at www.cs.nyu.edu
Subscribed: foo10 at www.cs.nyu.edu
 + Subscribing to testlist2
Subscribed: foo1 at www.cs.nyu.edu
Subscribed: foo2 at www.cs.nyu.edu
Subscribed: foo3 at www.cs.nyu.edu
Traceback (most recent call last):
  File "/usr/mailman/bin/add_members", line 258, in <module>
    main()
  File "/usr/mailman/bin/add_members", line 238, in main
    addall(mlist, nmembers, 0, send_welcome_msg, s)
  File "/usr/mailman/bin/add_members", line 135, in addall
    mlist.ApprovedAddMember(userdesc, ack, 0)
  File "/usr/mailman/Mailman/MailList.py", line 948, in ApprovedAddMember
    assert self.Locked()
AssertionError
Subscribed: foo5 at www.cs.nyu.edu
Subscribed: foo6 at www.cs.nyu.edu
Subscribed: foo7 at www.cs.nyu.edu
Subscribed: foo8 at www.cs.nyu.edu
Subscribed: foo9 at www.cs.nyu.edu
Subscribed: foo10 at www.cs.nyu.edu
 + Subscribing to testlist3
Subscribed: foo1 at www.cs.nyu.edu

[... subscribing users to all other lists went fine ...]


 + Subscribers for testlist1:
foo10 at www.cs.nyu.edu
foo1 at www.cs.nyu.edu
foo2 at www.cs.nyu.edu
foo3 at www.cs.nyu.edu
foo4 at www.cs.nyu.edu
foo5 at www.cs.nyu.edu
foo6 at www.cs.nyu.edu
foo7 at www.cs.nyu.edu
foo8 at www.cs.nyu.edu
foo9 at www.cs.nyu.edu
 + Removing list testlist1
 + Subscribers for testlist2:
foo1 at www.cs.nyu.edu
foo2 at www.cs.nyu.edu
foo3 at www.cs.nyu.edu
foo5 at www.cs.nyu.edu
foo6 at www.cs.nyu.edu
foo7 at www.cs.nyu.edu
foo8 at www.cs.nyu.edu
foo9 at www.cs.nyu.edu
 + Removing list testlist2

[... the rest were all fine ...]



There was a locking issue with testlist2 foo4, which is fine since it
doesn't report back as successful.  But you'll also notice that foo10
wasn't listed as a subscriber even though it appeared as though that
subscribe was successful.


Here's some errors on the very next run where I'm subscribing 10 people to
each list as well:

[... cut expected results ...]
 + Subscribers for testlist4:
foo10 at www.cs.nyu.edu
foo1 at www.cs.nyu.edu
foo2 at www.cs.nyu.edu
foo3 at www.cs.nyu.edu
foo4 at www.cs.nyu.edu
foo5 at www.cs.nyu.edu
foo6 at www.cs.nyu.edu [** no foo8 **]
foo7 at www.cs.nyu.edu
foo9 at www.cs.nyu.edu
 + Removing list testlist4
 + Subscribers for testlist5:
foo10 at www.cs.nyu.edu
foo1 at www.cs.nyu.edu
foo2 at www.cs.nyu.edu
foo3 at www.cs.nyu.edu
foo4 at www.cs.nyu.edu
foo5 at www.cs.nyu.edu
foo6 at www.cs.nyu.edu
foo7 at www.cs.nyu.edu [** no foo8 or foo9 **]
 + Removing list testlist5
[... cut expected results ...]


That's my (1-5%) failure.  But when I also comment out the following:

>>             #if mtime < self.__timestamp:
>>             #    # File is not newer
>>             #    return None, None

It seems to work each time (I ran 3 tests in a row, all with expected
results).

Let me know what you think.

Thanks,

--
Drew


On Thu, September 29, 2011 1:30 pm, Mark Sapiro wrote:
> On 9/29/2011 8:58 AM, Andrew Case wrote:
>>
>> When I did this I saw that maybe 1-5% of the time a user was still
>> omitted
>> from the list (user was silently removed).  I think that because these
>> are
>> processed by the queue runner on a different host and because the
>> timestamp check is being done on an NFS stored file, there is potential
>> that the qrunner for this doesn't yet have an updated mtime for that
>> file
>> (or even a small ntp time drift could cause this).  When I commented out
>> the caching part of the code in MailList.py this bug never seems to show
>> up:
>>             #if mtime < self.__timestamp:
>>             #    # File is not newer
>>             #    return None, None
>
>
> Actually, that is not the cache. It is just the test for whether the
> current list object, cached or whatever, needs to be reloaded from disk.
>
> I think that your configuration with NFS and possible time jitter
> between servers makes the bug more likely.
>
>
>> So I think there may still be a race condition here, but the chances of
>> it
>> are unlikely that human interaction would trigger this.  If however, you
>> have a script that is subscribing users (one after another), this could
>> still come up.  I actually happen to have such a script, but I run it on
>> the same host as the qrunners, so I haven't experienced this before.
>
>
> It can happen even where everything is on a single host, but as I said,
> I think your configuration makes it more likely.
>
>
>> In my case I think it's probably not worth keeping the performance gain
>> that the caching adds for sake of consistency.
>
>
> Attached is a patch to remove list caching from the qrunners. This patch
> has the additional advantage of limiting the growth of the qrunners over
> time. Old entries were supposed to be freed from the cache, but a self
> reference in the default MemberAdaptor prevented this from occurring.
>
> For reasons of trying not to be disruptive this patch and the bug fix I
> sent earlier were never applied to the 2.1 branch. I think this was a
> mistake, and I will apply them for Mailman 2.1.15.
>
>
>> Attached is the modified stress test I'm using.
>
>
> Thanks.
>
> --
> Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
> San Francisco Bay Area, California    better use your sense - B. Dylan
>
>


Andrew Case
Systems Administrator
Courant Institute of Mathematical Sciences
New York University
251 Mercer St., Room 1023
New York, NY 10012-1110
Phone: 212-998-3147




More information about the Mailman-Developers mailing list