[Mailman-Developers] Faulty Member Subscribe/Unsubscribes

Thu Sep 29 04:48:19 CEST 2011

On 9/28/2011 3:08 PM, Andrew Case wrote:
> My configuration:
>   Mailman: 2.1.14
>   OS: Solaris 10
>   Python: 2.4.5
>   PREFIX = '/usr/mailman'
>   Server setup: 1 server for web management, 1 server for MTA/qrunner. 
> /usr/mailman is NFS mounted on both servers
> 
> 
> I've been having the following issue with my Mailman lists:
> 
> A user is either subscribed or unsubscribed according to the logs, but
> then if I look at the member list, the action has not been done (or has
> been undone).  For example, here is where I remove a subscriber and then
> look at the list members and they are still in the list:
> 
> [mailman at myhost] ~/logs |> grep testlist subscribe | grep acase
> Sep 28 17:15:14 2011 (4401) testlist: new acase at example.com, admin mass sub
> Sep 28 17:19:36 2011 (5821) testlist: deleted acase at example.com; member mgt page
> [mailman at myhost] ~/logs |> ../bin/list_members testlist | grep acase
> acase at example.com
> [mailman at myhost] ~/logs |>

There is a bug in the Mailman 2.1 branch, but the above is not it. The
above log shows that acase at example.com was added by admin mass subscribe
at 17:15:14 and then a bit more than 4 minutes later, was removed by
checking the unsub box on the admin Membership List and submitting.

If you check your web server logs, you will find POST transactions to
the admin page for both these events.
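The mismatch described above, between the subscribe log and the output
of bin/list_members, can also be checked mechanically. The following is
only a rough sketch: the regular expression is inferred from the two log
lines quoted above (with the archive's " at " standing for "@"), the
function names are hypothetical, and other kinds of log entries are
ignored. Members who were subscribed before the log begins will show up
as "unexpected".

```python
import re

# Pattern assumed from the two subscribe-log lines quoted in this thread;
# real logs may contain other entry kinds, which this sketch skips.
LINE = re.compile(
    r"\(\d+\) (?P<list>\S+): (?P<action>new|deleted) (?P<addr>[^,;\s]+)"
)


def expected_members(log_lines, listname):
    """Replay 'new' and 'deleted' entries, in order, for one list."""
    members = set()
    for line in log_lines:
        m = LINE.search(line)
        if not m or m.group("list") != listname:
            continue
        addr = m.group("addr").lower()
        if m.group("action") == "new":
            members.add(addr)
        else:
            members.discard(addr)
    return members


def discrepancies(log_lines, listname, actual_members):
    """Compare what the log implies with what list_members reports."""
    expected = expected_members(log_lines, listname)
    actual = {a.strip().lower() for a in actual_members if a.strip()}
    return {"missing": expected - actual, "unexpected": actual - expected}


# Sample data modelled on the session quoted above.
log = [
    "Sep 28 17:15:14 2011 (4401) testlist: new acase@example.com, admin mass sub",
    "Sep 28 17:19:36 2011 (5821) testlist: deleted acase@example.com; member mgt page",
]
result = discrepancies(log, "testlist", ["acase@example.com"])
print(result)
```

With the sample data, the address appears under "unexpected": the log
says it was deleted, but it is still in the member list.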

> The same also happens when subscribing.  I will mass subscribe users (or
> when users confirm subscription via email/web), the logs indicated that
> they have been subscribed successfully, but then when I go look them up,
> they are not listed on the members list.
> 
> This happens sporadically, but I am generally able to reproduce the error
> if I do it a couple times in a row.

This is possibly a manifestation of the bug, but I'm surprised it is
happening that frequently.

> I'm suspicious there may be a locking issue and config.pck is reverting to
> config.pck.last.  I found this thread rather helpful in analyzing
> potential problems, but I have yet to figure anything out:
>   http://web.archiveorange.com/archive/v/IezAOgEQf7xEYCSEJTbD

The thread you point to above is relevant, but it is not a locking
issue. The problem is due to list caching in Mailman/Queue/Runner.py
and/or nearly concurrent processes which first load the list unlocked
and later lock it. The resolution of the config.pck timestamp is 1
second. If one process holds a list object and another process updates
that list within the same second as the timestamp on the first
process's object, the first process won't load the updated list when
it locks it. This can result in things like a subscribe being done and
logged and then silently reversed.

List locking is working as it should. The issue is that the first
process doesn't reload the updated list when it acquires the lock
because it thinks it already has the latest version.
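The sequence of events can be modelled in a few lines. This is a
simplified illustration only, not Mailman's actual code: the Disk and
CachedList classes, the "runner" and "cgi" names, and the times are all
hypothetical, and the reload check is an assumption standing in for
"skip the reload if the file does not look newer than my copy".

```python
class Disk:
    """Stands in for config.pck: stored members plus a whole-second mtime."""

    def __init__(self):
        self.members = set()
        self.mtime = 0


class CachedList:
    """Stands in for a process that keeps a list object in memory."""

    def __init__(self, disk):
        self.disk = disk
        self.members = set()
        self.timestamp = -1  # mtime seen at the last load or save

    def load(self):
        # Assumed check: skip the reload when the file does not look newer.
        if self.disk.mtime <= self.timestamp:
            return False
        self.members = set(self.disk.members)
        self.timestamp = self.disk.mtime
        return True

    def save(self, now):
        self.disk.members = set(self.members)
        self.disk.mtime = int(now)  # timestamp resolution is one second
        self.timestamp = self.disk.mtime


disk = Disk()
disk.mtime = 100  # hypothetical: file last written at t=100
runner = CachedList(disk)
cgi = CachedList(disk)

# t=100.2: the runner loads the list unlocked and caches it.
runner.load()

# t=100.5: the CGI process subscribes a member and saves.
cgi.load()
cgi.members.add("acase@example.com")
cgi.save(now=100.5)

# t=100.8: the runner locks the list and tries to reload. The file's
# mtime is still 100, so the runner keeps its stale copy.
reloaded = runner.load()

# t=101.3: the runner makes an unrelated change and saves the stale copy.
runner.members.add("other@example.com")
runner.save(now=101.3)

lost = "acase@example.com" not in disk.members
print(reloaded, lost, sorted(disk.members))
```

In this model the lock is never violated. The subscription is lost
because the second load is skipped, which matches the description above.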

I thought I had fixed this on the 2.1 branch, but it seems I only fixed
it for the now defunct 2.2 branch.

A relevant thread starts at
<http://mail.python.org/pipermail/mailman-users/2008-August/062862.html>
and continues at
<http://mail.python.org/pipermail/mailman-developers/2008-August/020329.html>

The patch in the attached cache.patch file should fix it.

> In addition if I just run the following commands over and over, then the
> bug never seems to come up.  This is part of why I am worrying about
> locking:
>   bin/add_members ...
>   bin/remove_members ...

That won't do it. bin/add_members alone will do it, but only if there is
a nearly concurrent process updating the same list.

> Is there a good way to test locking between servers?  I've run the
> tests/test_lockfile.py, but it reports it is OK.
> 
> Any and all help would be GREATLY appreciated.  We've been trying to
> triage this bug for weeks and it is terribly disruptive for our users.

The post at
<http://mail.python.org/pipermail/mailman-users/2008-August/062862.html>
contains a "stress test" that will probably reproduce the problem.

I suspect your Mailman server must be very busy for you to see this bug
that frequently. However, it looks like I need to include the fix in
Mailman 2.1.15.

It is also curious that the only two reports of this that I can recall
both come from Solaris users. There may be complications in your case due to
NFS, but locking shouldn't be the issue. Run the stress test and see if
it fails. If it does, try the patch.

Let us know what happens.

-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: cache.patch
URL: <http://mail.python.org/pipermail/mailman-developers/attachments/20110928/3b4713f1/attachment.ksh>