
It took all of my Sunday, but I just finished porting Ben Gertzfield's excellent dupe removal patch to Mailman CVS (I also had to learn some Python in the process. I'm starting to believe that Mailman is a conspiracy to get people to learn Python :-p)
In a nutshell, the patch does two things:
it does not send you your list copy if:
- your subscribed email address is already in the headers
- you already received the message through another list (Cc'ed across two or more lists on the same site)
The new "nodupes" setting is really something you probably want as a default on all lists. I also had lists where people wanted notmetoo as a default. Ben's fix for that is a per-list bitfield that you can set, stating which options newly added users get.
As Ben said, this breaks the one patch one functionality rule, but when I ported his work to mailman-cvs, I realized that it didn't make sense to take them apart. However, Barry, if that would stop you from merging #1 in CVS, I could remove it, but I'm not sure why one would want to.
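For the curious, the nodupes side boils down to roughly this (a simplified sketch of the idea, not the actual patch code; drop_explicit_recipients is a made-up name and I'm using the modern email.utils spelling):

    from email.utils import getaddresses

    def drop_explicit_recipients(msg, recipients):
        # Addresses already named explicitly in To: or Cc:
        explicit = set(addr.lower() for _name, addr in
                       getaddresses(msg.get_all('to', []) +
                                    msg.get_all('cc', [])))
        # Only send the list copy to people not already in the headers
        return [r for r in recipients if r.lower() not in explicit]

The real handler would presumably apply this only to members who have turned the nodupes option on.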
I've done reasonable tests to make sure I didn't break all of Mailman in the process, and the core logic hasn't changed, so the basic functionality is the same as what Ben wrote and what has been running for 6-9mo? on the Debian lists now. In other words, it should work (it does for me, and I'm already running it on my production mailman-cvs list server), but there is always the chance that a corner-case buglet is left somewhere.
Considering this was a pain to port, and that it puts to rest many of the reply-to munging discussions (the only real argument for reply-to munging is that it "solves" the duplicate mails you otherwise receive when people use reply-to-all), I'm hoping this could make it in (wink, wink :-D)
Thanks, Marc
Microsoft is to operating systems & security .... .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | Finger marc_f@merlins.org for PGP key

"Marc" == Marc MERLIN <marc_news@vasoftware.com> writes:
Marc> It took all of my sunday, but I just finished porting Ben
Marc> Gertzfield's excellent dupe removal patch to mailman cvs (I
Marc> also had to learn some python in the process. I'm starting
Marc> to believe that Mailman is a conspiracy to get people to
Marc> learn python :-p)
Fantastic, Marc! Sorry I've been lazy and haven't been able to port it. I'll install this patch on our test server at work and let you know if I have any problems tomorrow.
Ben
-- Brought to you by the letters O and T and the number 11. "You forgot Uranus." "Goooooooooodnight everybody!" Debian GNU/Linux maintainer of Gimp and Nethack -- http://www.debian.org/

"MM" == Marc MERLIN <marc_news@vasoftware.com> writes:
MM> Considering this was a pain to port, and how this puts to rest
MM> many of the reply-to munging discussions (the only real
MM> argument for reply-to munging is that it "solves" the
MM> duplicate mails you otherwise receive when people use reply to
MM> all), I'm hoping that this could make it in (wink, wink :-D)
I'm looking at the patches now, and will have some feedback shortly (which might be in the form of checkin messages :).
-Barry

On Mon, Mar 04, 2002 at 02:25:09AM -0800, Marc MERLIN wrote:
Now comes the question: how do I retroactively reset user options for a given list of users without clicking on a web form?
The idea would be to enable nodupes for a batch of users, although I've also had the need to set a batch of users to notmetoo.
I seem to remember reading something about this, but I can't find the message.
Thanks, Marc
Microsoft is to operating systems & security .... .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | Finger marc_f@merlins.org for PGP key

So, was there a way to do that, or do I need to write a wrapper with withlist?
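If there is no existing tool, here is roughly the kind of withlist wrapper I had in mind (an untested sketch; I'm guessing at the member API names, setMemberOption/isMember, and at whatever the new nodupes flag ends up being called in mm_cfg):

    # Usage sketch:  bin/withlist -l -r set_option mylist nodupes addrs.txt
    from Mailman import mm_cfg

    FLAGS = {'nodupes':  mm_cfg.DontReceiveDuplicates,   # guessed name
             'notmetoo': mm_cfg.DontReceiveOwnPosts}

    def set_option(mlist, option, filename):
        flag = FLAGS[option]
        for line in open(filename):
            addr = line.strip()
            if addr and mlist.isMember(addr):
                mlist.setMemberOption(addr, flag, 1)
        mlist.Save()

withlist would call set_option() with the locked list object plus the extra command line arguments, and Save() writes the changes back.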
Thanks, Marc
Microsoft is to operating systems & security .... .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | Finger marc_f@merlins.org for PGP key

"MM" == Marc MERLIN <marc_news@vasoftware.com> writes:
MM> It took all of my sunday, but I just finished porting Ben
MM> Gertzfield's excellent dupe removal patch to mailman cvs (I
MM> also had to learn some python in the process. I'm starting to
MM> believe that Mailman is a conspiracy to get people to learn
MM> python :-p)
Well, of course it is! :)
Okay, I've looked over all the code. Except for some stylistic issues, which I'll just correct as I go, my biggest concern is the database used in AvoidDuplicates.py.
It looks like you're keeping an in-memory dictionary mapping addresses to sets of Message-ID:'s, and you use this to decide whether a recipient address has already received a message with a given Message-ID.
Let's ignore the duplicate or missing Message-ID: issue for now. The biggest problem I see is that 1) you lose all the mappings if you restart your IncomingRunner, and 2) your process will grow without bounds until you do restart your IncomingRunner.
I'm not sure about the best thing to do. Sticking this data structure in the list, or otherwise making it persistent, could take up too many resources for not much gain. The second issue is more important, especially given that all our runners are now long-running processes, and I think most of the unbounded memory growth issues are taken care of. Probably the best thing to do is to evict any entry in the dictionary that's older than a day or two.
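Concretely, I'm thinking of something along these lines (a minimal sketch, not the actual AvoidDuplicates.py code: timestamp each entry and purge stale ones whenever the structure is touched):

    import time

    CACHE_LIFETIME = 2 * 24 * 60 * 60   # a day or two, in seconds

    # recipient address -> {message-id: time first seen}
    _seen = {}

    def already_sent(addr, msgid, now=None):
        if now is None:
            now = time.time()
        # Evict anything older than the lifetime
        for a in list(_seen):
            ids = _seen[a]
            for mid in list(ids):
                if now - ids[mid] > CACHE_LIFETIME:
                    del ids[mid]
            if not ids:
                del _seen[a]
        # Record this pair and report whether we'd already seen it
        ids = _seen.setdefault(addr, {})
        if msgid in ids:
            return True
        ids[msgid] = now
        return False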
Then again, this whole data structure seems intended to avoid duplicates when lists are crossposted. It shouldn't be necessary if we just want to filter out duplicates to explicitly named recipients. Maybe we don't need both features, as the former seems to be much less requested than the latter?
I think what I'll do for now is code up and test the original approach. I'm on IRC now, so please join me if you want to talk about it.
now-if-i-can-just-get-OPN-to-stop-kicking-me-off!-ly y'rs, -Barry

On Mon, Mar 04, 2002 at 05:16:21PM -0500, Barry A. Warsaw wrote:
Yeah, I looked at that too, but being tired, I didn't get as critical as you did. I figured it had somehow worked OK for Ben on his lists (bad Marc, no cookie). Now that I think of it, Ben initially wrote this for qrunner run from cron, and didn't think about this issue when he ported it to mailman-cvs-sept-2001.
That was my understanding of the code too.
That's probably not a problem because:
- it would only affect a message being processed at the very time you kill and restart IncomingRunner, which is not very likely, and in the worst case you get a second copy
- you don't restart IncomingRunner often, if at all
- when you do restart qrunner, there can be other quirks, like a message being delivered twice (I've seen this with VERP enabled; I probably killed it while it was delivering a batch to exim, so it didn't complete and did it all over again after the restart)
I think you're right. You'd have to have a lot of traffic before it catches up with you, but it will eventually if you never restart qrunner.
That sounds like a reasonable plan.
That's true. The latter is nice for instance when you have threads Cc'ed across mailman-devel and mailman-users, but having the former by itself would already be good. If this is a time issue wrt fixing the code, the duplicate Message-ID code could be left behind a global option that is disabled by default, with a comment saying that you should be ready to restart your qrunner weekly or daily if you enable it. That said, adding a timestamp to the entries and deleting everything more than 1h old is a better solution.
<-- barry has quit (Read error: 110 (Connection timed out))
:-)
What I was saying there: I don't know the details about OPN (and there were also politics involved), but the sf.net guys moved their channels to slashnet.org. It seems to work better there.
Marc
Microsoft is to operating systems & security .... .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | Finger marc_f@merlins.org for PGP key

"Marc" == Marc MERLIN <marc_news@vasoftware.com> writes:
Barry> Let's ignore the duplicate or missing Message-ID: issue for
Barry> now. The biggest problem I see is that 1) you lose all the
Barry> mappings if you restart your IncomingRunner ...
Marc> That's probably not a problem because 1) it would only
Marc> affect a message being processed at the time you kill and
Marc> restart IncomingRunner, not very likely, and worst case, you
Marc> do get a second copy. 2) You don't restart IncomingRunner
Marc> often if at all 3) When you do restart qrunner, there can be
Marc> other quirks, like a message being delivered twice (I've
Marc> seen this with VERP enabled, I probably killed it while it
Marc> was delivering a batch to exim, so it didn't complete and
Marc> did it all over again after the restart)
Marc has summed up all my comments here -- the worst thing that can happen if IncomingRunner is restarted is that a duplicate is sent, which is what we do now without the patch.
Barry> 2) your process will grow without bounds until you do
Barry> restart your IncomingRunner.
Marc> I think you're right. You'd have to have a lot of traffic
Marc> before it catches up with you, but it will eventually if you
Marc> never restart qrunner.
Yeah.. I knew about this, but I think my setup had a cron job to restart the runner daily. Not at all an optimal solution, just a hack to re-implement the /etc/aliases style list functionality where a user belonging to multiple umbrella lists only receives one copy of any given mail.
Barry> I'm not sure about the best thing to do. Sticking this
Barry> data structure in the list, or otherwise making it
Barry> persistent, could take too much resources for not much
Barry> gain. The second issue is more important, especially given
Barry> that all our runners are now long running processes, and I
Barry> think most of the unbounded memory growth issues are taken
Barry> care of. Probably the best thing to do is to evict any
Barry> entry in the dictionary that's older than a day or two.
Marc> That sounds like a reasonable plan.
So, is there functionality in the *Runners to run something on a regular schedule? Say, if we clean out the structure once an hour or so, it should work pretty well.
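If there isn't such a hook, one cheap fallback is to keep a module-level timestamp and only do the sweep when it has gone stale, whenever the handler happens to run (a hypothetical sketch; maybe_sweep and the hourly interval are made up, and sweep() stands in for whatever actually purges old entries):

    import time

    SWEEP_INTERVAL = 3600          # clean out roughly once an hour
    _last_sweep = [0]              # module-level "last swept" time

    def maybe_sweep(sweep, now=None):
        # Call sweep() at most once per SWEEP_INTERVAL, no matter how
        # often the handler itself is invoked.
        if now is None:
            now = time.time()
        if now - _last_sweep[0] >= SWEEP_INTERVAL:
            sweep(now)
            _last_sweep[0] = now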
Barry> Then again, this whole data structure seems intended to
Barry> avoid duplicates when lists are crossposted. It shouldn't
Barry> be necessary if we just want to filter out duplicates to
Barry> explicitly named recipients. Maybe we don't need both
Barry> features, as the former seems to be much less requested
Barry> than the latter?
Marc> That's true. The latter is nice for instance when you have
Marc> threads Cc'ed across mailman-devel and mailman-users, but
Marc> having the former by itself would be good already.
I agree, and from what I understand on IRC, this is what ended up happening. I will work on making a separate, proper patch for the in-memory Message-ID cache that has a time to live associated with each entry.
Ben
-- Brought to you by the letters H and X and the number 19. "It is sad. *Campers* cannot *dance*. Not even a *party*." Debian GNU/Linux maintainer of Gimp and Nethack -- http://www.debian.org/