[Mailman-Developers] Updated dupe removal patch
Barry A. Warsaw
Mon, 4 Mar 2002 17:16:21 -0500
>>>>> "MM" == Marc MERLIN <email@example.com> writes:
MM> It took all of my Sunday, but I just finished porting Ben
MM> Gertzfield's excellent dupe removal patch to Mailman CVS (I
MM> also had to learn some Python in the process. I'm starting to
MM> believe that Mailman is a conspiracy to get people to learn
MM> Python :-p)
Well, of course it is! :)
Okay, I've looked over all the code. Except for some stylistic
issues, which I'll just correct as I go, my biggest concern is the
database used in AvoidDuplicates.py.
It looks like you're keeping an in-memory dictionary mapping
addresses to a set of Message-ID:'s. You use this to decide if the
recipient address has already received a message with the given
Message-ID:.
Let's ignore the duplicate or missing Message-ID: issue for now. The
two biggest problems I see are that 1) you lose all the mappings if you
restart your IncomingRunner, and 2) your process will grow without
bound until you do restart your IncomingRunner.
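
For reference, a minimal sketch of the kind of structure described
above, assuming a plain module-level dictionary. The names `seen` and
`already_received` are hypothetical and are not taken from the patch;
the sketch only illustrates why the mapping is lost on restart and why
it grows for as long as the process runs.

```python
# Hypothetical illustration; not the code from AvoidDuplicates.py.
# Maps a lowercased recipient address to the set of Message-ID: values
# that address has already been sent.
seen = {}


def already_received(address, message_id):
    """Return True if `address` was already sent `message_id`.

    On a first sighting the pair is recorded and False is returned.
    Nothing is ever removed, so `seen` only grows, and it lives purely
    in memory, so it vanishes when the process restarts.
    """
    ids = seen.setdefault(address.lower(), set())
    if message_id in ids:
        return True
    ids.add(message_id)
    return False
```

Every new address or Message-ID: adds an entry and none is ever
dropped, which is the unbounded growth in point 2.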
I'm not sure about the best thing to do. Sticking this data structure
in the list, or otherwise making it persistent, could take too many
resources for not much gain. The second issue is more important,
especially given that all our runners are now long-running processes,
and I think most of the unbounded memory growth issues are taken care
of. Probably the best thing to do is to evict any entry in the
dictionary that's older than a day or two.
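
One way the eviction idea could look, as a rough sketch only. The
class name `SeenCache`, the two-day default, and the injectable
`clock` argument are all assumptions made for illustration; nothing
here is existing Mailman code.

```python
import time

# Assumed cutoff: "a day or two" in the text; two days is a guess.
MAX_AGE = 2 * 24 * 60 * 60  # seconds


class SeenCache:
    """Remember (address, Message-ID:) pairs, forgetting old ones."""

    def __init__(self, max_age=MAX_AGE, clock=time.time):
        self._max_age = max_age
        self._clock = clock  # injectable so the cache can be tested
        self._seen = {}      # (address, message_id) -> time first seen

    def check_and_record(self, address, message_id):
        """Return True for a duplicate; otherwise record it, return False."""
        self.evict()
        key = (address.lower(), message_id)
        if key in self._seen:
            return True
        self._seen[key] = self._clock()
        return False

    def evict(self):
        """Drop every entry first seen more than max_age seconds ago."""
        cutoff = self._clock() - self._max_age
        stale = [k for k, t in self._seen.items() if t < cutoff]
        for k in stale:
            del self._seen[k]

    def __len__(self):
        return len(self._seen)
```

Scanning the whole dictionary on every lookup keeps the sketch short;
a long-running runner would more likely evict on a timer or every N
messages.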
Then again, this whole data structure seems intended to avoid
duplicates when lists are crossposted. It shouldn't be necessary if
we just want to filter out duplicates to explicitly named recipients.
Maybe we don't need both features, as the former seems to be much less
requested than the latter?
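
The simpler feature, dropping list recipients who are already named
explicitly in the message, needs no state at all. A hedged sketch
using the standard `email` package follows; the function name
`drop_explicit_recipients` is hypothetical, and treating only To: and
Cc: as "explicitly named" is an assumption.

```python
from email import message_from_string
from email.utils import getaddresses


def drop_explicit_recipients(msg, recipients):
    """Return `recipients` minus anyone already listed in To: or Cc:.

    Those people get their own copy directly from the sender, so the
    list copy would be a duplicate. Comparison is case-insensitive.
    """
    explicit = set()
    for header in ('to', 'cc'):
        for _name, addr in getaddresses(msg.get_all(header, [])):
            if addr:
                explicit.add(addr.lower())
    return [r for r in recipients if r.lower() not in explicit]


# Usage with a made-up message and member list:
example = message_from_string(
    "To: list@example.com, Alice <alice@example.com>\n"
    "Cc: BOB@example.com\n"
    "Subject: hi\n"
    "\n"
    "body\n"
)
members = ['alice@example.com', 'bob@example.com', 'carol@example.com']
remaining = drop_explicit_recipients(example, members)
```

Because the decision is made per message from its own headers, nothing
has to be remembered between messages, so neither the restart problem
nor the memory growth problem arises.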
I think what I'll do for now is code up and test the original
approach. I'm on irc now so please join me if you want to talk about