suppress duplicate when posting addressed to list and its alias name
Many of our lists have full addresses in the form of 'foo-bar@baz.org', with corresponding alias names 'bar@baz.org'.
When someone sends a message To: foo-bar@baz.org and also Cc: bar@baz.org, their MUA sends two separate copies to the list, which transmits these duplicates to list members. However, only one copy makes it to the Pipermail HTML archives.
Is there a Mailman way to activate, in the context of delivery, the same duplicate suppression that occurs when archiving? If not, I will have to do this based on Message-IDs on input at the MTA, but I would prefer to do it via the mailing list manager if this is already possible.
Thanks,
Sahil Tandon
Sahil Tandon wrote:
Is there a Mailman way to activate, in the context of delivery, the same duplicate suppression that occurs when archiving?
No. The archiver has intimate knowledge of message-ids because they are used in message threading so it knows when a message has a duplicate message-id. Note that it doesn't actually ignore the duplicate. It creates an nnnnnn.html archive file containing the message and it adds it to the cumulative .mbox file and (I think, I haven't checked) the periodic .txt file; it just doesn't link to it from any of the index files.
If not, I will have to do this based on Message-IDs on input at the MTA, but I would prefer to do it via the mailing list manager if this is already possible.
It would be possible to implement a per-list database of processed message-ids with a custom handler very early in the pipeline, and discard duplicates there. See <http://wiki.list.org/x/l4A9>.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
On Mon, 2012-11-05 at 16:51:21 -0800, Mark Sapiro wrote:
... It would be possible to implement a per-list database of processed message-ids with a custom handler very early in the pipeline, and discard duplicates there. See <http://wiki.list.org/x/l4A9>.
Thanks Mark, this seems like the ideal approach. I'll try to hack something together borrowing from the various handlers (namely AvoidDuplicates.py) that are already in use. If I can understand how Mailman keeps the in-memory dictionary of Message-IDs mentioned in AvoidDuplicates.py, and implement an analogue for our use-case, that would do it. The goal is to check whether a tuple of (message-id, listname) already exists in the dict and, if it does, raise Errors.DiscardMessage; otherwise, add the tuple to the dict and do nothing.
-- Sahil Tandon
Sahil Tandon wrote:
Thanks Mark, this seems like the ideal approach. I'll try to hack something together borrowing from the various handlers (namely AvoidDuplicates.py) that are already in use.
Actually, AvoidDuplicates.py ccould serve as a good example, but it is currently not actually used. It is experimental and is bot included in the default GLOBAL_PIPELINE.
If I can understand how Mailman keeps the in-memory dictionary of Message-IDs mentioned in AvoidDuplicates.py, and implement an analogue for our use-case, that would do it.
The major problem with keeping these data in memory, other than purging "old" entries so that the dictionary doesn't grow too large, is that in-memory data aren't shared between runners; if the incoming queue is sliced, the multiple copies of IncomingRunner do not have access to each other's data.
In your case, the input to the hash on which runners are sliced includes all the message headers and the listname so it is likely that the "equivalent but different" listname messages will be in different slices of the hash space.
This is not a concern if IncomingRunner is not sliced. It is also not a concern with a disk-based cache, as long as buffers are flushed after writing, because IncomingRunner locks the list whose message is being processed, which should prevent race conditions between different slices of IncomingRunner.
The goal is to check whether a tuple of (message-id, listname) already exists in the dict and, if it does, raise Errors.DiscardMessage; otherwise, add the tuple to the dict and do nothing.
I would make a dictionary keyed on message-id + the canonical listname with value = the time seen. Then I could just check if the key for the current message exists and proceed as above, and I would also have timestamps so I can periodically remove old entries.
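The timestamp-based purge described here might look like the following sketch; `purge_old` and `MAX_AGE` are hypothetical names, and the one-week retention window is an arbitrary choice, not anything Mailman mandates:

```python
import time

MAX_AGE = 7 * 24 * 3600  # assumption: remember Message-IDs for one week

def purge_old(seen, now=None, max_age=MAX_AGE):
    """Drop cache entries whose timestamp is older than max_age seconds.
    Keys are 'message-id + canonical listname' strings; values are the
    time each key was first seen."""
    if now is None:
        now = time.time()
    for key in list(seen):
        if now - seen[key] > max_age:
            del seen[key]
    return seen
```

Calling this on each pass through the handler (or on a timer) keeps the dictionary from growing without bound.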
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
Mark Sapiro wrote:
Actually, AvoidDuplicates.py ccould serve as a good example, but it is currently not actually used. It is experimental and is bot included in the default GLOBAL_PIPELINE.
I seem to be having more than my usual problems with typing/proofreading this morning. The above paragraph should say
Actually, AvoidDuplicates.py could serve as a good example, but it is currently not actually used. It is experimental and is not included in the default GLOBAL_PIPELINE.
More importantly, I was confused. AvoidDuplicates.py does appear in the GLOBAL_PIPELINE, but it does not do what its docstring says it does. All it does in its current form is refrain from sending to list recipients who are explicitly addressed in a To: or Cc: header of the message and who have selected the nodups option for the list.
Thus, there is no "in-memory dictionary of Message-ID: and recipient pairs" and no testing to see if a recipient has already received another copy of the message from Mailman. The intent of the feature described in the docstring is to enable the elimination of sending multiple copies of messages which were cross-posted to multiple lists to a recipient who is a member of more than one of those lists. This feature was never successfully implemented.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
On Tue, 2012-11-06 at 11:26:40 -0800, Mark Sapiro wrote:
Actually, AvoidDuplicates.py ccould serve as a good example, but it is currently not actually used. It is experimental and is bot included in the default GLOBAL_PIPELINE.
As you noted in your follow-up, the docstring does not at all describe what that handler actually does. I learned this when actually stepping through the code. :)
The major problem with keeping these data in memory, other than purging "old" entries so that the dictionary doesn't grow too large, is that in-memory data aren't shared between runners; if the incoming queue is sliced, the multiple copies of IncomingRunner do not have access to each other's data.
In your case, the input to the hash on which runners are sliced includes all the message headers and the listname so it is likely that the "equivalent but different" listname messages will be in different slices of the hash space.
This is not a concern if IncomingRunner is not sliced. It is also not a concern with a disk-based cache, as long as buffers are flushed after writing, because IncomingRunner locks the list whose message is being processed, which should prevent race conditions between different slices of IncomingRunner.
Then, would it make sense (or be overkill) to have the handler populate a dict of key, value = message-id, timestamp? And, store that dict in a pickle whose filename is derived from mlist.internal_name()?
Obviously, this would result in a lot of pickles that are constantly opened, edited (and periodically cleansed), and closed. Is the performance cost/benefit prohibitive? I would also be relying on the fact that a handler is never concurrently called for the same list -- is that understanding accurate? -- which avoids the scenario in which we are trying to simultaneously manipulate the same pickle.
-- Sahil Tandon
Sahil Tandon wrote:
On Tue, 2012-11-06 at 11:26:40 -0800, Mark Sapiro wrote:
In your case, the input to the hash on which runners are sliced includes all the message headers and the listname so it is likely that the "equivalent but different" listname messages will be in different slices of the hash space.
This is not a concern if IncomingRunner is not sliced. It is also not a concern with a disk-based cache, as long as buffers are flushed after writing, because IncomingRunner locks the list whose message is being processed, which should prevent race conditions between different slices of IncomingRunner.
Then, would it make sense (or be overkill) to have the handler populate a dict of key, value = message-id, timestamp? And, store that dict in a pickle whose filename is derived from mlist.internal_name()?
Obviously, this would result in a lot of pickles that are constantly opened, edited (and periodically cleansed), and closed. Is the performance cost/benefit prohibitive?
Whether the cost is prohibitive depends on how many messages per minute, hour, day, etc. you process through the list. I think it could work. The in-memory dictionary would also work as long as you are running with the default single qrunner per queue, except for the rare case where the duplicates are processed one on each side of a restart.
Note that as an implementation, rather than a file name (path) derived from the list's internal_name, I would just use a fixed file name, e.g., message-ids.pck, in the existing lists/internal_name()/ directory.
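A disk-based version of the cache along these lines might be sketched as below. The helper names `load_seen`/`save_seen` are hypothetical, and the explicit flush/fsync reflects the earlier point that buffers must be flushed after writing before the list lock is released:

```python
import os
import pickle

def load_seen(list_dir):
    """Read the per-list cache from message-ids.pck; start empty if the
    file does not exist yet or cannot be unpickled."""
    path = os.path.join(list_dir, 'message-ids.pck')
    try:
        with open(path, 'rb') as fp:
            return pickle.load(fp)
    except (IOError, OSError, EOFError, pickle.UnpicklingError):
        return {}

def save_seen(list_dir, seen):
    """Write the cache back and flush it to disk so that, once the list
    lock is released, other slices of IncomingRunner read fresh data."""
    path = os.path.join(list_dir, 'message-ids.pck')
    with open(path, 'wb') as fp:
        pickle.dump(seen, fp)
        fp.flush()
        os.fsync(fp.fileno())
```

Because IncomingRunner holds the list lock for the whole pipeline run, a load-check-save sequence per message is safe without any extra file locking.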
I would also be relying on the fact that a handler is never concurrently called for the same list -- is that understanding accurate? -- which avoids the scenario in which we are trying to simultaneously manipulate the same pickle.
Yes, that is accurate. IncomingRunner locks the list before processing the pipeline and doesn't unlock it until it's done, so processing of the pipeline for a given message and list is complete before any other runner can begin processing a message for that list.
-- Mark Sapiro <mark@msapiro.net> The highway is for gamblers, San Francisco Bay Area, California better use your sense - B. Dylan
On Tue, 2012-11-06 at 18:02:27 -0800, Mark Sapiro wrote:
Sahil Tandon wrote: ...
Obviously, this would result in a lot of pickles that are constantly opened, edited (and periodically cleansed), and closed. Is the performance cost/benefit prohibitive?
Whether the cost is prohibitive depends on how many messages per minute, hour, day, etc. you process through the list. I think it could work. The in-memory dictionary would also work as long as you are running with the default single qrunner per queue, except for the rare case where the duplicates are processed one on each side of a restart.
Note that as an implementation, rather than a file name (path) derived from the list's internal_name, I would just use a fixed file name, e.g., message-ids.pck, in the existing lists/internal_name()/ directory.
Thanks; this custom handler is now in testing and working well so far. I appreciate your help.
-- Sahil Tandon