Search by Message-ID, preserving Cc for direct recipients

I would like to be able to search the archives of a mailman list using the Message-ID, ideally using a stable URL like
http://mid.gmane.org/${message_id}
http://mail-archive.com/search?l=mid&q=${message_id}
but preferably on our own host, as we're not currently mirrored and would rather link to our own archives when referencing an old discussion on the list. Our current archives (e.g., [1]) are searched using htdig, but it doesn't seem to support query by Message-ID. Your wiki page [2] also suggests Swish, MnoGoSearch, and Namazu. Can any of these search by Message-ID, or is our best bet to get indexed by mail-archive.com and direct people there?
Second question: Why are direct recipients dropped from the Cc header of the copy sent via the list? This seems partially addressed in the archives [3], but I think it's important for high-volume lists when people filter conversations based on whether they are a direct recipient. Is there an option somewhere to keep Cc headers intact without changing other behavior?
[1] http://lists.mcs.anl.gov/pipermail/petsc-dev/
[2] http://wiki.list.org/display/DOC/How+do+I+make+the+archives+searchable
[3] http://mail.python.org/pipermail/mailman-developers/2006-May/018777.html

On 05/14/2013 10:17 AM, Jed Brown wrote:
The Message-ID of the post is in the HTML page containing the post, but it is only in an In-Reply-To= fragment of a mailto: URL that isn't indexed in htdig. Also, it's URL encoded so <, > and @ are %3C, %3E and %40 respectively. The actual Message-ID: headers are in the periodic *.txt files.
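To make the encoding concrete, here is a minimal sketch of turning a Message-ID into the percent-encoded form described above, which is the string a search over the HTML pages would have to match. The Message-ID and the function name are made up for illustration, and I am assuming the archive encodes the other characters the same way Python's urllib does.

```python
from urllib.parse import quote, unquote


def encode_message_id(message_id):
    """Percent-encode a Message-ID so that <, > and @ become %3C, %3E and %40."""
    # safe="" makes quote() encode "@" and "/" as well, which it would
    # otherwise leave alone or only partly encode.
    return quote(message_id, safe="")


# Hypothetical Message-ID, for illustration only.
encoded = encode_message_id("<87abc123.fsf@example.org>")
print(encoded)           # %3C87abc123.fsf%40example.org%3E
print(unquote(encoded))  # <87abc123.fsf@example.org>
```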
This leads to a few possibilities: teaching htdig to index the .txt files (may be tricky; I just spent a couple of minutes looking at this and didn't see it), changing the noindex start and end tags in the list's archives/private/LIST/htdig/LIST.conf file so that everything in the HTML files, including the URL-encoded Message-ID, is indexed, or writing a separate CGI search script to search the .txt files for the Message-ID.
Or, use mail-archive.com which is probably simplest.
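For the separate-script option, the core lookup could be as small as the sketch below. It is not a finished CGI script: it assumes the periodic *.txt files sit in one directory and carry a plain "Message-ID:" header line per post, and the function name and directory argument are my own invention. It matches by line prefix only, so a body line that happens to start with "Message-ID:" would also be reported.

```python
import glob
import os


def find_message_id(archive_dir, message_id):
    """Return (file name, line number) pairs for matching Message-ID headers."""
    # Compare without the angle brackets so "<id>" and "id" both work.
    wanted = message_id.strip().strip("<>")
    hits = []
    for path in sorted(glob.glob(os.path.join(archive_dir, "*.txt"))):
        with open(path, encoding="utf-8", errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                # Header names are case-insensitive.
                if line.lower().startswith("message-id:"):
                    value = line.split(":", 1)[1].strip().strip("<>")
                    if value == wanted:
                        hits.append((os.path.basename(path), lineno))
    return hits
```

This only tells you which periodic file holds the post; mapping that to the URL of the corresponding HTML page is a separate step that is not covered here.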
I've learned a lot in the last 7 years ;)
The reason is to keep the Cc: list from growing excessively long in long threads involving many people (see the subsequent post(s) in that thread).
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan

Mark Sapiro <mark@msapiro.net> writes:
Okay, thanks. I'll talk with the others here and decide what to do.
Yeah, I saw that, but I don't care how long the Cc list gets. I would rather allow people to filter aggressively and not worry about missing posts that may be relevant to them. It's common on other lists (evidently not those managed by mailman; vger.kernel.org is a high-profile example) to always Cc, by convention, everyone who is likely to be interested. Asking recipients to write rules in terms of thread ancestry isn't sufficient either: when we later do more work that is somehow related, we might start a new thread and Cc everyone from prior threads that were related. If the list chronically drops Cc, it can be hard to figure out everyone who should be Cc'd on a new topic.
Anyway, can I interpret your response as being that mailman always drops Cc and there is no configuration option?

On 05/15/2013 12:47 PM, Jed Brown wrote:
Anyway, can I interpret your response as being that mailman always drops Cc and there is no configuration option?
I guess that depends on what you call a configuration option.
You could put this in mm_cfg.py
GLOBAL_PIPELINE.remove('AvoidDuplicates')
That would just remove the handler, so every list member who is a direct recipient would receive both the list copy and the direct copy regardless of her avoid-duplicates setting. Alternatively, you could apply the attached patch to Mailman/Handlers/AvoidDuplicates.py, or you could patch the module but name the patched module, say, Mailman/Handlers/MyAvoidDuplicates.py and put
GLOBAL_PIPELINE.insert(GLOBAL_PIPELINE.index('AvoidDuplicates'),
                       'MyAvoidDuplicates')
GLOBAL_PIPELINE.remove('AvoidDuplicates')
in mm_cfg.py. See the FAQ at <http://wiki.list.org/x/l4A9>. Note: the first line is wrapped, but that doesn't matter because of Python's implied line continuation inside parentheses. Also note that this latter method is preferable to simply patching AvoidDuplicates.py, for reasons mentioned in the FAQ.
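To see what the insert-then-remove pair does, it can be tried on an ordinary Python list. The list below is an abbreviated stand-in for illustration, not Mailman's real default pipeline; the point is only that the replacement handler ends up in the slot the original occupied.

```python
# Stand-in for Mailman's GLOBAL_PIPELINE; this short list is illustrative only.
GLOBAL_PIPELINE = ['CalcRecips', 'AvoidDuplicates', 'Cleanse', 'ToOutgoing']

# Put the patched handler directly in front of the original one ...
GLOBAL_PIPELINE.insert(GLOBAL_PIPELINE.index('AvoidDuplicates'),
                       'MyAvoidDuplicates')
# ... then drop the original, leaving the new handler in its position.
GLOBAL_PIPELINE.remove('AvoidDuplicates')

print(GLOBAL_PIPELINE)
# ['CalcRecips', 'MyAvoidDuplicates', 'Cleanse', 'ToOutgoing']
```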
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
