If you improve the script or find numbers that lead to different conclusions, now's the time to know!
Live and learn!
So I just looked at 2 million raw messages from 2007, spread over a few thousand mailing lists (all data is from mail-archive.com). My first question was - when comparing only with messages from the same list - how many times do I see a repeated message-id? The answer was ... drumroll please ... 260 thousand. What the hell?
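For anyone who wants to check my numbers, here is a minimal sketch of that kind of count. It is not the script I actually ran. The function name and the (list name, raw text) input shape are made up for illustration, and it assumes that every copy beyond the first within a list counts as one repeat.

```python
from collections import Counter
from email import message_from_string


def count_repeats(raw_messages):
    """Count repeated Message-IDs, comparing only within the same list.

    raw_messages: iterable of (list_name, raw_text) pairs (hypothetical shape).
    """
    seen = Counter()
    for list_name, raw in raw_messages:
        msg = message_from_string(raw)
        mid = msg.get("Message-ID")  # header lookup is case-insensitive
        if mid is None:
            continue  # messages without a Message-ID are skipped here
        seen[(list_name, mid.strip())] += 1
    # Assumption: each copy after the first one is counted as a repeat.
    return sum(n - 1 for n in seen.values() if n > 1)
```

The same message-id showing up on two different lists is not counted, since the comparison is per list.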
Time for a closer look. In some cases, the archiver was getting two copies of every message. For example, the MLM (mailman) was sending out a message to subscriber A and subscriber B, and both paths eventually lead to the archiver.
In another case, the MLM (YahooGroups) spammed 20 copies of the same message to every subscriber, and modified the body of each one. YahooGroups tends to create HTML mail and sticks ads, possibly spyware, and who knows what other crap in message footers.
There are probably other categories I haven't noticed yet; 260k messages is a lot of checking. So you'd think the archives would be a complete mess. But they aren't, and I had no idea anything was remotely amiss under the hood. That's because mhonarc only archives one message per message-id. So those 19 repeats from YahooGroups get thrown away. This is actually a pretty robust strategy when you think about it: it keeps lots of annoyances out of archives, and everyone who gets smited deserves it, whether the cause is accidental duplicates, malicious duplicates, or broken mail transfer agents. Reasonable people can disagree, but I like it.
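To show how little it takes, here is a rough sketch of the one-message-per-message-id idea. This is not mhonarc's code, just an illustration. The function name is invented, and it assumes the first copy to arrive is the one that gets kept.

```python
def archive_first_copy(messages):
    """Keep one message per message-id and drop later copies.

    messages: iterable of (message_id, body) pairs in arrival order.
    Assumption: the first copy wins; later copies are silently discarded.
    """
    archive = {}
    for message_id, body in messages:
        # setdefault only stores the body if this message-id is new
        archive.setdefault(message_id, body)
    return archive
```

Fed the 20 YahooGroups copies, this would archive one entry and drop the other 19, regardless of how each body was modified.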
So I'm amending my request. If mailman and pipermail++ want to keep a verbatim record of everything passing through the MLM, fine. But please make it also possible to interoperate with archivers that use the looser mhonarc strategy, e.g. allow the interoperability URL to collide when message-ids collide. Currently Stephen's proposal allows this, Barry's does not.
Just to make things really concrete, here's an example from that YahooGroups collision I was describing. The 20 messages spammed to subscribers would all have an interoperability URL something like this (but perhaps not quite so enormously long) embedded in the message, in the headers and possibly a footer.
Clicking on it, the user goes to the archive server. For this particular archiver, an HTTP 302 redirect takes the user to another URL which happens to be more human friendly. But the details of what alternate URLs are available - if any - are really up to the archive server.
I think that's about it. I do kind of like Stephen's suggestion of allowing the archiver to supply a formula for the interoperability URL; if that's the case I'd say the RFC2369 headers could be fair game for use in the calculation. That allows cross-posted messages to easily link to their correct archive - note how I used the contents of List-Post when creating the interoperability URL above.
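Here is one way such a formula might look, purely as a sketch. It is neither Stephen's nor Barry's proposal, nor what any archiver does today. The base URL, the SHA-1 hash, and the function name are all hypothetical. The only properties it tries to show are that the URL depends on Message-ID and List-Post and on nothing else in the message.

```python
import hashlib
from email import message_from_string


def interop_url(raw, base="http://archive.example.org/lookup"):
    """Derive a URL from Message-ID and List-Post only (hypothetical formula)."""
    msg = message_from_string(raw)
    mid = (msg.get("Message-ID") or "").strip().strip("<>")
    list_post = (msg.get("List-Post") or "").strip().strip("<>")
    if list_post.lower().startswith("mailto:"):
        list_post = list_post[len("mailto:"):]
    # Body and other headers are ignored, so modified copies still collide.
    key = f"{list_post.lower()}\n{mid}".encode("utf-8")
    return f"{base}/{hashlib.sha1(key).hexdigest()}"
```

Two copies with the same message-id and different bodies get the same URL, which is the collision I'm asking for. A cross-posted message gets a different URL per list, because List-Post differs.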