Notice that of 325146 total messages, 624 of them had no message-id header. Even if you aggregate dup+col, you're still looking at a total duplicate rate of 0.29%.
Message ID's are supposed to be unique. This is discussed in in RFC 822: 4.6.1 and RFC 1036: 2.1.5, and probably other places. If that's not the case, the mail transfer agent is broken. I think it's better to go ahead and use the mesage-id, rather than concoct yet another "this time we mean it!" unique identifier. This is a cost/benefit thing; the cost is some real world collisions, the benefit is a conceptually simpler system. Conceptually simpler things are good especially when implemented all over the place.
Which brings me to suggestion #2, which is go ahead and write an RFC on how list servers should embed archival links in messages. This sounds like an internet wide interoperability issue as much as something mailman specific. Why not come up with a scheme usable by all list servers? And also describe a specification third party archival services can comply to. Besides, I've always wanted to help write an RFC. If we go that route, it would be good to get input from a range of people - one person I'd suggest is Earl Hood, author of mhonarc.
Thoughts?
Jeff
While I'm almost tempted to ignore a
hit rate that low, if you think of an archive holding 1B messages, you still get a lot of duplicates.
OTOH, the rate goes down even lower if you consider the message-id and date headers. (Note, I did not consider messages missing a date header). How likely is it that two messages with the same message-id and date are /not/ duplicates? Heck, at that point, I'd feel justified in simply automatically rejecting the duplicate and chucking it from the archive.
I spent a /little/ time looking at the physical messages that ended up as true collisions. Though by no means did I look at them all, they all looked related. For example, with strategy 2 some messages look like they'd been inadvertently sent before they were completed. I need to see if there's any similarities in MUA behind these, but again, I think we might be able to safely assume that collisions on message-id+date can be ignored.
That leads me to the following proposal, which is just an elaboration on Stephen's. First, all messages live in the same namespace; they are not divided by target mailing list. Each message has two addresses, one is the Message-ID and one is the base32 of the sha1 hash of the Message-ID + Date. As Stephen proposes, Mailman would add these headers if an incoming message is missing them, and tough luck for the non-list copy. The nice thing is that RFC 2822 requires the Date header and states that Message-ID SHOULD be present.
Why the second address? First, it provides as close to a guaranteed unique identifier as we can expect, and second because it produces a nearly human readable format. For example, Stephen's OP would have a second address of
mid '87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp' date 'Wed, 04 Jul 2007 16:49:58 +0900' # XXX perhaps strip off angle brackets h = hashlib.sha1(mid) h.update(date) base64.b32encode(h.digest()) 'RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI'
I like base32 instead of base64 because the more limited alphabet should produce less ambiguous strings in certain fonts and I don't think the short b64 strings are short enough to justify the punctuation characters that would result. While RFC 3548 specifies the b32 alphabet as using uppercase characters, I think any service that accepts b32 ids should be case insensitive. A really Postel-y service could even accept '1' for 'I' and '0' for 'O' just to make it more resilient to human communication errors.
I'd like to come up with a good name for this second address, which would suggest the name of the X- header we stash this value in. X- B32-Message-ID isn't very sexy. Maybe X-Message-Global-ID, since I think there's a reasonable argument to make that for well-behaved messages, that's exactly what this is.
So now, think of the interface to a message store that supports this addressing scheme. Well it's something like:
class MessageStore(Interface): def store_message(message): """Store the message.
:raises ValueError: when the message is missing either the
Message-ID header or a Date header. :raises DuplicateMessageError: when a message in the store already has a matching Message-ID and Date. An archive is free to raise this exception for duplicate Message-IDs alone. """
def get_message_by_global_id(key): """Locate and return the message from the store that matches
key
.:param key: The Global ID of the message to locate. This is
the base32 encoded SHA1 hash of the message's Message-ID and Date headers. :returns: The message object matching the Global ID, or None if there is no such match. """
def get_messages_by_message_id(key): """Return the set of messages with a matching Message-ID `key`. :param key: The Message-ID of the messages to locate. :returns: The set of all messages in this store that have
the given Message-ID. If none such matches are found, the empty set is returned. """
As far as generating pages based on the Message-ID or global id, I agree with Stephen's proposal. A page returned in response to a message-id request could return the message page or it could return an index of such messages. It would be up to the archive whether it would accept duplicate Message-IDs or not, but it would always be guaranteed that a page returned in response to a global id request would return one email message.
Urls could be calculated by concatenating the List-Archive and X- Global-Message-ID headers, e.g.
http://mail.python.org/pipermail/mailman-developers/ RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI
would be the OP. This could point to the same resource as
http://mail.python.org/pipermail/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI http://mail.python.org/pipermail/global/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI
and /might/ point to the same resource as:
http://mail.python.org/pipermail/mailman-developers/ 87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp http://mail.python.org/pipermail/mids/ 87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp
A minor drawback to my proposal is that if a message gets archived as a singleton for that Message-ID, then a duplicate arrives, previously created references in the archive will of course now return an index rather than the desired message. Ie, there is data corruption. This can be dealt with in several ways; the easiest would be to provide a "if-you-got-here-by-clicking-a-ref-from-this-archive-you're-looking- for-me" link when creating the directory for multiple instances.
Or by using the global id, or by rejecting messages with duplicate message ids.
There's also a *very* minor benefit: repeat sends will be immediately recognizable without checking Message-ID.
Footnotes: [1] By partly human-readable I mean containing list-id and date information. The idea would be to have the date come first, so that users would have a shot at identifying which of several messages is most likely, and this would be searchable by eye with simply an ordinary sorted index.
I see searching, indexing, sorting, and providing other human readable urls into the message store as a function of the archive. Once you're looking at a link to the actual message, you're going to be looking at a url that contains the global id, regardless of the number of levels you have to go through or redirects involved.
Apologies for letting this thread linger so long. I'm very interesting in hearing your thoughts and if there's general agreement, I'll write it up in the wiki.
- -Barry
-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin)
iQCVAwUBRqCkWnEjvBPtnXfVAQIRhAP7BkuF5K0xOuie2GBqOOWDarksD5Oy49y9 /2WO+u4xH+BttIt3adHJS+K6ETYcK79c5Rf4uwZk40DqWKK7ay1zkxUn/LGXOJ0o CoWQG5ZyFUJUTkDXtxEWcZ8kkXaDTTSNz2eCtYgQAXw77A95E1SjV0YBs54bFK3A Bi9cjrKRDcM= =pyY6 -----END PGP SIGNATURE-----
Mailman-Developers mailing list Mailman-Developers@python.org http://mail.python.org/mailman/listinfo/mailman-developers Mailman FAQ: http://www.python.org/cgi-bin/faqw-mm.py Searchable Archives: http://www.mail-archive.com/mailman-developers%40python.org/ Unsubscribe: http://mail.python.org/mailman/options/mailman-developers/jeff%40jab.org
Security Policy: http://www.python.org/cgi-bin/faqw-mm.py?req=show&file=faq01.027.htp