[Mailman-Developers] Improving the archives
jeff at jab.org
Tue Jul 24 08:02:46 CEST 2007
> Notice that of 325146 total messages, 624 of them had no message-id
> header. Even if you aggregate dup+col, you're still looking at a
> total duplicate rate of 0.29%.
Message-IDs are supposed to be unique. This is discussed in
RFC 822 (section 4.6.1) and RFC 1036 (section 2.1.5), and probably other places.
If that's not the case, the mail transfer agent is broken. I think it's
better to go ahead and use the Message-ID, rather than concoct
yet another "this time we mean it!" unique identifier. This is a
cost/benefit thing: the cost is some real-world collisions, the benefit
is a conceptually simpler system. Conceptually simpler things are
good, especially when implemented all over the place.
Which brings me to suggestion #2, which is to go ahead and write
an RFC on how list servers should embed archival links in messages.
This sounds like an Internet-wide interoperability issue as much as
something Mailman-specific. Why not come up with a scheme usable
by all list servers, and also describe a specification that third-party
archival services can comply with? Besides, I've always wanted to help write
an RFC. If we go that route, it would be good to get input from a range
of people; one person I'd suggest is Earl Hood, author of MHonArc.
> While I'm almost tempted to ignore a
> hit rate that low, if you think of an archive holding 1B messages,
> you still get a lot of duplicates.
> OTOH, the rate goes down even lower if you consider the message-id
> and date headers. (Note, I did not consider messages missing a date
> header). How likely is it that two messages with the same message-id
> and date are /not/ duplicates? Heck, at that point, I'd feel
> justified in simply automatically rejecting the duplicate and
> chucking it from the archive.
> I spent a /little/ time looking at the physical messages that ended
> up as true collisions. Though by no means did I look at them all,
> they all looked related. For example, with strategy 2 some messages
> look like they'd been inadvertently sent before they were completed.
> I need to see if there are any similarities in the MUAs behind these, but
> again, I think we might be able to safely assume that collisions on
> message-id+date can be ignored.
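
For anyone who wants to repeat this kind of measurement on their own archive, here is a minimal sketch in Python 3 of the counting described above. The function name collision_counts is hypothetical, not something from Mailman, and like the quoted numbers it leaves messages without a Date header out of the Message-ID+Date count.

```python
from collections import Counter
from email import message_from_string


def collision_counts(raw_messages):
    """Return (missing_id, dup_by_id, dup_by_id_and_date) for raw RFC 2822 texts."""
    by_id = Counter()
    by_id_and_date = Counter()
    missing_id = 0
    for raw in raw_messages:
        msg = message_from_string(raw)
        mid = msg["Message-ID"]
        if mid is None:
            # No Message-ID header at all; counted separately.
            missing_id += 1
            continue
        by_id[mid] += 1
        date = msg["Date"]
        if date is not None:
            # Messages without a Date are skipped for the combined key.
            by_id_and_date[(mid, date)] += 1
    # Every occurrence beyond the first for a given key is a duplicate.
    dup_by_id = sum(n - 1 for n in by_id.values())
    dup_by_id_and_date = sum(n - 1 for n in by_id_and_date.values())
    return missing_id, dup_by_id, dup_by_id_and_date
```

Dividing either duplicate count by the total number of messages gives the rate being discussed.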
> That leads me to the following proposal, which is just an elaboration
> on Stephen's. First, all messages live in the same namespace; they
> are not divided by target mailing list. Each message has two
> addresses, one is the Message-ID and one is the base32 of the sha1
> hash of the Message-ID + Date. As Stephen proposes, Mailman would
> add these headers if an incoming message is missing them, and tough
> luck for the non-list copy. The nice thing is that RFC 2822 requires
> the Date header and states that Message-ID SHOULD be present.
> Why the second address? First, it provides as close to a guaranteed
> unique identifier as we can expect, and second because it produces a
> nearly human readable format. For example, Stephen's OP would have a
> second address of
> >>> mid
> '<87myycy5eh.fsf at uwakimon.sk.tsukuba.ac.jp>'
> >>> date
> 'Wed, 04 Jul 2007 16:49:58 +0900'
> >>> # XXX perhaps strip off angle brackets
> >>> h = hashlib.sha1(mid)
> >>> h.update(date)
> >>> base64.b32encode(h.digest())
> I like base32 instead of base64 because the more limited alphabet
> should produce less ambiguous strings in certain fonts and I don't
> think the short b64 strings are short enough to justify the
> punctuation characters that would result. While RFC 3548 specifies
> the b32 alphabet as using uppercase characters, I think any service
> that accepts b32 ids should be case insensitive. A really Postel-y
> service could even accept '1' for 'I' and '0' for 'O' just to make it
> more resilient to human communication errors.
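
A minimal sketch of the second address and the forgiving lookup, written for Python 3 (where hashlib needs bytes, unlike the session quoted above). The names global_id and normalize_global_id, the UTF-8 encoding, and the example.org values are my assumptions for illustration only.

```python
import base64
import hashlib


def global_id(message_id: str, date: str) -> str:
    """Base32 of the SHA-1 hash of the Message-ID followed by the Date."""
    h = hashlib.sha1(message_id.encode("utf-8"))
    h.update(date.encode("utf-8"))
    # A SHA-1 digest is 20 bytes, which base32 encodes to 32 characters
    # with no '=' padding.
    return base64.b32encode(h.digest()).decode("ascii")


def normalize_global_id(text: str) -> str:
    """Accept lowercase input and read '1' as 'I' and '0' as 'O'.

    The RFC 3548 base32 alphabet is A-Z plus 2-7, so '0' and '1' never
    appear in a valid id and can be remapped safely.
    """
    return text.strip().upper().replace("1", "I").replace("0", "O")


# Hypothetical header values, not taken from a real message.
gid = global_id("<12345.abc@example.org>", "Wed, 04 Jul 2007 16:49:58 +0900")
```

A service would call normalize_global_id on whatever the user typed or pasted before looking the message up.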
> I'd like to come up with a good name for this second address, which
> would suggest the name of the X- header we stash this value in.
> X-B32-Message-ID isn't very sexy. Maybe X-Message-Global-ID, since I
> think there's a reasonable argument to make that for well-behaved
> messages, that's exactly what this is.
> So now, think of the interface to a message store that supports this
> addressing scheme. Well it's something like:
> class MessageStore(Interface):
>     def store_message(message):
>         """Store the message.
>
>         :raises ValueError: when the message is missing either the
>             Message-ID header or a Date header.
>         :raises DuplicateMessageError: when a message in the store
>             already has a matching Message-ID and Date. An archive is
>             free to raise this exception for duplicate Message-IDs alone.
>         """
>
>     def get_message_by_global_id(key):
>         """Locate and return the message from the store that matches
>         the Global ID `key`.
>
>         :param key: The Global ID of the message to locate. This is the
>             base32 encoded SHA1 hash of the message's Message-ID and Date
>             headers.
>         :returns: The message object matching the Global ID, or None if
>             there is no such match.
>         """
>
>     def get_messages_by_message_id(key):
>         """Return the set of messages with a matching Message-ID `key`.
>
>         :param key: The Message-ID of the messages to locate.
>         :returns: The set of all messages in this store that have the
>             given Message-ID. If none such matches are found, the empty
>             set is returned.
>         """
> As far as generating pages based on the Message-ID or global id, I
> agree with Stephen's proposal. A page returned in response to a
> message-id request could return the message page or it could return
> an index of such messages. It would be up to the archive whether it
> would accept duplicate Message-IDs or not, but it would always be
> guaranteed that a page returned in response to a global id request
> would return one email message.
> URLs could be calculated by concatenating the List-Archive and
> X-Global-Message-ID headers, e.g.
> would be the OP. This could point to the same resource as
> and /might/ point to the same resource as:
> 87myycy5eh.fsf at uwakimon.sk.tsukuba.ac.jp
> 87myycy5eh.fsf at uwakimon.sk.tsukuba.ac.jp
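
The concatenation itself could look something like the sketch below (Python 3). The function name archive_url, the archive.example.org base, and the choice to percent-encode the identifier are my assumptions; the proposal does not pin down an exact URL layout.

```python
from urllib.parse import quote


def archive_url(list_archive: str, identifier: str) -> str:
    """Join a List-Archive header value and a message identifier into a URL."""
    # List-Archive values are normally wrapped in angle brackets (RFC 2369).
    base = list_archive.strip().lstrip("<").rstrip(">").rstrip("/")
    # Percent-encode everything unsafe so a raw Message-ID can be used too.
    return base + "/" + quote(identifier, safe="")


# Hypothetical values for illustration.
by_global_id = archive_url("<http://archive.example.org/mylist/>", "ABCDEFGH234567")
by_message_id = archive_url("<http://archive.example.org/mylist/>", "<a@example.org>")
```

The first form always names one message; the second may resolve to a message or to an index, depending on the archive.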
> > A minor drawback to my proposal is that if a message gets archived as
> > a singleton for that Message-ID, then a duplicate arrives, previously
> > created references in the archive will of course now return an index
> > rather than the desired message. I.e., there is data corruption. This
> > can be dealt with in several ways; the easiest would be to provide a
> > "if-you-got-here-by-clicking-a-ref-from-this-archive-you're-looking-for-me"
> > link when creating the directory for multiple instances.
> Or by using the global id, or by rejecting messages with duplicate
> message ids.
> > There's also a *very* minor benefit: repeat sends will be immediately
> > recognizable without checking Message-ID.
> > Footnotes:
> >  By partly human-readable I mean containing list-id and date
> > information. The idea would be to have the date come first, so that
> > users would have a shot at identifying which of several messages is
> > most likely, and this would be searchable by eye with simply an
> > ordinary sorted index.
> I see searching, indexing, sorting, and providing other human
> readable urls into the message store as a function of the archive.
> Once you're looking at a link to the actual message, you're going to
> be looking at a url that contains the global id, regardless of the
> number of levels you have to go through or redirects involved.
> Apologies for letting this thread linger so long. I'm very
> interested in hearing your thoughts, and if there's general
> agreement, I'll write it up in the wiki.
> -Barry