[Mailman-Developers] Improving the archives

Fri Jul 20 14:02:34 CEST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 4, 2007, at 3:49 AM, Stephen J. Turnbull wrote:

>> Along with that, I would really like to come up with an algorithm for
>> calculating those urls without talking to the archiver.
>
> Brad didn't like this when I suggested it before, but I didn't really
> understand why not.  Anyway, FWIW:
>
> I suggest adding an X-List-Received-ID header to all messages.  I
> haven't really thought through whether the UUID in that field should
> be at least partly human-readable or not, but that doesn't matter for
> the basic idea.[1]  The on-disk directory format would be
>
> /path-to-archive/private/my-list/Message-ID
>
> for singletons (Message-ID is the author-supplied ID) and
>
> /path-to-archive/private/my-list/Message-ID/List-Received-ID
>
> for multiples.  These would be created on-the-fly when they occur.
> They can be served as static pages.  For almost all messages, the bare
> URL
>
> http://archives.example.com/my-list/Message-ID
>
> should Just Work (ie, return a no-such-object result or a single
> message).  Where it does not, you get an index of all pages with that
> message ID.

I think this suggestion has merit, but I'm going to riff on it a bit.

First, I want to avoid talking about file system layout.  To me,  
that's an implementation detail we needn't worry about right now.   
Maybe the files will live on disk, maybe they'll live in a database,  
maybe they'll live in an external system we don't control.  I don't  
care.  What I want is a uniform way to calculate an address for a  
message given nothing but its text and an interface for retrieving  
messages from a service given that address.  I'm thinking about this  
in a RESTful way, and it's perfectly legitimate for that 'message  
address' to be relative to some archive or message store root.

I've done some experiments.  I took the top 5 mbox files on  
python.org and ran them through a script that looked for message-id  
collisions.  Then I implemented 6 strategies for looking at whether  
the collisions were true collisions or duplicates.    Duplicates are  
defined where every message in the same message-id bucket has the  
same match criteria, and collisions are where at least one message in  
the bucket is different.  So for example, with strategy 2, if the  
message-id and date headers are the same for every message in the  
bucket, it's a dupe, otherwise it's a collision.

While I ran the script over each mbox separately, I think it's more  
interesting to talk about them as a whole collection.  I don't really  
know how representative this would be of the world at large, but it's  
interesting anyway.  FTR, the lists were mailman-users, python-dev,  
python-help, python-list, and tutor.  I think there would be little  
intentional cross-posting between these lists.  Here are the numbers:

total 325146, missing: 624
1. msg.as_string(), dup: 34 (0.0104568409268%), col: 914  
(0.281104488445%)
2. message-id + date, dup: 875 (0.269109876794%), col: 73  
(0.0224514525782%)
3. message-id + 1st received, dup: 270 (0.0830396191249%), col: 678  
(0.208521710247%)
4. message-id + all received, dup: 270 (0.0830396191249%), col: 678  
(0.208521710247%)
5. message-id + date + 1st received, dup: 268 (0.0824245108351%),  
col: 680 (0.209136818537%)
6. body_line_iterator(msg), dup: 659 (0.202678181494%), col: 289  
(0.0888831478782%)

Notice that of 325146 total messages, 624 of them had no message-id  
header.  Even if you aggregate dup+col, you're still looking at a  
total duplicate rate of 0.29%.  While I'm almost tempted to ignore a  
hit rate that low, if you think of an archive holding 1B messages,  
you still get a lot of duplicates.

OTOH, the rate goes down even lower if you consider the message-id  
and date headers.  (Note, I did not consider messages missing a date  
header).  How likely is it that two messages with the same message-id  
and date are /not/ duplicates?  Heck, at that point, I'd feel  
justified in simply automatically rejecting the duplicate and  
chucking it from the archive.

I spent a /little/ time looking at the physical messages that ended  
up as true collisions.  Though by no means did I look at them all,  
they all looked related.  For example, with strategy 2 some messages  
look like they'd been inadvertently sent before they were completed.   
I need to see if there's any similarities in MUA behind these, but  
again, I think we might be able to safely assume that collisions on  
message-id+date can be ignored.

That leads me to the following proposal, which is just an elaboration  
on Stephen's. First, all messages live in the same namespace; they  
are not divided by target mailing list.  Each message has two  
addresses, one is the Message-ID and one is the base32 of the sha1  
hash of the Message-ID + Date.  As Stephen proposes, Mailman would  
add these headers if an incoming message is missing them, and tough  
luck for the non-list copy.  The nice thing is that RFC 2822 requires  
the Date header and states that Message-ID SHOULD be present.

Why the second address?  First, it provides as close to a guaranteed  
unique identifier as we can expect, and second because it produces a  
nearly human readable format.  For example, Stephen's OP would have a  
second address of

 >>> mid
'<87myycy5eh.fsf at uwakimon.sk.tsukuba.ac.jp>'
 >>> date
'Wed, 04 Jul 2007 16:49:58 +0900'
 >>> # XXX perhaps strip off angle brackets
 >>> h = hashlib.sha1(mid)
 >>> h.update(date)
 >>> base64.b32encode(h.digest())
'RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI'

I like base32 instead of base64 because the more limited alphabet  
should produce less ambiguous strings in certain fonts and I don't  
think the short b64 strings are short enough to justify the  
punctuation characters that would result.  While RFC 3548 specifies  
the b32 alphabet as using uppercase characters, I think any service  
that accepts b32 ids should be case insensitive.  A really Postel-y  
service could even accept '1' for 'I' and '0' for 'O' just to make it  
more resilient to human communication errors.

I'd like to come up with a good name for this second address, which  
would suggest the name of the X- header we stash this value in.  X- 
B32-Message-ID isn't very sexy.  Maybe X-Message-Global-ID, since I  
think there's a reasonable argument to make that for well-behaved  
messages, that's exactly what this is.

So now, think of the interface to a message store that supports this  
addressing scheme.  Well it's something like:

class MessageStore(Interface):
     def store_message(message):
         """Store the message.

         :raises ValueError: when the message is missing either the  
Message-ID
         header or a Date header.
         :raises DuplicateMessageError: when a message in the store  
already has
         a matching Message-ID and Date.  An archive is free to raise  
this exception
         for duplicate Message-IDs alone.
         """

     def get_message_by_global_id(key):
         """Locate and return the message from the store that matches  
`key`.

         :param key: The Global ID of the message to locate.  This is  
the
         base32 encoded SHA1 hash of the message's Message-ID and Date
         headers.
         :returns: The message object matching the Global ID, or None  
if there
         is no such match.
         """

     def get_messages_by_message_id(key):
         """Return the set of messages with a matching Message-ID `key`.

         :param key: The Message-ID of the messages to locate.
         :returns: The set of all messages in this store that have  
the given
         Message-ID.  If none such matches are found, the empty set is
         returned.
         """

As far as generating pages based on the Message-ID or global id, I  
agree with Stephen's proposal.  A page returned in response to a  
message-id request could return the message page or it could return  
an index of such messages.  It would be up to the archive whether it  
would accept duplicate Message-IDs or not, but it would always be  
guaranteed that a page returned in response to a global id request  
would return one email message.

Urls could be calculated by concatenating the List-Archive and X- 
Global-Message-ID headers, e.g.

http://mail.python.org/pipermail/mailman-developers/ 
RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI

would be the OP.  This could point to the same resource as

http://mail.python.org/pipermail/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI
http://mail.python.org/pipermail/global/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI

and /might/ point to the same resource as:

http://mail.python.org/pipermail/mailman-developers/ 
87myycy5eh.fsf at uwakimon.sk.tsukuba.ac.jp
http://mail.python.org/pipermail/mids/ 
87myycy5eh.fsf at uwakimon.sk.tsukuba.ac.jp

> A minor drawback to my proposal is that if a message gets archived as
> a singleton for that Message-ID, then a duplicate arrives, previously
> created references in the archive will of course now return an index
> rather than the desired message.  Ie, there is data corruption.  This
> can be dealt with in several ways; the easiest would be to provide a
> "if-you-got-here-by-clicking-a-ref-from-this-archive-you're-looking- 
> for-me"
> link when creating the directory for multiple instances.

Or by using the global id, or by rejecting messages with duplicate  
message ids.

> There's also a *very* minor benefit: repeat sends will be immediately
> recognizable without checking Message-ID.
>
> Footnotes:
> [1]  By partly human-readable I mean containing list-id and date
> information.  The idea would be to have the date come first, so that
> users would have a shot at identifying which of several messages is
> most likely, and this would be searchable by eye with simply an
> ordinary sorted index.

I see searching, indexing, sorting, and providing other human  
readable urls into the message store as a function of the archive.   
Once you're looking at a link to the actual message, you're going to  
be looking at a url that contains the global id, regardless of the  
number of levels you have to go through or redirects involved.

Apologies for letting this thread linger so long.  I'm very  
interesting in hearing your thoughts and if there's general  
agreement, I'll write it up in the wiki.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqCkWnEjvBPtnXfVAQIRhAP7BkuF5K0xOuie2GBqOOWDarksD5Oy49y9
/2WO+u4xH+BttIt3adHJS+K6ETYcK79c5Rf4uwZk40DqWKK7ay1zkxUn/LGXOJ0o
CoWQG5ZyFUJUTkDXtxEWcZ8kkXaDTTSNz2eCtYgQAXw77A95E1SjV0YBs54bFK3A
Bi9cjrKRDcM=
=pyY6
-----END PGP SIGNATURE-----