[Mailman-Developers] Improving the archives

Tue Jul 24 18:31:55 CEST 2007

There are three different parties coming to the table. One is
the mail transfer agent of the sender, another is the list server,
and the third is the archive server. Ideally, all three will be happy
campers.

>So we just specify a header to put it in, and subscribers will be able
>to use it, per definition of a canonical URL.

It is the archive server's job to decide what is the "canonical" URL
for a message. There's a good chance these archival URLs will be
served by an HTTP redirect. So let's not use the word canonical. :)

>What complexity?  Mailman just does
>
>  msg['X-List-Archive-Received-ID'] = Email.msgid()

Easy to introduce, harder to deal with. The archival server would now
keep track of both the message-id and the x-list-archive-received-id.
That's two namespaces that almost do the same thing. It's easier
for the archive server to keep track of one namespace than two,
and - most importantly - conceptually simpler.

From the perspective of the assorted list servers, it's easier to
do nothing than to do something. So if they can get by with
just message-id (which is already implemented) and not have to add
x-list-archive-received-id, that's a smoother implementation path.
If we base on message-id, archival servers will be able to
retroactively add support for all their stored messages, even those
that are ten years old. And users holding an old message will be
able to figure out that URL without doing any computational
gymnastics.

Put another way, there's the possibility to reduce the archive
servers' implementation to "search for this message-id", which is
something really useful to have anyway, and therefore likely to
get wider support.
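
To make the "search for this message-id" idea concrete, here is a rough
sketch of how anyone holding a message could derive the lookup URL. The
base URL and the "message-id" query parameter are made up for
illustration; no existing archive server is assumed to work this way.

```python
from urllib.parse import quote

# Hypothetical archive search endpoint, for illustration only.
ARCHIVE_SEARCH_BASE = "http://archive.example.org/search"


def archive_url_for(message_id: str) -> str:
    """Build a lookup URL from a Message-ID header value."""
    # Drop surrounding whitespace and the angle brackets.
    bare = message_id.strip().strip("<>")
    # Percent-encode everything that is not URL-safe, including '@' and '/'.
    return f"{ARCHIVE_SEARCH_BASE}?message-id={quote(bare, safe='')}"
```

Since this needs nothing but the header that is already there, it would
work just as well for a ten-year-old message as for a new one.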

In addition, Barry was talking about concocting a unique
identifier from the Date field and Message-ID. I'm not a big fan of
this idea, because the date field comes from the mail user agent
and is often wildly corrupt, e.g. coming from 100 years in the future.
Very painful if the archive is showing most recent message first.
Therefore an archival server is very likely to determine message date
from the most recent received header (generally from a trusted mail
transfer agent) rather than the date field. From the archive server's
perspective, the best thing to do with the date field is throw it away.
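
As a sketch of what that might look like with Python's standard email
package: take the topmost Received header (the one added last) and parse
the timestamp after its final semicolon. The function name is mine, and
this assumes the topmost hop really is a trusted MTA, which an archive
would have to verify for its own setup.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def archive_date(msg):
    """Return a datetime from the topmost Received header, or None."""
    received = msg.get_all("Received", [])
    if not received:
        return None
    # The topmost Received header is the most recently added one.
    # Its timestamp follows the last ';' in the header value.
    _, sep, stamp = received[0].rpartition(";")
    if not sep:
        return None
    try:
        return parsedate_to_datetime(stamp.strip())
    except (TypeError, ValueError):
        # Unparseable timestamp; the caller decides on a fallback.
        return None
```

Note that the Date field is never consulted here, so a message claiming
to come from the year 2107 still sorts by when it actually arrived.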

So for these reasons, I'd rather stick with message-id and risk
some real-world collisions, instead of introducing another identifier.
If the list server receives a message with no message-id, by all means
create one on the spot.  To me, this feels like the sweet spot in terms
of cost versus benefit. The main thing that bugs me is that message-ids are long,
which makes them awkward to embed in a URL in the footer of a
message.
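
The "create one on the spot" fallback could be as small as the sketch
below, using make_msgid() from Python's standard library. The function
name and the default domain are placeholders of mine, not anything
Mailman does today.

```python
from email.message import Message
from email.utils import make_msgid


def ensure_message_id(msg: Message, domain: str = "lists.example.org") -> str:
    """Return the message's Message-ID, creating one if it is missing."""
    if not (msg["Message-ID"] or "").strip():
        # Remove an empty header if present (no-op when it is absent),
        # then add a freshly generated identifier.
        del msg["Message-ID"]
        msg["Message-ID"] = make_msgid(domain=domain)
    return msg["Message-ID"]
```

An existing message-id is left untouched, so the sender, the list server
and the archive all keep referring to the same identifier.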

Jeff