[Mailman-Developers] Improving the archives

Wed Jul 4 09:49:58 CEST 2007

Barry Warsaw writes:
 > > - archive links that won't break if the archive is rebuilt
 > 
 > Yes, this is absolutely critical, in fact, I'd put it right at the  
 > top of the list, even more so than a u/i overhaul.  Stable urls, with  
 > backward compatible redirecting links if at all possible, would be  
 > fantastic.

+1.  I've been wanting to do something about this, and have made
proposals (not backed with code, mea maxima culpa) for design.  I would
definitely be happy to help with this, but given time constraints, it
would be nice if somebody else could take the lead.

 > Along with that, I would really like to come up with an algorithm for  
 > calculating those urls without talking to the archiver.

Brad didn't like this when I suggested it before, but I didn't really
understand why not.  Anyway, FWIW:

I suggest adding an X-List-Received-ID header to all messages.  I
haven't really thought through whether the UUID in that field should
be at least partly human-readable or not, but that doesn't matter for
the basic idea.[1]  The on-disk directory format would be

/path-to-archive/private/my-list/Message-ID

for singletons (Message-ID is the author-supplied ID) and

/path-to-archive/private/my-list/Message-ID/List-Received-ID

for multiples.  These would be created on-the-fly when they occur.
They can be served as static pages.  For almost all messages, the bare
URL

http://archives.example.com/my-list/Message-ID

should Just Work (ie, return a no-such-object result or a single
message).  Where it does not, you get an index of all pages with that
message ID.
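
A minimal sketch of the layout and lookup described above, in Python.
The function names, the percent-encoding of the Message-ID, and the
three result labels are assumptions made for illustration; none of
this is existing Mailman code.

```python
import os
from urllib.parse import quote


def message_key(message_id):
    """Turn a Message-ID header value into one safe path component."""
    # Assumption: drop the angle brackets and percent-encode everything
    # else, so '/' and similar characters cannot escape the list directory.
    key = quote(message_id.strip().strip("<>"), safe="")
    if key in ("", ".", ".."):
        raise ValueError("unusable Message-ID: %r" % (message_id,))
    return key


def archive_path(root, list_name, message_id, received_id=None):
    """Build root/private/<list>/<Message-ID>[/<List-Received-ID>]."""
    parts = [root, "private", list_name, message_key(message_id)]
    if received_id is not None:
        parts.append(quote(received_id, safe=""))
    return os.path.join(*parts)


def resolve(root, list_name, message_id):
    """Classify what the bare Message-ID URL would return."""
    path = archive_path(root, list_name, message_id)
    if os.path.isdir(path):
        # Several posts share this Message-ID: list them as an index.
        return ("index", sorted(os.listdir(path)))
    if os.path.isfile(path):
        # The common case: exactly one post with this Message-ID.
        return ("message", path)
    return ("missing", None)
```

Because the path depends only on the list name and the Message-ID, an
MUA or any other tool can compute the URL without asking the archiver.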

The main drawback to using Message IDs that I can see is that broken
MUAs may supply no Message-ID, or the same one repeatedly.  In the
former case, as a last resort Mailman can supply one, but that won't
help people who get a personal copy and want to find the thread.
However, I see no way to help them, anyway, beyond a generic archive
search engine.  In the latter case, you get lots of messages matching the
Message-ID, and while most lists should have *zero* problems, a list
that has any instances of this problem would have many.  Again I can't
see a good way to deal with this other than a general search facility,
as computing a digest of headers or content is hard to do reliably.
Providing an index of matching posts seems like a reasonable approach,
which can be efficiently implemented (eg, as static pages).
Furthermore, the examples I've seen of both in the last few years have
all been either spam or (in the case of duplicate Message-IDs) actual
duplicates due to some mail system problem or itchy user fingers.

A minor drawback to my proposal is that if a message gets archived as
a singleton for that Message-ID, then a duplicate arrives, previously
created references in the archive will of course now return an index
rather than the desired message.  Ie, there is data corruption.  This
can be dealt with in several ways; the easiest would be to provide a
"if-you-got-here-by-clicking-a-ref-from-this-archive-you're-looking-for-me"
link when creating the directory for multiple instances.
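
The promotion from singleton to directory could look roughly like the
following Python sketch.  It assumes every stored message already
carries both a Message-ID and the proposed X-List-Received-ID header;
the store() function and the FIRST-ARCHIVED marker file (standing in
for the "you're-looking-for-me" link) are hypothetical names.

```python
import os
from email import message_from_bytes
from urllib.parse import quote

HEADER = "X-List-Received-ID"      # the header proposed in this post
FIRST_MARKER = "FIRST-ARCHIVED"    # hypothetical marker file name


def store(list_dir, raw):
    """Archive one raw message; return the path it was written to."""
    msg = message_from_bytes(raw)
    # Assumption: Mailman has already supplied any missing Message-ID.
    key = quote(msg["Message-ID"].strip("<> "), safe="")
    rid = quote(msg[HEADER], safe="")
    path = os.path.join(list_dir, key)

    if not os.path.exists(path):
        # First post with this Message-ID: store it as a singleton file.
        with open(path, "wb") as f:
            f.write(raw)
        return path

    if os.path.isfile(path):
        # A duplicate arrived: turn the singleton into a directory.
        # (Not atomic; a real archiver would need locking here.)
        with open(path, "rb") as f:
            first_raw = f.read()
        first_rid = quote(message_from_bytes(first_raw)[HEADER], safe="")
        os.remove(path)
        os.mkdir(path)
        with open(os.path.join(path, first_rid), "wb") as f:
            f.write(first_raw)
        # Record which entry older references in the archive meant.
        with open(os.path.join(path, FIRST_MARKER), "w") as f:
            f.write(first_rid + "\n")

    target = os.path.join(path, rid)
    with open(target, "wb") as f:
        f.write(raw)
    return target
```

The index page for the directory can then read the marker and put the
first-archived message at the top.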

There's also a *very* minor benefit: repeat sends will be immediately
recognizable without checking Message-ID.

Footnotes: 
[1]  By partly human-readable I mean containing list-id and date
information.  The idea would be to have the date come first, so that
users would have a shot at identifying which of several messages is
most likely, and this would be searchable by eye using nothing more
than an ordinary sorted index.
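
One possible shape for such a partly human-readable ID, sketched in
Python.  The exact format (UTC timestamp, then list-id, then a random
suffix) and the name make_received_id() are assumptions for
illustration, not a settled design.

```python
import uuid
from datetime import datetime, timezone


def make_received_id(list_id, received_at=None, unique=None):
    """Build an ID of the form <UTC timestamp>.<list-id>.<unique part>."""
    # Pass a timezone-aware datetime; the default is the current time.
    when = received_at or datetime.now(timezone.utc)
    # Date comes first so a plain string sort is also a chronological sort.
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    # The random suffix keeps IDs distinct within the same second.
    return "%s.%s.%s" % (stamp, list_id, unique or uuid.uuid4().hex)
```

With this format, sorted(os.listdir(...)) on a duplicates directory
yields the index in order of arrival.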