[Mailman-Developers] [Bug 985149] Add List-Post value to permalink hash input

Tue Apr 24 01:31:15 CEST 2012

I want to step back for a moment and look at some fundamentals.  I'll reply to
other messages in this thread later, but in the context of my own thoughts
expressed here.

TL;DR: I'm going to propose that the Message-ID remain the only input to the
hash algorithm.

0) Mailman core doesn't care what the algorithm is for calculating a permalink
   to the message in the archiver.  All it cares about is that this can be
   calculated using local information only, with no round-trip to the
   archiver.  Local information includes system configuration values, mailing
   list settings, and information available in the message being posted.

1) Multiple archivers can be enabled in any running Mailman core system, and
   these do not have to agree on the permalink calculation.  Every archiver is
   free to use whatever algorithm it wants.

2) RFC 2369 and RFC 5064 rule us.  This means that Mailman core will add a
   List-Archive header, which RFC 2369 defines as the "field describ[ing] how
   to access archives for the list".  This header does not point to a specific
   message in the archive, but instead to the list's archive as a whole.

   RFC 5064 defines the Archived-At header which "refer[s] to the archived
   form of a single message."  If you get a message from Mailman which
   contains an Archived-At header, you should be able to click on that to view
   the message in the archive.  It's this that I'm calling the 'permalink' to
   the message, and which must be calculated without round-tripping to the
   archiver.

So what is this hash thing and why do we need it?  Well, strictly speaking, we
don't.  If UltraKitty wants to define the permalink as the URL-encoded
Message-ID concatenated with the List-ID and Date, that's fine by Mailman.  As
long as item 0 above is satisfied, the core is happy.
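
As a rough illustration of such a hash-free permalink, here is a minimal
sketch.  The function name, the path layout, and the order of the parts are
all assumptions made for the example; nothing here is what UltraKitty or any
other archiver actually does.  The only point is that every input is local
information, so item 0 holds.

```python
from urllib.parse import quote


def plain_permalink(base_url, message_id, list_id, date):
    """Build a permalink from local information only (hypothetical layout)."""
    parts = [message_id, list_id, date]
    # safe="" percent-encodes "/", "<", ">" and "@" as well, so each part
    # stays a single path segment.
    return base_url.rstrip("/") + "/" + "/".join(quote(p, safe="") for p in parts)


url = plain_permalink(
    "http://lists.example.org", "<alpha>", "ant.example.org", "2012-04-24")
print(url)
```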

Where I believe the hash is useful is in providing a more human-friendly
string for *a* permalink to the message.  It doesn't have to be the only one;
it's just a convenience that you could imagine me reading to you over the
phone or typing into an SMS with my stubby old bass player fingers.

If you think of an archiver in REST terms, the RFC 5064 header value is just
one location for the resource (i.e. the message you care about) in the
archiver.  That same resource can have many different addresses; maybe you can
look it up by raw Message-ID, URL-encoded Message-ID, permalink hash, or
whatever.  All roads lead to the same resource in the archiver.  The permalink
hash isn't required for any of this to work; it's purely a convenience.

It doesn't even matter if the permalink points to multiple resources.

Let's say a message gets cross-posted so that multiple copies of it show up at
the archiver with the same Message-ID.  The archiver can certainly treat these
as separate resources, living at different canonical locations in its resource
tree.  But it could *also* honor a permalink that is identical for both
messages.  If you think of this as a tiny url for a search query, that could
return multiple hits, each of which would be the different versions of the
cross-posted message.

I could imagine a better UI though.  Let's say this message got cross-posted:

    From: Anne Person <aperson at example.com>
    To: ant at example.org, bee at example.org
    Subject: Ants and Bees are best friends!
    Message-ID: <alpha>

    Why should we fight?  The mosquitoes are our common enemy!

Now, if all we use is the Message-ID to calculate the permalink, you might see
the copies delivered to both mailing lists carrying the same RFC 5064
header:

    Archived-At: http://lists.example.org/XZ3DGG4V37BZTTLXNUX4NABB4DNQHTCP

You click on that url and you're taken to a page which contains the archived
message for one of the mailing lists (it doesn't matter which one), but you
see a little extra link on the page:

    View cross-posted thread in [ant at example.org] or [bee at example.org]

and those two links take you to the separate messages, at their canonical
locations, in the thread appropriate for one or the other mailing list.
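
On the archiver side, one way to picture this is an index that maps a single
permalink hash to every canonical location it knows for that hash.  This is
only a sketch; the class, its method names, and the canonical paths are
hypothetical.

```python
from collections import defaultdict


class ArchiveIndex:
    """Map one permalink hash to every canonical location (hypothetical)."""

    def __init__(self):
        self._by_hash = defaultdict(list)

    def add(self, permalink_hash, list_address, canonical_path):
        # Cross-posted copies share a hash but keep separate locations.
        self._by_hash[permalink_hash].append((list_address, canonical_path))

    def resolve(self, permalink_hash):
        # Return a copy so callers cannot mutate the index.
        return list(self._by_hash.get(permalink_hash, []))


index = ArchiveIndex()
index.add("XZ3DGG4V37BZTTLXNUX4NABB4DNQHTCP", "ant@example.org", "/ant/2012/04/1")
index.add("XZ3DGG4V37BZTTLXNUX4NABB4DNQHTCP", "bee@example.org", "/bee/2012/04/7")

# Show any one hit as the page; offer the rest as "View cross-posted thread".
hits = index.resolve("XZ3DGG4V37BZTTLXNUX4NABB4DNQHTCP")
```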

I'll note that in another message, Jeff advocates for using something shorter
than 32 bytes for the hash, and letting collisions just work themselves out.
Which frankly would be fine by me, but I see that as a very similar problem to
the cross-posting problem.  If the header said this instead:

    Archived-At: http://lists.example.org/4DNQHTCP

clicking on it might bring you to a page like this:

    Did you mean:
      * [Ants and Bees are best friends] in [ant at example.org]
      * [Ants and Bees are best friends] in [bee at example.org]
      * [Mosquitoes unite!] in [bloodsuckers at example.org]
      * [What about us dung beetles?] in [pests at example.org]
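
A sketch of how such a "Did you mean" page could be produced.  The short
hash in the example above happens to match the last eight characters of the
longer one shown earlier, so this assumes that shortening means keeping a
suffix; the function name and the sample data are made up for illustration.

```python
def candidates(short_hash, archive):
    """Return (subject, list) pairs whose full hash ends with short_hash.

    `archive` maps a full permalink hash to a list of (subject, list address)
    pairs.  Suffix matching is an assumption; an archiver could shorten the
    hash some other way.
    """
    hits = []
    for full_hash, messages in archive.items():
        if full_hash.endswith(short_hash):
            hits.extend(messages)
    return hits


archive = {
    "XZ3DGG4V37BZTTLXNUX4NABB4DNQHTCP": [
        ("Ants and Bees are best friends", "ant@example.org"),
        ("Ants and Bees are best friends", "bee@example.org"),
    ],
    # Made-up hash that collides on the last eight characters.
    "AAAAAAAAAAAAAAAAAAAAAAAA4DNQHTCP": [
        ("Mosquitoes unite!", "bloodsuckers@example.org"),
    ],
    "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB": [
        ("Unrelated message", "other@example.org"),
    ],
}

did_you_mean = candidates("4DNQHTCP", archive)
```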

Notice I haven't advocated for a particular hash algorithm, because for my
purposes right here, it doesn't matter.  What I do feel strongly about is that
the input to that hash should only include information directly available in
the originally posted message, e.g. Message-ID.  That way, if I get a copy
from the mailing list, but you get a copy directly, we both have (almost[*])
all the information we need to calculate the same RFC 5064 URL.
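
To make that concrete, here is a minimal sketch of a hash that takes only
the Message-ID as input.  SHA-1 followed by base32 is an assumption picked
because it yields a 32-character string like the one in the example header;
as said above, the actual algorithm doesn't matter for this argument.  The
function names are hypothetical.

```python
import base64
import hashlib


def permalink_hash(message_id):
    """Hash only the Message-ID (SHA-1 + base32 is an assumed choice)."""
    bare = message_id.strip()
    # Drop the angle brackets so "<alpha>" and "alpha" hash identically.
    if bare.startswith("<") and bare.endswith(">"):
        bare = bare[1:-1]
    digest = hashlib.sha1(bare.encode("utf-8")).digest()
    # A 20-byte digest encodes to exactly 32 base32 characters, no padding.
    return base64.b32encode(digest).decode("ascii")


def archived_at(list_archive, message_id):
    """Join the List-Archive base URL with the hash of the Message-ID."""
    return list_archive.rstrip("/") + "/" + permalink_hash(message_id)


# The list copy and the direct copy carry the same Message-ID, so both
# recipients compute the same URL once they know the List-Archive value.
from_list = archived_at("http://lists.example.org/", "<alpha>")
from_direct = archived_at("http://lists.example.org", "alpha")
```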

Cheers,
-Barry

[*] The one missing piece for the off-list copy is the value of the
List-Archive header.  If you can't find that out any other way, you're
screwed, but my guess is that in practice, it will be easy to find that out.
Or you can just Google the permalink hash or Message-ID to find the message.