Re: [Mailman-Developers] [Bug 985149] Add List-Post value to permalink hash input

I want to step back for a moment and look at some fundamentals. I'll reply to other messages in this thread later, but in the context of my own thoughts expressed here.
TL;DR: I'm going to propose that we keep the Message-ID as the only input to the hash algorithm.
Mailman core doesn't care what the algorithm is for calculating a permalink to the message in the archiver. All it cares about is that this can be calculated using local information only, with no round-trip to the archiver. Local information includes system configuration values, mailing list settings, and information available in the message being posted.
Multiple archivers can be enabled in any running Mailman core system, and these do not have to agree on the permalink calculation. Every archiver is free to use whatever algorithm it wants.
RFC 2369 and RFC 5064 rule us. This means that Mailman core will add a List-Archive header, which RFC 2369 defines as the "field describ[ing] how to access archives for the list". This header does not point to a specific message in the archive, but instead to the list's archive as a whole.
RFC 5064 defines the Archived-At header which "refer[s] to the archived form of a single message." If you get a message from Mailman which contains an Archived-At header, you should be able to click on that to view the message in the archive. It's this that I'm calling the 'permalink' to the message, and which must be calculated without round-tripping to the archiver.
So what is this hash thing and why do we need it? Well, strictly speaking, we don't. If UltraKitty wants to define the permalink as the URL-encoded Message-ID concatenated with the List-ID and Date, that's fine by Mailman. As long as item 0 above is satisfied (the permalink can be calculated from local information alone), the core is happy.
Where I believe the hash is useful is in providing a more human-friendly string for *a* permalink to the message. It doesn't have to be the only one; it's just a convenience that you could imagine me reading to you over the phone or typing into an SMS with my stubby old bass player fingers.
If you think of an archiver in REST terms, the RFC 5064 header value is just one location for the resource (i.e. the message you care about) in the archiver. That same resource can have many different addresses; maybe you can look it up by raw Message-ID, URL-encoded Message-ID, permalink hash, or whatever. All roads lead to the same resource in the archiver. The permalink hash isn't required for any of this to work, it's purely a convenience.
It doesn't even matter if the permalink points to multiple resources.
Let's say a message gets cross-posted so that multiple copies of it show up at the archiver with the same Message-ID. The archiver can certainly treat these as separate resources, living at different canonical locations in its resource tree. But it could *also* honor a permalink that is identical for both messages. If you think of this as a tiny url for a search query, that could return multiple hits, each of which would be the different versions of the cross-posted message.
I could imagine a better UI though. Let's say this message got cross-posted:
From: Anne Person <aperson@example.com>
To: ant@example.org, bee@example.org
Subject: Ants and Bees are best friends!
Message-ID: <alpha>
Why should we fight? The mosquitoes are our common enemy!
Now, if all we use is the Message-ID to calculate the permalink, you might see both messages delivered to both mailing lists with the following RFC 5064 header:
Archived-At: http://lists.example.org/XZ3DGG4V37BZTTLXNUX4NABB4DNQHTCP
You click on that url and you're taken to a page which contains the archived message for one of the mailing lists (it doesn't matter which one), but you see a little extra link on the page:
View cross-posted thread in [ant@example.org] or [bee@example.org]
and those two links take you to the separate messages, at their canonical locations, in the thread appropriate for one or the other mailing list.
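To make the "one permalink, several resources" idea concrete, here is a minimal sketch of an archiver-side lookup. Everything in it is an assumption for illustration: the `ArchiveIndex` class, its method names, and the canonical paths are hypothetical and do not come from any real archiver.

```python
from collections import defaultdict


class ArchiveIndex:
    """Toy in-memory index: one permalink hash may map to several copies."""

    def __init__(self):
        # permalink hash -> list of (list address, canonical path) pairs
        self._by_hash = defaultdict(list)

    def add(self, permalink_hash, list_address, canonical_path):
        """Record one archived copy of a message under its permalink hash."""
        self._by_hash[permalink_hash].append((list_address, canonical_path))

    def resolve(self, permalink_hash):
        """Return every archived copy that shares this permalink hash."""
        return list(self._by_hash.get(permalink_hash, []))


index = ArchiveIndex()
# Hypothetical canonical locations for the two cross-posted copies.
index.add("XZ3DGG4V37BZTTLXNUX4NABB4DNQHTCP", "ant@example.org", "/ant/thread/1")
index.add("XZ3DGG4V37BZTTLXNUX4NABB4DNQHTCP", "bee@example.org", "/bee/thread/7")

hits = index.resolve("XZ3DGG4V37BZTTLXNUX4NABB4DNQHTCP")
for list_address, path in hits:
    print(f"View cross-posted thread in [{list_address}] at {path}")
```

With one hit the archiver could go straight to the message; with several it could show any one copy plus links to the others, as described above.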
I'll note that in another message, Jeff advocates for using something shorter than 32 bytes for the hash, and letting collisions just work themselves out. Which frankly would be fine by me, but I see that as a very similar problem to the cross-posting problem. If the header said this instead:
Archived-At: http://lists.example.org/4DNQHTCP
clicking on it might bring you to a page like this:
Did you mean:
* [Ants and Bees are best friends] in [ant@example.org]
* [Ants and Bees are best friends] in [bee@example.org]
* [Mosquitoes unite!] in [bloodsuckers@example.org]
* [What about us dung beetles?] in [pests@example.org]
Notice I haven't advocated for a particular hash algorithm, because for my purposes right here, it doesn't matter. What I do feel strongly about is that the input to that hash should only include information directly available in the originally posted message, e.g. Message-ID. That way, if I get a copy from the mailing list, but you get a copy directly, we both have (almost[*]) all the information we need to calculate the same RFC 5064 URL.
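As a rough sketch of the "Message-ID only" input, the following assumes a base32-encoded SHA-1 digest. That choice is an assumption made only because it yields 32 characters like the example header above; nothing here settles the algorithm. The function names and the stripping of angle brackets are likewise hypothetical.

```python
import hashlib
from base64 import b32encode


def permalink_hash(message_id: str) -> str:
    """Hash a Message-ID into a 32-character base32 string (assumed scheme)."""
    mid = message_id.strip()
    # Assumption: drop the surrounding angle brackets before hashing, so
    # "<alpha>" and "alpha" give the same result.
    if mid.startswith("<") and mid.endswith(">"):
        mid = mid[1:-1]
    digest = hashlib.sha1(mid.encode("utf-8")).digest()  # 20 bytes
    return b32encode(digest).decode("ascii")  # 160 bits -> 32 chars, no padding


def archived_at(list_archive: str, message_id: str) -> str:
    """Join the archive base URL and the hash into an Archived-At value."""
    return list_archive.rstrip("/") + "/" + permalink_hash(message_id)


print(archived_at("http://lists.example.org/", "<alpha>"))
```

Because the only message-derived input is the Message-ID, the list copy and the off-list copy produce the same hash; only the archive base URL has to come from somewhere else.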
Cheers, -Barry
[*] The one missing piece for the off-list copy is the value of the List-Archive header. If you can't find that out any other way, you're screwed, but my guess is that in practice, it will be easy to find that out. Or you can just Google the permalink hash or Message-ID to find the message.

I didn't mean to be presumptuous. I think you are right that user interfaces can do a good job with crossposts. Here's a great example from GMane.
http://mid.gmane.org/20120323220013.0b1c88a8@resist.wooz.org
Is 32 bytes too long?
With thirty-two characters, a first collision becomes 50% likely once the archival database hits approximately 1.4 septillion messages.
Is 4 bytes too short?
Four characters is only about a million combinations. A first collision is 50% likely at about 1200 messages, and multi-million message databases are completely screwed.
Bottom line: how big a database do we expect to have, and amongst those messages, how many collisions are considered acceptable?
-Jeff
PS. These numbers assume a well balanced hash. This paper suggests SHA-1 is pretty good in non-adversarial situations, but I'm not an expert. http://cseweb.ucsd.edu/~mihir/papers/balance.html
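The figures above can be reproduced with the usual birthday-bound approximation, n ≈ sqrt(2 · N · ln(1/(1-p))), where N is the number of possible hash values. This is a sketch under two assumptions: each character carries 5 bits (a 32-symbol alphabet such as base32), and the hash is perfectly balanced. The function name is made up for illustration.

```python
import math


def messages_for_collision(chars: int, alphabet: int = 32, p: float = 0.5) -> float:
    """Approximate message count at which a first collision has probability p."""
    space = alphabet ** chars  # number of distinct hash strings
    return math.sqrt(2 * space * math.log(1 / (1 - p)))


for chars in (4, 8, 32):
    print(f"{chars:2d} characters: about {messages_for_collision(chars):.3g} messages")
```

For 4 characters this gives roughly 1,200 messages, and for 32 characters roughly 1.4e24, in line with the numbers quoted above.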

participants (2): Barry Warsaw, Jeff Breidenbach