Re: [Mailman-Developers] [Bug 985149] Add List-Post value to permalink hash input
On Apr 20, 2012, at 01:19 PM, Jeff Breidenbach wrote:
- Terri is exactly right. The reason for including list identity as part of the hash calculation is for cross-posted messages. An archiving service shows context. Here's the message AND the thread it fits into, AND information about the list it travelled over AND the ability to search that list further. Archives need to know the list to provide context.
Agreed, but I think you'll get all that information anyway, without it being expressed in the hash. You'll get a full copy of the posted message, so you'll get the Message-ID, To header (i.e. the posting address), List-Post (if there is one), List-ID, etc.
- The reason mail-archive.com uses List-Post and not List-Id in the calculation is because every list, RFC2369 compliant or not, has a concept of a posting address. It is a natural idea, easy to think of and understand. Hence all mail-archive.com archives are keyed off of posting address. It would be technically possible (but an architectural pain) for mail-archive.com to calculate using List-Id. We'd probably not bother and instead store whatever was calculated by mailman and placed in the Archived-At: header. Okay, I'll admit my prejudice. I've always found List-Id annoying, and wish that it didn't exist.
Note that the message you receive may not have a useful List-Post header at all! From RFC 2369:
3.4. List-Post
The List-Post field describes the method for posting to the list. This is typically the address of the list, but MAY be a moderator, or potentially some other form of submission. For the special case of a list that does not allow posting (e.g., an announcements list), the List-Post field may contain the special value "NO".
(I think neither mm2 nor mm3 does this right. See LP: #987563)
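To make the RFC 2369 caveat concrete, here is a minimal sketch of how an archiver might read List-Post defensively. The function name posting_address is hypothetical, and treating any non-mailto value as "no usable address" is an assumption for illustration, not something Mailman or mail-archive.com is known to do.

```python
import re
from email import message_from_string


def posting_address(raw_message):
    """Return the mailto: address from List-Post, or None if unusable.

    Hypothetical helper: it only illustrates the cases RFC 2369 allows.
    """
    msg = message_from_string(raw_message)
    value = msg.get('List-Post')
    if value is None:
        # The list (or an intermediate hop) supplied no List-Post at all.
        return None
    if re.match(r'\s*NO\b', value, re.IGNORECASE):
        # Special value "NO": the list does not accept posts.
        return None
    # Take the first mailto: URL and drop any ?subject=... parameters.
    match = re.search(r'<mailto:([^>?]+)', value, re.IGNORECASE)
    return match.group(1) if match else None
```

Any hash input that includes List-Post would need a defined fallback for the None cases, otherwise the two sides could compute different permalinks.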
- As long as things are changing, I want to mention that these URLs feel too long. SHA-1 is a 160-bit hash consuming 32 URL characters. I think trimming to a 64-bit (13 character) hash is plenty. According to Wikipedia collision tables, with the shorter hash we'd expect to get our first collision after archiving 5 billion messages. That's 50X the current corpus size of public archival services like GMane. And it isn't like an occasional hash collision is a big deal or a security problem. http://en.wikipedia.org/wiki/Birthday_attack
Let's say we take the lower 80 bits of the SHA1. After base32 encoding, that leaves us with 16 bytes. Of course, we could also use the full 160 bit SHA1 hash, and take only the lower X number of bytes after the base32 encoding. I'm all in favor of a shorter URL, but someone with better Maths-Fu will have to propose a specific algorithm that adequately trades off collisions for human-friendliness. Also, note the implications of increased collisions on the whole argument, which I brought up in my previous message.
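As a rough way to check the numbers above, the sketch below truncates a SHA-1 digest to its low-order bits before base32 encoding, and estimates how many messages give a 50% chance of at least one collision using the standard birthday approximation sqrt(2 ln 2 * 2^bits). The function names are hypothetical, and reading "lower 80 bits" as the last 10 bytes of the digest is an assumption.

```python
import hashlib
import math
from base64 import b32encode


def short_hash(data: bytes, bits: int = 80) -> str:
    """Base32-encode the low `bits` bits of the SHA-1 digest of `data`."""
    if bits % 8 or not 0 < bits <= 160:
        raise ValueError('bits must be a multiple of 8 between 8 and 160')
    digest = hashlib.sha1(data).digest()
    # Assumption: "lower" bits means the trailing bytes of the digest.
    tail = digest[-(bits // 8):]
    # Drop the '=' padding that base32 adds for non-multiple-of-5 byte counts.
    return b32encode(tail).decode('ascii').rstrip('=')


def messages_for_half_chance(bits: int) -> float:
    """Approximate message count for a 50% chance of any collision."""
    return math.sqrt(2 * math.log(2) * 2.0 ** bits)


for bits in (64, 80, 160):
    print(bits, len(short_hash(b'example', bits)),
          f'{messages_for_half_chance(bits):.2e}')
```

With this encoding, 160 bits comes out at 32 characters, 80 bits at 16, and 64 bits at 13, and the 64-bit estimate lands near the 5 billion figure quoted above.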
3b) For that matter, a sequence number would also do the trick, but I can understand that this is much more dangerous; it is easy for a sequence number to get reset and cause all hell to break loose.
It would also be nearly impossible to preserve the zeroth principle, that Mailman and the archiver can agree on the permalink for a message with no communication between them.
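The zeroth principle can be sketched like this: both sides derive the hash purely from headers carried in the message itself, so no shared state is needed. This is only an illustration under stated assumptions. The name permalink_hash, the newline separator, and the choice to feed Message-ID followed by List-Post into SHA-1 are all hypothetical, not the algorithm Mailman ships.

```python
import hashlib
from base64 import b32encode
from email import message_from_string


def permalink_hash(raw_message: str) -> str:
    """Derive a permalink hash from headers alone (hypothetical layout)."""
    msg = message_from_string(raw_message)
    # Strip whitespace and the surrounding angle brackets from Message-ID.
    message_id = msg.get('Message-ID', '').strip().strip('<>')
    list_post = msg.get('List-Post', '').strip()
    # Assumed input layout: Message-ID, a newline, then List-Post.
    data = (message_id + '\n' + list_post).encode('utf-8')
    return b32encode(hashlib.sha1(data).digest()).decode('ascii')


raw = ("Message-ID: <abc123@example.com>\n"
       "List-Post: <mailto:list-a@example.com>\n"
       "\n"
       "hello\n")

# Mailman and the archiver each run the same function on their own copy.
mailman_side = permalink_hash(raw)
archiver_side = permalink_hash(raw)
print(mailman_side == archiver_side)
```

A sequence number has no equivalent: the archiver cannot recompute it from the message, which is why it would need communication between the two.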
- I'm really not that picky. Our archival service could deal with all sorts of URLs, including the ones Terri was trying to avoid, such as http://example.com/archiver/listname.example.com/$hash . In fact, we've found that lots of small, per-list databases have speed and reliability advantages over big global databases. But I also like short URLs. Bottom line, please don't let these comments delay or derail forward progress.
No worries! We'll hash (pun intended ;) this out in plenty of time before 3.0 final. With Richard's suggestion of a version number, we could even roll out updates in future versions, although it would probably be more of a PITA for you by then, than us. :)
Cheers, -Barry