Re: [Mailman-Developers] [Bug 985149] Add List-Post value to permalink hash input

23 Apr 2012

On Apr 20, 2012, at 01:19 PM, Jeff Breidenbach wrote:
...

Terri is exactly right. The reason for including list identity as
part of the hash calculation is for cross-posted messages. An
archiving service shows context. Here's the message AND the thread it
fits into, AND information about the list it travelled over AND the
ability to search that list further. Archives need to know the list to
provide context.

Agreed, but I think you'll get all that information anyway, without it being
expressed in the hash.  You'll get a full copy of the posted message, so
you'll get the Message-ID, To header (i.e. the posting address), List-Post (if
there is one), List-ID, etc.
...

The reason mail-archive.com uses List-Post and not List-Id in the
calculation is that every list, RFC 2369 compliant or not, has a
concept of a posting address. It is a natural idea, easy to think of and
understand. Hence all mail-archive.com archives are keyed off of
posting address. It would be technically possible (but an architectural
pain) for mail-archive.com to calculate using List-Id. We'd probably
not bother and instead store whatever was calculated by Mailman and
placed in the Archived-At: header. Okay, I'll admit my prejudice. I've
always found List-Id annoying, and wish that it didn't exist.

Note that the message you receive may not have a useful List-Post header at
all!  From RFC 2369:

    3.4. List-Post

    The List-Post field describes the method for posting to the list.
    This is typically the address of the list, but MAY be a moderator, or
    potentially some other form of submission. For the special case of a
    list that does not allow posting (e.g., an announcements list), the
    List-Post field may contain the special value "NO".

(I think neither mm2 nor mm3 does this right.  See LP: #987563)
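
To make the RFC 2369 caveat concrete, here is a minimal Python sketch of how an archiver might pull a posting address out of a received message while tolerating a missing header or the special value "NO". The function name `posting_address` and the sample addresses are hypothetical, and this is not how Mailman or mail-archive.com actually do it; it only illustrates the cases described above.

```python
import re
from email import message_from_string


def posting_address(raw_message: str):
    """Return the mailto address from List-Post, or None if unusable.

    Hypothetical helper: returns None when the header is absent, when it
    carries the RFC 2369 special value "NO", or when it has no mailto URL.
    """
    msg = message_from_string(raw_message)
    value = msg.get("List-Post")
    if value is None:
        return None
    value = value.strip()
    # "NO" may be followed by a parenthesized comment.
    if value.upper().startswith("NO"):
        return None
    # The usual form is <mailto:list@example.com>, possibly with ?params.
    match = re.search(r"<mailto:([^>?]+)", value)
    return match.group(1) if match else None


normal = "Message-ID: <abc@example.com>\nList-Post: <mailto:list@example.com>\n\nbody\n"
announce = "Message-ID: <abc@example.com>\nList-Post: NO (posting not allowed)\n\nbody\n"
print(posting_address(normal))
print(posting_address(announce))
```

An archiver that keys on the posting address would need some fallback (the To header or List-Id, say) for the `None` cases.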
...

As long as things are changing, I want to mention that these URLs
feel too long. SHA-1 is a 160-bit hash consuming 32 URL characters in
base32. I think trimming to a 64-bit (13 character) hash is plenty.
According to Wikipedia's collision tables, with the shorter hash we'd
expect to get our first collision after archiving 5 billion messages.
That's 50X the current corpus size of public archival services like
GMane. And it isn't like an occasional hash collision is a big deal or
a security problem. http://en.wikipedia.org/wiki/Birthday_attack
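
The figures above can be checked with the standard birthday approximation, n ≈ sqrt(2 · 2^bits · ln(1/(1-p))). The sketch below is only an illustration; the function names are made up here, and it assumes hash values are uniformly distributed and base32-encoded at 5 bits per character.

```python
import math


def messages_for_collision(bits: int, probability: float = 0.5) -> float:
    """Approximate number of hashed messages before a collision is seen
    with the given probability (birthday approximation)."""
    space = 2.0 ** bits
    return math.sqrt(2.0 * space * math.log(1.0 / (1.0 - probability)))


def base32_chars(bits: int) -> int:
    """URL characters needed for a hash of this size at 5 bits per character."""
    return math.ceil(bits / 5)


for bits in (64, 80, 160):
    print(bits, base32_chars(bits), f"{messages_for_collision(bits):.3g}")
```

For 64 bits this gives 13 characters and roughly 5 billion messages for an even chance of a first collision, in line with the numbers quoted above.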

Let's say we take the lower 80 bits of the SHA1.  After base32 encoding, that
leaves us with 16 bytes.  Of course, we could also use the full 160 bit SHA1
hash, and take only the lower X number of bytes after the base32 encoding.
I'm all in favor of a shorter URL, but someone with better Maths-Fu will have
to propose a specific algorithm that adequately trades off collisions for
human-friendliness.  Also, note the implications of increased collisions on
the whole argument, which I brought up in my previous message.
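
The two truncation options mentioned above can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, the hash input is left as opaque bytes because the thread has not settled what goes into it, and "lower bits" is assumed to mean the trailing bytes of the digest.

```python
import hashlib
from base64 import b32encode


def low_bits_then_encode(hash_input: bytes, bits: int = 80) -> str:
    """Keep the trailing `bits` of the SHA-1 digest, then base32-encode.

    bits must be a multiple of 40 so the base32 output needs no padding.
    """
    if bits % 40 != 0 or not 0 < bits <= 160:
        raise ValueError("bits must be a multiple of 40 between 40 and 160")
    digest = hashlib.sha1(hash_input).digest()
    return b32encode(digest[-(bits // 8):]).decode("ascii")


def encode_then_truncate(hash_input: bytes, length: int = 16) -> str:
    """Base32-encode the full 160-bit digest, then keep the last `length` characters."""
    digest = hashlib.sha1(hash_input).digest()
    return b32encode(digest).decode("ascii")[-length:]


sample = b"<example@example.com>"
print(low_bits_then_encode(sample))      # 16 characters
print(encode_then_truncate(sample, 13))  # 13 characters
```

Under these assumptions the two approaches give the same string whenever the kept length is a multiple of 8 characters; encoding first and truncating afterwards simply allows odd lengths such as 13.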
...
3b) For that matter, a sequence number would also do the trick, but I
can understand that this is much more dangerous; it is easy for a
sequence number to get reset and cause all hell to break loose.

It would also be nearly impossible to preserve the zeroth principle, that
Mailman and the archiver can agree on the permalink for a message with no
communication between them.
...

I'm really not that picky. Our archival service could deal with all
sorts of URLs, including the ones Terri was trying to avoid, such as
http://example.com/archiver/listname.example.com/$hash
In fact, we've found that lots of small, per-list databases have speed
and reliability advantages over big global databases. But I also like
short URLs. Bottom line, please don't let these comments delay or
derail forward progress.

No worries!  We'll hash (pun intended ;) this out in plenty of time before 3.0
final.  With Richard's suggestion of a version number, we could even roll out
updates in future versions, although it would probably be more of a PITA for
you by then, than us. :)

Cheers,
-Barry

Barry Warsaw
