[Mailman-Developers] Improving the archives

Wed Jul 25 15:34:13 CEST 2007

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Jul 25, 2007, at 12:47 AM, Jeff Breidenbach wrote:

>> What you gain from my proposal over a pure Message-ID approach
>> is guaranteed uniqueness given the list copy
>
> Guarantee is a pretty strong word. A malicious person could post two
> messages with the same message-id, same date, but different bodies.

No question, if the archive service and the list server are not  
intimately connected, the communication channel between the two can  
be subverted.  There are ways that channel's trust could be enhanced
though, for example by the list server signing its headers in a
DKIM-like fashion.

But in situations where the two are co-located, you can trust these  
headers even without that enhancement.

> So that moves us to how many collisions are reduced in practice.
> I have a question about the numbers Barry mined from the python
> lists. Are the collisions really that high? One should not count
> messages without a message-id, because the MLM can and should
> create one in that case.

I've uploaded the script I used to here:

http://wiki.list.org/download/attachments/786633/scan.py?version=1

It's probably not perfect, and the python.org mboxes certainly may
not be representative of the real world.  Please grab the script,
tweak it, and run it over your own raw archives; it should be easy to
modify to handle any of the mailbox formats supported by Python 2.5's
mailbox module.

If you improve the script or find numbers that lead to different  
conclusions, now's the time to know!
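
For anyone who wants a starting point, here is a minimal sketch of this
kind of scan.  It is not the scan.py linked above.  The function name
count_collisions and the choice to skip messages that lack a Message-ID
are assumptions made for illustration.

```python
import mailbox  # only needed for the usage shown in the comment below
from collections import Counter


def count_collisions(messages):
    """Count messages whose key duplicates an earlier message's key.

    Returns (dup_by_id, dup_by_id_and_date): duplicates when keying on
    Message-ID alone versus on the (Message-ID, Date) pair.  Messages
    without a Message-ID are skipped, on the assumption that the MLM
    would have created one for them.
    """
    by_id = Counter()
    by_id_date = Counter()
    for msg in messages:
        raw_id = msg.get('Message-ID')
        if raw_id is None:
            continue
        msgid = str(raw_id).strip()
        date = str(msg.get('Date') or '').strip()
        by_id[msgid] += 1
        by_id_date[(msgid, date)] += 1
    # Every occurrence beyond the first counts as one collision.
    dup_by_id = sum(n - 1 for n in by_id.values())
    dup_by_id_and_date = sum(n - 1 for n in by_id_date.values())
    return dup_by_id, dup_by_id_and_date


# Usage (the path is a placeholder):
#   count_collisions(mailbox.mbox('/path/to/raw/archive.mbox'))
# mailbox.Maildir, mailbox.MH and friends should work the same way.
```

Comparing the two numbers gives a rough idea of how much adding the
Date header helps on a given archive.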

>> and human friendlier urls.
>
> That's a very compelling point.
>
> SHA1 can't be computed inside someone's head or simply cut-n-pasted
> together for old messages, but I think the usability benefits of
> short URLs (short enough that they can comfortably fit inside
> message bodies) outweigh this drawback. By the way, is SHA-1 still
> in favor? My impression was it was fading away after the Shandong
> University team partially cracked it.

We're not concerned with the cryptographic security claims of SHA1.
I don't see any economically beneficial attack against SHA1 on the
archives here.  I think SHA1 is reasonably universally available,
and marginally better than MD5, so it's probably good enough for this
application.

You're right that no one is going to do SHA1 in their heads, and if  
they could, they're probably working for some TLA in a secret gubmit  
basement lab somewhere.  The point of course is that a /program/  
could easily apply the algorithm to a very minimal existing message  
and come up with the same canonical url.  This enables all kinds of  
cool applications based on REST-y principles or whatever.  The fact  
that the algorithm leads to short(ish), largely unambiguous (to  
humans), readable urls is an important benefit -- probably /the/ most  
important benefit.
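
To make that concrete, here is a rough sketch of what such a program
might look like.  The details are assumptions, not the definition of
the proposal: it concatenates the stripped Message-ID and Date header
values, takes the SHA1 digest, and base32-encodes it.  The base url and
the name canonical_url are hypothetical.

```python
import base64
import hashlib

# Hypothetical archive location, used only for illustration.
ARCHIVE_BASE = 'http://archive.example.com/message/'


def canonical_url(msg, base=ARCHIVE_BASE):
    """Derive a url from a message's Message-ID and Date headers.

    Only those two headers are consulted, so a very minimal copy of
    the message is enough to reproduce the same url.
    """
    msgid = msg.get('Message-ID')
    date = msg.get('Date')
    if msgid is None or date is None:
        raise ValueError('message needs both Message-ID and Date')
    # Assumed canonicalization: strip whitespace, concatenate, UTF-8.
    key = (str(msgid).strip() + str(date).strip()).encode('utf-8')
    digest = hashlib.sha1(key).digest()
    # A 20-byte SHA1 digest base32-encodes to 32 characters, unpadded.
    token = base64.b32encode(digest).decode('ascii')
    return base + token
```

Two parties holding copies of the same message would arrive at the same
url without talking to each other, provided neither copy has had its
Message-ID or Date rewritten along the way.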

>> Throw it away or hide [Date]?  The former would be a problem,
>> but not the latter.
>
> Thrown away.

Really?  Wow.  I'd have thought every archiving service would want to  
keep a record of the raw message it received on the wire.  That would  
allow it to regenerate the html archive if necessary, provide useful  
forensics, and allow for exactly the kind of data mining we're doing  
here.  I can't see /any/ reason for not saving the raw messages in  
their entirety, especially for a public list.  Maybe for a private  
one, where your data retention policies require you delete things  
after a certain amount of time, but even there, I can't see why you'd  
want to trim raw messages rather than just chucking them entirely.

> My favorite archival service is based on mhonarc,
> and raw mail goes into offline cold storage.

What's the advantage of that?  Isn't disk space cheap as dirt?
Probably cheaper if you've bought any topsoil recently :).  Even so,
the raw messages are still available, right?  So if there were enough
value in calculating the canonical urls that the archive service
could be seen as a good interoperability citizen, then it could be
done.

I'll just reiterate that I'm not married to including the Date header  
in the algorithm.  Until proven otherwise by more research, I think  
it's a good idea to use because 1) it's required by RFC 2822 and 2)  
it seems to reduce collisions.  I think the algorithm I propose would
work just as well with Message-IDs alone, although there's more of a
chance that the non-sequence-numbered url will return multiple matches.

- -Barry

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqdRVnEjvBPtnXfVAQJiOgP/UIufdisvgVPV3qKo4dV2bfWoUPcp/dIQ
iGj9faWXFwa/NoOk3HtIZbu7JVrJEY2t9nihJX6lEjZ1Q6AFH1hkObx0dV5NRfj2
KjRANxU6UsBvpDCzBQWthX1d7HviRJ74Pio5hVti+0YoV4pjq8UHaxTlrECHmkad
ERlOYR2onAQ=
=8b8I
-----END PGP SIGNATURE-----