[Mailman-Developers] Improving the archives
Barry Warsaw
barry at python.org
Wed Jul 25 15:34:13 CEST 2007
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On Jul 25, 2007, at 12:47 AM, Jeff Breidenbach wrote:
>> What you gain from my proposal over a pure Message-ID approach
>> is guaranteed uniqueness given the list copy
>
> Guarantee is a pretty strong word. A malicious person could post two
> messages with the same message-id, same date, but different bodies.
No question, if the archive service and the list server are not
intimately connected, the communication channel between the two can
be subverted. There are ways that channel's trust could be enhanced
though, for example by the list server signing its headers in a
DKIM-like fashion.
But in situations where the two are co-located, you can trust these
headers even without that enhancement.
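To make the header-signing idea a little more concrete, here is a rough
sketch. It is not DKIM: DKIM uses public-key signatures with keys published
in DNS, whereas this uses a shared-secret HMAC from the standard library
just to show the shape of the check. The function names, the choice of
headers, and the idea of carrying the result in an extra header are all
assumptions made for illustration.

```python
import hashlib
import hmac


def sign_headers(headers, names, secret):
    """Return a hex HMAC-SHA256 over the named headers, in the given order.

    `headers` is a plain dict of header name -> value; `secret` is bytes
    shared between the list server and the archiver (an assumption here).
    """
    canonical = "".join(
        "%s:%s\r\n" % (name.lower(), headers.get(name, "").strip())
        for name in names
    )
    return hmac.new(secret, canonical.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_headers(headers, names, secret, signature):
    """Archiver side: recompute the HMAC and compare in constant time."""
    expected = sign_headers(headers, names, secret)
    return hmac.compare_digest(expected, signature)
```

The list server would compute the signature when it sends the list copy,
and the archiver would refuse (or flag) any message whose signed headers
no longer match.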
> So that moves us to how many collisions are reduced in practice.
> I have a question about the numbers Barry mined from the python
> lists. Are the collisions really that high? One should not count
> messages without a message-id, because the MLM can and should
> create one in that case.
I've uploaded the script I used to here:
http://wiki.list.org/download/attachments/786633/scan.py?version=1
It's probably not perfect, and certainly the python.org mboxes may
not be representative enough of the real world. Please grab the
script, tweak it and run it over your own raw archives; it should be
easily modified to handle any of the mailbox formats supported by
Python 2.5's mailbox module.
If you improve the script or find numbers that lead to different
conclusions, now's the time to know!
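For anyone who wants a starting point without downloading anything, the
kind of count in question can be sketched roughly as follows. This is not
scan.py, just a minimal illustration written for current Python; the
function name and the result keys are made up for the example. It skips
messages with no Message-ID, on the theory that the MLM would create one
in that case.

```python
import mailbox
from collections import Counter


def count_collisions(mbox_path):
    """Count duplicate Message-IDs and duplicate (Message-ID, Date) pairs."""
    by_id = Counter()
    by_id_and_date = Counter()
    missing = 0
    for msg in mailbox.mbox(mbox_path):
        message_id = msg.get("Message-ID")
        if message_id is None:
            # Not counted as a collision: the list server can assign an id.
            missing += 1
            continue
        message_id = str(message_id).strip()
        date = str(msg.get("Date") or "").strip()
        by_id[message_id] += 1
        by_id_and_date[(message_id, date)] += 1
    return {
        "missing_id": missing,
        # Every occurrence after the first of a given key is one collision.
        "id_collisions": sum(n - 1 for n in by_id.values() if n > 1),
        "id_date_collisions": sum(
            n - 1 for n in by_id_and_date.values() if n > 1
        ),
    }
```

Running it over a raw archive and comparing the last two numbers gives a
feel for how much adding the Date header helps. Other mailbox formats
should only need a different class from the mailbox module.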
>> and human friendlier urls.
>
> That's a very compelling point.
>
> SHA1 can't be computed inside someone's head or simply cut-n-pasted
> together for old messages, but I think the usability benefits of short
> URLs (short enough that they can comfortably fit inside message bodies)
> outweigh this drawback. By the way, is SHA-1 still in favor? My
> impression was it was fading away after the Shandong University team
> partially cracked it.
We're not concerned with the cryptographic security claims of SHA1.
I don't see any economically beneficial attack on the archives
against SHA1 here. I think SHA1 is reasonably universally available,
and marginally better than MD5, so it's probably good enough for this
application.
You're right that no one is going to do SHA1 in their heads, and if
they could, they're probably working for some TLA in a secret gubmit
basement lab somewhere. The point of course is that a /program/
could easily apply the algorithm to a very minimal existing message
and come up with the same canonical url. This enables all kinds of
cool applications based on REST-y principles or whatever. The fact
that the algorithm leads to short(ish), largely unambiguous (to
humans), readable urls is an important benefit -- probably /the/ most
important benefit.
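As a rough illustration of what such a program might look like, here is a
sketch that derives an identifier from nothing but the Message-ID and Date
headers. The details are assumptions made for the example, not a
definition of the proposal: the way the two values are joined, the base32
encoding, and the function name are all hypothetical.

```python
import base64
import hashlib
from email import message_from_string


def canonical_id(raw_message):
    """Derive a stable identifier from a message's Message-ID and Date."""
    msg = message_from_string(raw_message)
    message_id = str(msg.get("Message-ID") or "").strip()
    date = str(msg.get("Date") or "").strip()
    # Assumed canonical form: the two header values joined by a newline.
    digest = hashlib.sha1((message_id + "\n" + date).encode("utf-8")).digest()
    # A 20-byte SHA1 digest base32-encodes to 32 characters with no padding.
    return base64.b32encode(digest).decode("ascii")
```

Because only those two headers feed the hash, a stripped-down copy of a
message and the full list copy produce the same identifier, which any
archive could then append to its own base url.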
>> Throw it away or hide [Date]? The former would be a problem,
>> but not the latter.
>
> Thrown away.
Really? Wow. I'd have thought every archiving service would want to
keep a record of the raw message it received on the wire. That would
allow it to regenerate the html archive if necessary, provide useful
forensics, and allow for exactly the kind of data mining we're doing
here. I can't see /any/ reason for not saving the raw messages in
their entirety, especially for a public list. Maybe for a private
one, where your data retention policies require you to delete things
after a certain amount of time, but even there, I can't see why you'd
want to trim raw messages rather than just chucking them entirely.
> My favorite archival service is based on mhonarc,
> and raw mail goes into offline cold storage.
What's the advantage of that? Isn't disk space cheap as dirt?
Probably cheaper, if you've bought any topsoil recently :). Even so,
the raw messages are still available, right? So if there was enough
value in calculating the canonical urls so that the archive service
could be seen as an interoperability good citizen, then it could be
done.
I'll just reiterate that I'm not married to including the Date header
in the algorithm. Until proven otherwise by more research, I think
it's a good idea to use because 1) it's required by RFC 2822 and 2)
it seems to reduce collisions. I think the algorithm I propose would
work just as well with Message-IDs alone, although there's more of a
chance that the non-sequence numbered url will return multiple matches.
- -Barry
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
iQCVAwUBRqdRVnEjvBPtnXfVAQJiOgP/UIufdisvgVPV3qKo4dV2bfWoUPcp/dIQ
iGj9faWXFwa/NoOk3HtIZbu7JVrJEY2t9nihJX6lEjZ1Q6AFH1hkObx0dV5NRfj2
KjRANxU6UsBvpDCzBQWthX1d7HviRJ74Pio5hVti+0YoV4pjq8UHaxTlrECHmkad
ERlOYR2onAQ=
=8b8I
-----END PGP SIGNATURE-----