[Mailman-Developers] Improving the archives
Barry Warsaw
barry at python.org
Tue Jul 24 21:11:27 CEST 2007
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On Jul 24, 2007, at 12:31 PM, Jeff Breidenbach wrote:
>> What complexity? Mailman just does
>>
>> msg['X-List-Archive-Received-ID'] = Email.msgid()
>
> Easy to introduce, harder to deal with. The archival server would now
> keep track of both the message-id and the x-list-archive-received-id.
> That's two namespaces that almost do the same thing. It's easier
> for the archive server to keep track of one name space than two,
> and - most importantly - conceptually simpler.
True, but an archiver already has to handle collisions on the Message-
ID so in a sense, you have to maintain multiple paths to the same
message, don't you?
So I like my proposal because it imposes nothing additional on the
MUA or MTA, a tiny bit more on the MLM, and some extra work (though I
think not much) on the archiving agent. What you gain from my
proposal over a pure Message-ID approach is guaranteed uniqueness
given the list copy, and human-friendlier URLs.
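For illustration only, here is a minimal sketch of the "tiny bit more
on the MLM" step, following the one-liner quoted above. The quoted
Email.msgid() is shorthand; this sketch assumes the standard library's
email.utils.make_msgid() as a stand-in. The helper name
stamp_list_copy and the domain lists.example.org are hypothetical.

```python
from email.message import Message
from email.utils import make_msgid

HEADER = 'X-List-Archive-Received-ID'

def stamp_list_copy(msg, domain='lists.example.org'):
    """Add an MLM-generated ID to the list copy, leaving Message-ID alone."""
    # Assumption: an ID that is already present is kept rather than replaced.
    if HEADER not in msg:
        msg[HEADER] = make_msgid(domain=domain)
    return msg

# Usage: the sender's Message-ID is untouched; only the new header is added.
msg = Message()
msg['Message-ID'] = '<original@example.com>'
stamp_list_copy(msg)
```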
> From the perspective of the assorted list servers, it's easier to
> do nothing than to do something. So if they can get by with
> just message-id (which is already implemented) and not have to add
> x-list-archive-received-id, that's a smoother implementation path.
> If we base on message-id, archival servers will be able to
> retroactively add support for all their stored messages, even those
> that are ten years old. And users holding an old message will be
> able to figure out that URL without doing any computational
> gymnastics.
All of these are still true with my proposal, with one caveat that
Stephen points out: given a URL based on sender-provided headers, you
must be prepared to deal with collisions, so sometimes your resources
will return lists. The advantage of adding a bit of MLM-provided
information is that given the list copy you can guarantee uniqueness,
and given the off-list copy you can get to a resource that contains a
link to the message you want.
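One way an archiver might model this, as a sketch only: a key derived
from sender-provided headers maps to every message that shares it, and
the MLM-provided sequence number picks out exactly one. The class and
method names are hypothetical, and how the sequence number is assigned
is left to the caller because the thread does not pin it down.

```python
class ArchiveIndex:
    """Toy index: sender-derived key -> {MLM sequence number: message}."""

    def __init__(self):
        self._by_key = {}

    def add(self, key, seqno, message):
        # Colliding keys are expected; each message keeps its own seqno slot.
        self._by_key.setdefault(key, {})[seqno] = message

    def candidates(self, key):
        """Off-list copy: only the key is known, so return a list of matches."""
        return sorted(self._by_key.get(key, {}).items())

    def exact(self, key, seqno):
        """List copy: key plus seqno identifies one message (or None)."""
        return self._by_key.get(key, {}).get(seqno)

index = ArchiveIndex()
index.add('SAMEKEY', 101, 'first message')
index.add('SAMEKEY', 205, 'second message, colliding key')
```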
> Put another way, there's the possibility to reduce the archive
> servers' implementation to "search for this message-id" which is
> something really useful to have anyway, and therefore likely to
> get wider support.
>
> In addition, Barry was talking about concocting a unique
> identifier from the Date field and Message-ID. I'm not a big fan of
> this idea, because the date field comes from the mail user agent
> and is often wildly corrupt, e.g. coming from 100 years in the future.
> Very painful if the archive is showing most recent message first.
> Therefore an archival server is very likely to determine message date
> from the most recent received header (generally from a trusted mail
> transfer agent) rather than the date field. From the archive server's
> perspective, the best thing to do with the date field is throw it
> away.
Throw it away or hide it? The former would be a problem, but not the
latter. Does your archiver keep a canonical copy of the message as
you received it? If so, then you preserve enough of the original Date
header for the calculation to occur, even if you hide the Date
header or display a Received header date when you render it to
HTML. How you render it doesn't matter, of course.
But I should point out that I'm not married to including the Date
header in the hash. I like it because it appears to reduce
collisions, which I care about. But I still like using the
base32-encoded SHA-1 hash instead of the raw Message-ID because I
think it's easier for humans to use, read, speak, and copy. Of course
this doesn't mean that you need to disable your search-by-Message-ID
feature!
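A rough sketch of the hash described here, using only the standard
library. The function name archive_key is hypothetical, and so is the
way the two headers are combined (plain concatenation, UTF-8 encoded);
the message does not specify either.

```python
import base64
import hashlib

def archive_key(message_id, date=None):
    """Base32-encoded SHA-1 digest of Message-ID, optionally plus Date."""
    data = message_id
    if date is not None:
        # Assumption: Date is simply appended to the Message-ID.
        data += date
    digest = hashlib.sha1(data.encode('utf-8')).digest()  # 20 bytes
    # 20 bytes is a multiple of 5, so base32 yields 32 chars with no padding.
    return base64.b32encode(digest).decode('ascii')

key = archive_key('<12345@example.com>', 'Tue, 24 Jul 2007 15:11:27 -0400')
```

The result uses only A-Z and 2-7, which is what makes it easier to
read aloud or retype than a raw Message-ID.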
> So for these reasons, I'd rather stick with message-id and risk
> some real world collisions, instead of introducing another identifier.
> If the list server receives a message with no message-id, by all means
> create one on the spot. To me, this feels like the sweet spot in
> terms of cost benefit. The main thing that bugs me is message-ids are
> long, which makes them awkward to embed in a URL in the footer of a
> message.
Another advantage of the URL scheme I propose. You know you're
going to end up with URLs of length
len(host-prefix) + 32 + 1 + #digits-in-seqno
(32 == len(base32(sha1digest(data))))
(1 == the / divider)
(#digits-in-seqno == e.g. len(str(seqno)))
You should be able to keep things in the 60-70 character range,
including the host name. That doesn't seem too bad.
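The length arithmetic can be checked with a short sketch. The host
prefix, the function name archive_url, and the sample sequence number
are made up for illustration; the 32-character key stands in for any
base32-encoded SHA-1 digest.

```python
def archive_url(host_prefix, key, seqno):
    """Join prefix, 32-char base32 key, a '/' divider, and the seqno."""
    if len(key) != 32:
        raise ValueError('expected a 32-character base32 SHA-1 key')
    return '%s%s/%d' % (host_prefix, key, seqno)

# Hypothetical values, chosen only to exercise the formula.
host_prefix = 'http://lists.example.org/'
sample_key = 'A' * 32
seqno = 12345

url = archive_url(host_prefix, sample_key, seqno)
expected_length = len(host_prefix) + 32 + 1 + len(str(seqno))
```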
- -Barry
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
iQCVAwUBRqZO4HEjvBPtnXfVAQIYGwP/VZPCiQrg9CTeMThApNTh7xUismbW0AiT
1N6a8DusXDBrqiLDQd+v2/R5KOV+TnwDNlIcl5FfFatHxWJ0bGy850kT/nhrHdKU
UrW0hR8PWSMIRN5Bqx9bL9cvaMigAoyX+njAfiDgl0yy7arbAm66GH1HNH3c1XGT
1/qaGckINUg=
=4uwH
-----END PGP SIGNATURE-----