[Mailman-Developers] Improving the archives
Barry Warsaw
barry at python.org
Fri Jul 20 14:02:34 CEST 2007
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On Jul 4, 2007, at 3:49 AM, Stephen J. Turnbull wrote:
>> Along with that, I would really like to come up with an algorithm for
>> calculating those urls without talking to the archiver.
>
> Brad didn't like this when I suggested it before, but I didn't really
> understand why not. Anyway, FWIW:
>
> I suggest adding an X-List-Received-ID header to all messages. I
> haven't really thought through whether the UUID in that field should
> be at least partly human-readable or not, but that doesn't matter for
> the basic idea.[1] The on-disk directory format would be
>
> /path-to-archive/private/my-list/Message-ID
>
> for singletons (Message-ID is the author-supplied ID) and
>
> /path-to-archive/private/my-list/Message-ID/List-Received-ID
>
> for multiples. These would be created on-the-fly when they occur.
> They can be served as static pages. For almost all messages, the bare
> URL
>
> http://archives.example.com/my-list/Message-ID
>
> should Just Work (ie, return a no-such-object result or a single
> message). Where it does not, you get an index of all pages with that
> message ID.
I think this suggestion has merit, but I'm going to riff on it a bit.
First, I want to avoid talking about file system layout. To me,
that's an implementation detail we needn't worry about right now.
Maybe the files will live on disk, maybe they'll live in a database,
maybe they'll live in an external system we don't control. I don't
care. What I want is a uniform way to calculate an address for a
message given nothing but its text and an interface for retrieving
messages from a service given that address. I'm thinking about this
in a RESTful way, and it's perfectly legitimate for that 'message
address' to be relative to some archive or message store root.
I've done some experiments. I took the top 5 mbox files on
python.org and ran them through a script that looked for message-id
collisions. Then I implemented 6 strategies for looking at whether
the collisions were true collisions or duplicates. Duplicates are
defined where every message in the same message-id bucket has the
same match criteria, and collisions are where at least one message in
the bucket is different. So for example, with strategy 2, if the
message-id and date headers are the same for every message in the
bucket, it's a dupe, otherwise it's a collision.
While I ran the script over each mbox separately, I think it's more
interesting to talk about them as a whole collection. I don't really
know how representative this would be of the world at large, but it's
interesting anyway. FTR, the lists were mailman-users, python-dev,
python-help, python-list, and tutor. I think there would be little
intentional cross-posting between these lists. Here are the numbers:
total 325146, missing: 624
1. msg.as_string(), dup: 34 (0.0104568409268%), col: 914
(0.281104488445%)
2. message-id + date, dup: 875 (0.269109876794%), col: 73
(0.0224514525782%)
3. message-id + 1st received, dup: 270 (0.0830396191249%), col: 678
(0.208521710247%)
4. message-id + all received, dup: 270 (0.0830396191249%), col: 678
(0.208521710247%)
5. message-id + date + 1st received, dup: 268 (0.0824245108351%),
col: 680 (0.209136818537%)
6. body_line_iterator(msg), dup: 659 (0.202678181494%), col: 289
(0.0888831478782%)
Notice that of 325146 total messages, 624 of them had no message-id
header. Even if you aggregate dup+col, you're still looking at a
total duplicate rate of 0.29%. While I'm almost tempted to ignore a
hit rate that low, if you think of an archive holding 1B messages,
you still get a lot of duplicates.
OTOH, the rate goes down even lower if you consider the message-id
and date headers. (Note, I did not consider messages missing a date
header). How likely is it that two messages with the same message-id
and date are /not/ duplicates? Heck, at that point, I'd feel
justified in simply automatically rejecting the duplicate and
chucking it from the archive.
I spent a /little/ time looking at the physical messages that ended
up as true collisions. Though by no means did I look at them all,
they all looked related. For example, with strategy 2 some messages
look like they'd been inadvertently sent before they were completed.
I need to see if there's any similarities in MUA behind these, but
again, I think we might be able to safely assume that collisions on
message-id+date can be ignored.
That leads me to the following proposal, which is just an elaboration
on Stephen's. First, all messages live in the same namespace; they
are not divided by target mailing list. Each message has two
addresses, one is the Message-ID and one is the base32 of the sha1
hash of the Message-ID + Date. As Stephen proposes, Mailman would
add these headers if an incoming message is missing them, and tough
luck for the non-list copy. The nice thing is that RFC 2822 requires
the Date header and states that Message-ID SHOULD be present.
Why the second address? First, it provides as close to a guaranteed
unique identifier as we can expect, and second because it produces a
nearly human readable format. For example, Stephen's OP would have a
second address of
>>> mid
'<87myycy5eh.fsf at uwakimon.sk.tsukuba.ac.jp>'
>>> date
'Wed, 04 Jul 2007 16:49:58 +0900'
>>> # XXX perhaps strip off angle brackets
>>> h = hashlib.sha1(mid)
>>> h.update(date)
>>> base64.b32encode(h.digest())
'RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI'
I like base32 instead of base64 because the more limited alphabet
should produce less ambiguous strings in certain fonts and I don't
think the short b64 strings are short enough to justify the
punctuation characters that would result. While RFC 3548 specifies
the b32 alphabet as using uppercase characters, I think any service
that accepts b32 ids should be case insensitive. A really Postel-y
service could even accept '1' for 'I' and '0' for 'O' just to make it
more resilient to human communication errors.
I'd like to come up with a good name for this second address, which
would suggest the name of the X- header we stash this value in. X-
B32-Message-ID isn't very sexy. Maybe X-Message-Global-ID, since I
think there's a reasonable argument to make that for well-behaved
messages, that's exactly what this is.
So now, think of the interface to a message store that supports this
addressing scheme. Well it's something like:
class MessageStore(Interface):
def store_message(message):
"""Store the message.
:raises ValueError: when the message is missing either the
Message-ID
header or a Date header.
:raises DuplicateMessageError: when a message in the store
already has
a matching Message-ID and Date. An archive is free to raise
this exception
for duplicate Message-IDs alone.
"""
def get_message_by_global_id(key):
"""Locate and return the message from the store that matches
`key`.
:param key: The Global ID of the message to locate. This is
the
base32 encoded SHA1 hash of the message's Message-ID and Date
headers.
:returns: The message object matching the Global ID, or None
if there
is no such match.
"""
def get_messages_by_message_id(key):
"""Return the set of messages with a matching Message-ID `key`.
:param key: The Message-ID of the messages to locate.
:returns: The set of all messages in this store that have
the given
Message-ID. If none such matches are found, the empty set is
returned.
"""
As far as generating pages based on the Message-ID or global id, I
agree with Stephen's proposal. A page returned in response to a
message-id request could return the message page or it could return
an index of such messages. It would be up to the archive whether it
would accept duplicate Message-IDs or not, but it would always be
guaranteed that a page returned in response to a global id request
would return one email message.
Urls could be calculated by concatenating the List-Archive and X-
Global-Message-ID headers, e.g.
http://mail.python.org/pipermail/mailman-developers/
RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI
would be the OP. This could point to the same resource as
http://mail.python.org/pipermail/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI
http://mail.python.org/pipermail/global/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI
and /might/ point to the same resource as:
http://mail.python.org/pipermail/mailman-developers/
87myycy5eh.fsf at uwakimon.sk.tsukuba.ac.jp
http://mail.python.org/pipermail/mids/
87myycy5eh.fsf at uwakimon.sk.tsukuba.ac.jp
> A minor drawback to my proposal is that if a message gets archived as
> a singleton for that Message-ID, then a duplicate arrives, previously
> created references in the archive will of course now return an index
> rather than the desired message. Ie, there is data corruption. This
> can be dealt with in several ways; the easiest would be to provide a
> "if-you-got-here-by-clicking-a-ref-from-this-archive-you're-looking-
> for-me"
> link when creating the directory for multiple instances.
Or by using the global id, or by rejecting messages with duplicate
message ids.
> There's also a *very* minor benefit: repeat sends will be immediately
> recognizable without checking Message-ID.
>
> Footnotes:
> [1] By partly human-readable I mean containing list-id and date
> information. The idea would be to have the date come first, so that
> users would have a shot at identifying which of several messages is
> most likely, and this would be searchable by eye with simply an
> ordinary sorted index.
I see searching, indexing, sorting, and providing other human
readable urls into the message store as a function of the archive.
Once you're looking at a link to the actual message, you're going to
be looking at a url that contains the global id, regardless of the
number of levels you have to go through or redirects involved.
Apologies for letting this thread linger so long. I'm very
interesting in hearing your thoughts and if there's general
agreement, I'll write it up in the wiki.
- -Barry
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (Darwin)
iQCVAwUBRqCkWnEjvBPtnXfVAQIRhAP7BkuF5K0xOuie2GBqOOWDarksD5Oy49y9
/2WO+u4xH+BttIt3adHJS+K6ETYcK79c5Rf4uwZk40DqWKK7ay1zkxUn/LGXOJ0o
CoWQG5ZyFUJUTkDXtxEWcZ8kkXaDTTSNz2eCtYgQAXw77A95E1SjV0YBs54bFK3A
Bi9cjrKRDcM=
=pyY6
-----END PGP SIGNATURE-----
More information about the Mailman-Developers
mailing list