Mailman 3 Improving the archives - Mailman-Developers

Improving the archives

Terri Oda

July 3, 2007

3:06 a.m.

Since I've largely finished up the coding contract that was eating up
a lot of my time, I'm thinking that I'd like to do some coding for
fun. And nothing says fun like trying to fix the Mailman archives! ;)

I'm trying to remember all the things people have suggested for the
archives in the past so I can figure out what needs to be done and
what might be nice to have, and see if this is doable in the time I
have in the foreseeable future.

The big things people wanted most, if I recall correctly, included:

modernized HTML/CSS/Themes (preferably to match a modernized web
interface... is that all set up now?)
archive links that won't break if the archive is rebuilt
better address obfuscation (maybe by generating pages through cgi)
search
not adding a billion dependencies to Mailman

Here's the list from the wiki's Mailman 2.2 page: http://wiki.list.org/display/DEV/Mailman+2.2

 * Reconsider using a 3rd-party archiver
 * Perhaps URLs to messages should be based on message-ids
   instead of message numbers so that regenerating archives can't break
   links. This must include backward compatible links
 * Ditch direct access and vend all archive messages through CGI
   so that we can do address obfuscation, and message deletion, etc. on
   the fly (with caching of course, but have to worry about web crawlers).
 * Add RSS feed
 * Allow for admins to remove or edit messages through the web.
 * Move archive threads into another list?
 * Put archives in the list/mylist directory.
 * Add a search option
 * Make archives default template look and feel similar to Web UI
   (whatever it looks like after the Summer of Code project is done)
 * Make archive templatable (at least by changing CSS) so they
   can match people's existing site look-and-feel
 * MUAs usually make URLs clickable. A new Archive could be used
   when posts are distributed, in the footer, so that each message has a
   link to the whole thread in the Archive.
 * Present all messages in a thread at once, and offer plaintext
   download of the whole thread
 * Put messages into a database and/or move away from mbox as the
   canonical storage format.

So the questions are:

(1) Is anyone working on this already?
(2) What else is on people's wish lists for a pipermail replacement?

Terri


Steve Huston

July 2007

11:36 a.m.

I'll admit to not having read previous discussions on this topic, but I'll also add my 2 <insert-lowest-denomination-coin> here:

On 7/2/07 11:06 PM, Terri Oda wrote:

...

better address obfuscation (maybe by generating pages through cgi)

I run a few Wordpress sites, and there's a plugin I use called PHPEnkoder which does a good job of this. It basically wraps the address around a little bit of Javascript; if you have Javascript turned on in the browser, it's seamless, and if not you see "Javascript required to view address" or something like that. The theory is that bots and such don't run JS, so it's "safe" from harvesting. I'll leave it to the list as to how true an assessment this is, but it Works For Me :>

...

 * Add a search option

I know there's been patches around forever that integrate ht://Dig with Pipermail; maybe some way to do this, while still making it an option that can be tuned? If ht://Dig is there and you turn on the option, it works, but if it's not then it's not required? This would satisfy the "not adding a billion dependencies", but may be overkill as well. I'll also happily admit to not knowing much about the cost of search engines to a system.

...

 * MUAs usually make URLs clickable. A new Archive could be used
when posts are distributed, in the footer, so that each message has a
link to the whole thread in the Archive.

This would be a Godsend. A group at work here runs an old homebrewed exploder, and a few years ago I tried to convert them to Mailman. They liked everything they saw, up until the point where they couldn't refer to some kind of short and simple message number, and get right to that message in the archive. The current system generates a number based on a simple incrementing index of the list, and many months after a mailing people will refer to "message #483", and know they can view it at http://hostname/foo/listname/483.html - which is also posted in the footer of the message sent out. Of course, if the archives were based on Message-ID headers, this may make such a number a bit unwieldy, but if it were some kind of simple-ish system I might finally get rid of those old lists :>

-- 
Steve Huston - W2SRH - Unix Sysadmin, Dept. of Astrophysical Sciences
  Princeton University  |    ICBM Address: 40.346525   -74.651285
    126 Peyton Hall     |"On my ship, the Rocinante, wheeling through
  Princeton, NJ   08544 | the galaxies; headed for the heart of Cygnus,
    (609) 258-7375      | headlong into mystery."  -Rush, 'Cygnus X-1'

Barry Warsaw

12:13 a.m.


Steve makes me think of a couple of other wish list items.

On Jul 3, 2007, at 7:36 AM, Steve Huston wrote:

...

I have this idea that you could gateway messages from an archive or
mailing list to and from a bulletin board forum. Maybe this doesn't
fall within the scope of the archiver because I could see a 'forum
queue' like we have an nntp queue, but in that case, being able to
calculate an archive url without talking to the archiver becomes
important again. It would be nice in that case to put a link to the
archive message in the forum post.

...

This reminds me, I would love to have a link in an archive message
that I could click to get the message sent to me, as it originally
appeared on the mailing list. If I had that, I'd never need to
locally save another mailing list post. I'd just search for the one
I wanted, go to the archive, click on the "send it to me" link, then
do a normal reply in my mail reader.

...

This would be possible with today's system, but it leads to unstable
urls, especially when you consider archive scrubbing (which, come to
think of it, is another wish list item ;). We'd like for an admin to
be able to easily pull an archive message, but it's even worse than
that. Sometimes an admin has to scrub the actual backing message
store (e.g. today's mbox file). This will change the message counts
and thus the incremental indexes.

Maybe a way to think about this is that the canonical url is based on
the message-id, but then there's some way to distill even this down
to a tinyurl or simple integer that would be stable in the face of
full archive regenerations.

-Barry


Dale Newfield

5:16 p.m.

I'm all for someone taking ownership of this long-neglected component -- thank you for doing so!

Barry Warsaw wrote:

...

The resistance to basing this on message-id has always been that there's no guarantee of uniqueness... ...but I believe each list has some sort of counter for how many messages it's seen, so we could add another header with that number, and use as a unique id the two concatenated together... (That way the archiver can know from the content of the header exactly how to generate the same unique id as mailman, which would allow for the url-in-the-footer to happen w/o first hitting the archiver.)

Just throwing out ideas, -Dale

Barry Warsaw

12:19 p.m.


On Jul 4, 2007, at 1:16 PM, Dale Newfield wrote:

...

I'm not crazy about this idea for a couple of reasons. First, it
means that someone who has a copy of the message that didn't come
from the list (e.g. one of the two you will get of this message),
cannot calculate this unique ID. Second, things can happen to a list
that might cause this sequence number to get corrupted. Maybe a list
will get deleted and then recreated. Maybe it will get moved and the
sequence number will get reset in the move. Maybe the list will be
upgraded to a new version of Mailman.

I think we can do just as well by using Message-ID + Date and get
very low collision rates.

-Barry


Stephen J. Turnbull

1:31 p.m.

Barry Warsaw writes:

...

Second, things can happen to a list
that might cause this sequence number to get corrupted.

Add an X-Mailman-Sequence-Number header if not already present.

That doesn't deal with your other comments, but as I point out elsewhere, if you don't use *any* Mailman-specific information in the global ID, you have no sane way to handle collisions except throw them away (or make the global ID refer to a collection resource, but that's kinda unintuitive).

Barry Warsaw

2:07 p.m.


On Jul 20, 2007, at 9:31 AM, Stephen J. Turnbull wrote:

...

I'd probably call it X-List-Sequence-Number and I'd have to ensure
that archive copy had that header in it. OTOH, if I'm going to go to
the trouble of adding this sequence number, why not just calculate a
(more likely) gid for the message myself? If I did that, I could use
a tinyurl scheme and get much shorter urls. The archiver would then
be obliged to use my X-List-GID header verbatim.

I've been pushing for calculating this using non-Mailman headers
because I'd /like/ for a client receiving the non-list copy to be
able to make the same calculation. OTOH, maybe we can have it both
ways.

So, we calculate the sequence number and generate the following headers:

X-List-Sequence-Number: 801
X-List-Message-GID: RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI

The latter is composed of purely author generated data, the former is
supplied by Mailman.

Assuming we also had this header:

List-Archive: http://archive.example.com/gid/

then the following url would point to the same exact resource:

http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI
http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI/801

If however we subsequently got a collision, then these two urls would
address different resources. E.g.:

X-List-Sequence-Number: 2112
X-List-Message-GID: RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI

Now the two messages would still be addressable by their respective
urls:

http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI/801
http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI/2112

but

http://archive.example.com/gid/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI

would be a disambiguation page. For a web u/i it would be an HTML
list containing relative links to '801' and '2112'. A RESTful XML
document would contain the set of links to the subordinate pages. A
client of the archive.example.com service would have to be prepared
to handle disambiguation pages if it used only the author generated
GID, but it would be guaranteed that the full url would lead directly
to one and only one email message.

Archives would have to recognize the X-List-Sequence-Number and honor
it whenever it regenerated its archives so that the urls would remain
stable.
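The resolution rules described above can be sketched roughly as follows. This is only an illustration, not code from Mailman: the class name `GidIndex`, its methods, and the tuple return values are hypothetical, and the index lives in memory purely for brevity.

```python
from collections import defaultdict


class GidIndex:
    """Toy in-memory index: GID -> {sequence number -> message text}."""

    def __init__(self):
        self._buckets = defaultdict(dict)

    def add(self, gid, seq, message):
        """File a message under its author-derived GID and list sequence number."""
        self._buckets[gid][seq] = message

    def resolve(self, path):
        """Resolve 'GID' or 'GID/SEQ', relative to the List-Archive base.

        Returns ('message', text), ('index', [seq, ...]) or ('missing', None).
        """
        gid, _, seq = path.strip("/").partition("/")
        bucket = self._buckets.get(gid, {})
        if seq:
            # The full url always names at most one message.
            if seq.isdigit() and int(seq) in bucket:
                return ("message", bucket[int(seq)])
            return ("missing", None)
        if not bucket:
            return ("missing", None)
        if len(bucket) == 1:
            # No collision yet: the bare GID points at the message itself.
            return ("message", next(iter(bucket.values())))
        # Collision: the bare GID becomes a disambiguation page.
        return ("index", sorted(bucket))
```

With one message filed under a GID, the bare GID and GID/801 resolve to the same message; once a second message with the same GID arrives as 2112, the bare GID returns the index [801, 2112] while both full paths keep working.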

Thinking about this more (and I've been up since about 3:30am so I'm
a little foggy right now ;), we may want to optimize for fewer dupes
rather than fewer collisions, or maybe it doesn't matter. It would
be interesting to see how big the message-id buckets are when only
using the Message-ID header.

-Barry


Jeff Breidenbach

7:30 p.m.

...

I'd suggest the reverse. Keep the canonical archive URL short and sweet, and then use a URL redirection service to map message-ids to those URLs. It is the archiver's job to make it all work. For example, the canonical archive URL might stay exactly the way it is in pipermail. But the archival link embedded in the message would instead go to a redirection service.

http://mail.codeit.com/pipermail/zcommerce/2002-February/000523.html
http://mail.codeit.com/msgid?002701c4eb3d$07170ca0$3142003e@ADSL
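A minimal sketch of such a redirection table, assuming the message-id is carried percent-encoded in the query string (the example URL above carries it raw). The class `MessageIdRedirector`, its method names, and the `mail.example.com` host in the usage note are hypothetical and only illustrate the idea.

```python
from urllib.parse import quote, unquote


class MessageIdRedirector:
    """Toy redirection table: Message-ID -> canonical archive URL."""

    def __init__(self, base):
        self._base = base  # hypothetical endpoint, e.g. "http://mail.example.com/msgid"
        self._canonical = {}

    def register(self, message_id, canonical_url):
        """Called by the archiver once it knows where the message landed."""
        self._canonical[message_id.strip("<>")] = canonical_url

    def link_for(self, message_id):
        """Link to embed in the outgoing message.

        It depends only on the Message-ID, so it can be computed before
        the archiver has seen the message.
        """
        return self._base + "?" + quote(message_id.strip("<>"), safe="")

    def resolve(self, query):
        """Return the canonical URL for a query string, or None if unknown."""
        return self._canonical.get(unquote(query))
```

In use, `link_for("<002701c4eb3d$07170ca0$3142003e@ADSL>")` on a redirector based at `http://mail.example.com/msgid` encodes `$` and `@` so the link survives mail clients, and `resolve()` on the query part returns whatever canonical URL the archiver registered.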

The one other thing I'd like to revisit is integration with third party archival services. There are two obvious integration points; one is a button in the Mailman list admin user interface that says "archive with service X" not unlike the setting in Firefox that basically says "search with service X". The other integration point is the archival link discussed above, in which case it would be set to something like:

http://third-party-service/msgid?002701c4eb3d$07170ca0$3142003e@ADSL

Disclosure: I help run a third party archiving service, and this topic was discussed quite a bit previously. [1] Nonetheless it seems like a good time to revisit given the current discussion about archive wishlists.

[1] http://www.mail-archive.com/mailman-developers@python.org/msg08772.html

Jeff Breidenbach

4:48 a.m.

...

In which case [the message body link] would be set to something like.

http://third-party-service/msgid?002701c4eb3d$07170ca0$3142003e@ADSL

Just for fun, I did a trial implementation. It works, but the URLs are too long. For example, the URL below spends 59 characters on the message-id, and 27 characters on the listname. We're already over my comfort level (of about 72 characters) and haven't even started to count the hostname, and other URL-lengthening overhead. Maybe this was a bad idea after all.

http://www.mail-archive.com/search?l=mailman-developers%40python.org&q=e03b90ae0707041230m47110705t89cdbe3d2e4802cd@mail.gmail.com

Jeff

Barry Warsaw

12:27 p.m.


On Jul 4, 2007, at 3:30 PM, Jeff Breidenbach wrote:

...

I agree. My proposed global message id is exactly the canonical
archive URL, although it's relative to the archiver's base url, as
given in the List-Archive header.

...

I think we could define an interface that archive services would have
to meet in order to be available to list admins. The site admin
would of course have to enable them site-wide first. What kinds of
information would be required?

- List-Archive base url
- Message injection procedure
- Additional subscription procedures

The nice thing is that if my global id idea works, the injection
process can be completely asynchronous.

...

All we'd need to know is the third party's List-Archive header
value. The last part of the path would always be the global message id.

-Barry


Barry Warsaw

12:05 a.m.


On Jul 2, 2007, at 11:06 PM, Terri Oda wrote:

...

Since I've largely finished up the coding contract that was eating up a lot of my time, I'm thinking that I'd like to do some coding for fun. And nothing says fun like trying to fix the Mailman archives! ;)

That would be awesome Terri! It's an aspect of Mailman that sorely
needs attention, and you will gain (even more) fame and fortune by
working on it. :) I totally support this effort.

...

I'm trying to remember all the things people have suggested for the archives in the past so I can figure out what needs to be done and what might be nice to have, and see if this is doable in the time I have in the foreseeable future.

The big things people wanted most, if I recall correctly, included:

modernized HTML/CSS/Themes (preferably to match a modernized web interface... is that all set up now?)

It's not, but Andrew Kuchling will be working on this. I haven't yet
revealed detailed plans, though I'm working on an email about this
over the U.S. July 4th holiday. But I suppose it's time for a quick
summary: I'd like to get a Mailman 2.2 out with an updated u/i sooner
rather than later, and if possible an updated archiver would be one
of those few other new features that I think could go into a 2.2.
OTOH, it would be fine if we pushed that off to Mailman 3 too, but it
leveraged all the u/i work to be done in 2.2.

...

archive links that won't break if the archive is rebuilt

Yes, this is absolutely critical, in fact, I'd put it right at the
top of the list, even more so than a u/i overhaul. Stable urls, with
backward compatible redirecting links if at all possible, would be
fantastic.

Along with that, I would really like to come up with an algorithm for
calculating those urls without talking to the archiver. This would
allow the list delivery queue to calculate the List-Archive: header
value and any message header/footer substitutions before the message
hits the archiver.

...

better address obfuscation (maybe by generating pages through cgi)

I'd still love to do this, and I think were it not for crawlers, we
could get a lot of mileage out of creation on demand and caching.
But how do you handle Google crawling your archive?

...

search

Another huge huge feature.

...

not adding a billion dependencies to Mailman

Definitely. I'm also not opposed to changing the interface between
Mailman and the archivers if necessary.

...

Here's the list from the wiki's Mailman 2.2 page: http:// wiki.list.org/display/DEV/Mailman+2.2

We should probably start a separate archiver wiki page. I plan on
re-organizing the 2.2 page anyway, so I'll probably end up doing that if
you don't get around to it before me <wink>.

...

(1) Is anyone working on this already?

Not that I know of.

...

(2) What else is on people's wish lists for a pipermail replacement?

Other things high on my list are ditching the crufty storage
currently being used (pickles begone!), an RSS feed, and a 'message
storage' which could be used to vend archived messages through other
delivery transports, such as imap or nntp. But I'd be willing to put
all that off for stable urls, an updated u/i, and searching.

Anything I can do to help, please let me know.

-Barry


Stephen J. Turnbull

7:49 a.m.

Barry Warsaw writes:

...

+1. I've been wanting to do something about this, and have made proposals (not backed with code, mea maxima culpa) for design. I would definitely be happy to help with this, but given time constraints, it would be nice if somebody else could take the lead.

...

Along with that, I would really like to come up with an algorithm for
calculating those urls without talking to the archiver.

Brad didn't like this when I suggested it before, but I didn't really understand why not. Anyway, FWIW:

I suggest adding an X-List-Received-ID header to all messages. I haven't really thought through whether the UUID in that field should be at least partly human-readable or not, but that doesn't matter for the basic idea.[1] The on-disk directory format would be

/path-to-archive/private/my-list/Message-ID

for singletons (Message-ID is the author-supplied ID) and

/path-to-archive/private/my-list/Message-ID/List-Received-ID

for multiples. These would be created on-the-fly when they occur. They can be served as static pages. For almost all messages, the bare URL

http://archives.example.com/my-list/Message-ID

should Just Work (ie, return a no-such-object result or a single message). Where it does not, you get an index of all pages with that message ID.

The main drawback to using Message IDs that I can see is that broken MUAs may supply no Message-ID, or the same one repeatedly. In the former case, as a last resort Mailman can supply one, but that won't help people who get a personal copy and want to find the thread. However, I see no way to help them, anyway, beyond a generic archive search engine. In the latter, you get lots of messages matching the Message-ID, and while most lists should have *zero* problems, a list that has any instances of this problem would have many. Again I can't see a good way to deal with this other than a general search facility, as computing a digest of headers or content is hard to do reliably. Providing an index of matching posts seems like a reasonable approach, which can be efficiently implemented (eg, as static pages). Furthermore, the examples I've seen of both in the last few years have all been either spam or (in the case of duplicate Message-IDs) actual duplicates due to some mail system problem or itchy user fingers.

A minor drawback to my proposal is that if a message gets archived as a singleton for that Message-ID, then a duplicate arrives, previously created references in the archive will of course now return an index rather than the desired message. Ie, there is data corruption. This can be dealt with in several ways; the easiest would be to provide a "if-you-got-here-by-clicking-a-ref-from-this-archive-you're-looking-for-me" link when creating the directory for multiple instances.

There's also a *very* minor benefit: repeat sends will be immediately recognizable without checking Message-ID.

Footnotes: [1] By partly human-readable I mean containing list-id and date information. The idea would be to have the date come first, so that users would have a shot at identifying which of several messages is most likely, and this would be searchable by eye with simply an ordinary sorted index.

John A. Martin

4:58 p.m.

...

st> The main drawback to using Message IDs that I can see is that
st> broken MUAs may supply no Message-ID, or the same one
st> repeatedly.  In the former case, as a last resort Mailman can
st> supply one,

If the archive is considered to be a reflection of what Mailman _put_ on the wire, as distinct from what was received from the wire, then adding a Message-ID in the absence of one already present is a reflection of a SHOULD requirement of rfc(2)822. In the absence of a Message-ID on an outgoing mail message many if not most MTAs will add one. Why not let Mailman anticipate the need to add a Message-ID when archiving the message rather than leaving it to the outgoing MTA?

jam

Stephen J. Turnbull

3:09 a.m.

John A. Martin writes:

...

Quite.

My reason for saying "last resort" is simply that this is not predictable to third parties. Eg, I send you (a non-subscriber) a message with CC and no Message-ID. You'd like to find the thread in the archives. You may as well just do a linear search on that month's threads.

An URL based on an MD5 of the message body in theory would work, but in the presence of non-ASCII bodies, structured MIME, ML digests, and various MTA autoconversions, that seems fragile.

Barry Warsaw

12:45 p.m.


On Jul 9, 2007, at 11:09 PM, Stephen J. Turnbull wrote:

...

Yep, and I say "tough". Let John complain to Stephen to fix his MTA
to add those Message-IDs so Mailman doesn't have to. ;)

...

Agreed, and it would do no better, in fact worse, than base32(sha1(message-id + date))

-Barry


Barry Warsaw

12:02 p.m.


On Jul 4, 2007, at 3:49 AM, Stephen J. Turnbull wrote:

...

I think this suggestion has merit, but I'm going to riff on it a bit.

First, I want to avoid talking about file system layout. To me,
that's an implementation detail we needn't worry about right now.
Maybe the files will live on disk, maybe they'll live in a database,
maybe they'll live in an external system we don't control. I don't
care. What I want is a uniform way to calculate an address for a
message given nothing but its text and an interface for retrieving
messages from a service given that address. I'm thinking about this
in a RESTful way, and it's perfectly legitimate for that 'message
address' to be relative to some archive or message store root.

I've done some experiments. I took the top 5 mbox files on
python.org and ran them through a script that looked for message-id
collisions. Then I implemented 6 strategies for looking at whether
the collisions were true collisions or duplicates. Duplicates are
defined where every message in the same message-id bucket has the
same match criteria, and collisions are where at least one message in
the bucket is different. So for example, with strategy 2, if the
message-id and date headers are the same for every message in the
bucket, it's a dupe, otherwise it's a collision.
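The dupe-versus-collision test described above might look something like the sketch below. It is not the script that was actually run: the function names are hypothetical, and counting per bucket (rather than per message) is an assumption, since the text does not say which unit the numbers use.

```python
from collections import defaultdict
from email import message_from_string


def classify(messages, key):
    """Bucket messages by Message-ID and label each multi-message bucket.

    `key` maps a message to the match criterion of one strategy.
    Returns (dupes, collisions, missing); dupes and collisions count
    buckets (an assumption), missing counts messages with no Message-ID.
    """
    buckets = defaultdict(list)
    missing = 0
    for msg in messages:
        message_id = msg["message-id"]
        if message_id is None:
            missing += 1
        else:
            buckets[message_id].append(msg)
    dupes = collisions = 0
    for bucket in buckets.values():
        if len(bucket) < 2:
            continue
        # Same criterion for every message in the bucket: a duplicate.
        if len({key(m) for m in bucket}) == 1:
            dupes += 1
        else:
            collisions += 1
    return dupes, collisions, missing


def by_id_and_date(msg):
    """Match criterion for strategy 2: Message-ID plus Date."""
    return (msg["message-id"], msg["date"])


sample = [
    message_from_string("Message-ID: <a@example.com>\nDate: Tue, 3 Jul 2007 03:06:00 -0000\n\nhello\n"),
    message_from_string("Message-ID: <a@example.com>\nDate: Tue, 3 Jul 2007 03:06:00 -0000\n\nhello\n"),
    message_from_string("Message-ID: <b@example.com>\nDate: Tue, 3 Jul 2007 11:36:00 -0000\n\ndraft\n"),
    message_from_string("Message-ID: <b@example.com>\nDate: Tue, 3 Jul 2007 11:40:00 -0000\n\nfinal\n"),
    message_from_string("Date: Tue, 3 Jul 2007 12:13:00 -0000\n\nno id\n"),
]
print(classify(sample, by_id_and_date))
```

Swapping in a different `key` function (for example one that returns the full message text) gives the other strategies without touching the bucketing.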

While I ran the script over each mbox separately, I think it's more
interesting to talk about them as a whole collection. I don't really
know how representative this would be of the world at large, but it's
interesting anyway. FTR, the lists were mailman-users, python-dev,
python-help, python-list, and tutor. I think there would be little
intentional cross-posting between these lists. Here are the numbers:

total 325146, missing: 624

1. msg.as_string(), dup: 34 (0.0104568409268%),
   col: 914 (0.281104488445%)
2. message-id + date, dup: 875 (0.269109876794%),
   col: 73 (0.0224514525782%)
3. message-id + 1st received, dup: 270 (0.0830396191249%),
   col: 678 (0.208521710247%)
4. message-id + all received, dup: 270 (0.0830396191249%),
   col: 678 (0.208521710247%)
5. message-id + date + 1st received, dup: 268 (0.0824245108351%),
   col: 680 (0.209136818537%)
6. body_line_iterator(msg), dup: 659 (0.202678181494%),
   col: 289 (0.0888831478782%)

Notice that of 325146 total messages, 624 of them had no message-id
header. Even if you aggregate dup+col, you're still looking at a
total duplicate rate of 0.29%. While I'm almost tempted to ignore a
hit rate that low, if you think of an archive holding 1B messages,
you still get a lot of duplicates.
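As a quick check of the arithmetic, here is the aggregate rate worked out from the totals quoted above, scaled to the hypothetical one-billion-message archive. The variable names are just for illustration.

```python
total = 325146
dup, col = 34, 914  # strategy 1; every strategy sums to the same 948

rate = (dup + col) / total
projected = round(rate * 1_000_000_000)

print(f"aggregate rate: {rate:.2%}")
print(f"expected in a 1B-message archive: {projected:,}")
```

At that rate an archive of a billion messages would see on the order of 2.9 million messages sharing a Message-ID with another one.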

OTOH, the rate goes down even lower if you consider the message-id
and date headers. (Note, I did not consider messages missing a date
header). How likely is it that two messages with the same message-id
and date are /not/ duplicates? Heck, at that point, I'd feel
justified in simply automatically rejecting the duplicate and
chucking it from the archive.

I spent a /little/ time looking at the physical messages that ended
up as true collisions. Though by no means did I look at them all,
they all looked related. For example, with strategy 2 some messages
look like they'd been inadvertently sent before they were completed.
I need to see if there's any similarities in MUA behind these, but
again, I think we might be able to safely assume that collisions on
message-id+date can be ignored.

That leads me to the following proposal, which is just an elaboration
on Stephen's. First, all messages live in the same namespace; they
are not divided by target mailing list. Each message has two
addresses, one is the Message-ID and one is the base32 of the sha1
hash of the Message-ID + Date. As Stephen proposes, Mailman would
add these headers if an incoming message is missing them, and tough
luck for the non-list copy. The nice thing is that RFC 2822 requires
the Date header and states that Message-ID SHOULD be present.
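A minimal sketch of the second address, assuming the two raw header values are stripped and concatenated with no separator and encoded as UTF-8 before hashing. The proposal does not pin down those details, so the function name `global_id` and the exact byte layout are assumptions, and the output is not claimed to reproduce any ID quoted in this thread.

```python
import base64
import hashlib
from email import message_from_string


def global_id(msg):
    """Return base32(sha1(Message-ID + Date)) for an email message.

    Assumption: header values are stripped, joined with no separator,
    and encoded as UTF-8; the proposal leaves this unspecified.
    """
    message_id = msg["message-id"]
    date = msg["date"]
    if message_id is None or date is None:
        raise ValueError("message needs both Message-ID and Date headers")
    data = (message_id.strip() + date.strip()).encode("utf-8")
    # A SHA-1 digest is 20 bytes, which base32 encodes as exactly
    # 32 characters with no '=' padding.
    return base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


example = message_from_string(
    "Message-ID: <example-1@example.com>\n"
    "Date: Tue, 3 Jul 2007 03:06:00 -0000\n"
    "\n"
    "body\n"
)
print(global_id(example))
```

Because the inputs are author-supplied headers only, anyone holding a copy of the message can recompute the same ID without asking the archiver.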

Why the second address? First, it provides as close to a guaranteed
unique identifier as we can expect, and second because it produces a
nearly human readable format. For example, Stephen's OP would have a
second address of

...

I like base32 instead of base64 because the more limited alphabet
should produce less ambiguous strings in certain fonts and I don't
think the short b64 strings are short enough to justify the
punctuation characters that would result. While RFC 3548 specifies
the b32 alphabet as using uppercase characters, I think any service
that accepts b32 ids should be case insensitive. A really Postel-y
service could even accept '1' for 'I' and '0' for 'O' just to make it
more resilient to human communication errors.
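A forgiving lookup along those lines could be sketched as below. The helper names are hypothetical; the only fact relied on is that the base32 alphabet is A-Z plus 2-7, so '0' and '1' never appear in a valid ID and can be read as 'O' and 'I' without ambiguity.

```python
import base64


def normalize_global_id(text):
    """Upper-case the ID and map the look-alike digits '0' and '1'
    to the letters 'O' and 'I'."""
    return text.strip().upper().replace("0", "O").replace("1", "I")


def is_valid_global_id(text):
    """True if text normalizes to the base32 form of a 20-byte SHA-1 digest."""
    try:
        raw = base64.b32decode(normalize_global_id(text))
    except ValueError:  # binascii.Error is a subclass of ValueError
        return False
    return len(raw) == 20
```

Python's `base64.b32decode` also accepts `casefold` and `map01` arguments that cover the same leniency at decode time; normalizing the string first has the advantage that the cleaned-up ID can be used directly as a lookup key.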

I'd like to come up with a good name for this second address, which
would suggest the name of the X- header we stash this value in.
X-B32-Message-ID isn't very sexy. Maybe X-Message-Global-ID, since I
think there's a reasonable argument to make that for well-behaved
messages, that's exactly what this is.

So now, think of the interface to a message store that supports this
addressing scheme. Well it's something like:

class MessageStore(Interface):
    def store_message(message):
        """Store the message.

        :raises ValueError: when the message is missing either the
            Message-ID header or a Date header.
        :raises DuplicateMessageError: when a message in the store
            already has a matching Message-ID and Date.  An archive is
            free to raise this exception for duplicate Message-IDs alone.
        """

    def get_message_by_global_id(key):
        """Locate and return the message from the store that matches key.

        :param key: The Global ID of the message to locate.  This is
            the base32 encoded SHA1 hash of the message's Message-ID
            and Date headers.
        :returns: The message object matching the Global ID, or None
            if there is no such match.
        """

    def get_messages_by_message_id(key):
        """Return the set of messages with a matching Message-ID `key`.

        :param key: The Message-ID of the messages to locate.
        :returns: The set of all messages in this store that have
            the given Message-ID.  If none such matches are found, the
            empty set is returned.
        """
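To make the interface concrete, here is one way an in-memory store could satisfy it. This is a sketch under assumptions, not a proposed implementation: the class name `InMemoryMessageStore` is hypothetical, the Global ID is computed from the raw header values concatenated as UTF-8, `store_message` returns the Global ID as a convenience, and matches come back as a list rather than a set.

```python
import base64
import hashlib
from email import message_from_string


class DuplicateMessageError(Exception):
    """Raised when the store already holds this Message-ID and Date."""


class InMemoryMessageStore:
    """Dictionary-backed store keyed by Global ID and by Message-ID."""

    def __init__(self):
        self._by_global_id = {}
        self._by_message_id = {}

    @staticmethod
    def _global_id(message_id, date):
        # Assumption: plain concatenation of the two header values.
        digest = hashlib.sha1((message_id + date).encode("utf-8")).digest()
        return base64.b32encode(digest).decode("ascii")

    def store_message(self, message):
        message_id, date = message["message-id"], message["date"]
        if message_id is None or date is None:
            raise ValueError("Message-ID and Date headers are required")
        key = self._global_id(message_id, date)
        if key in self._by_global_id:
            raise DuplicateMessageError(key)
        self._by_global_id[key] = message
        self._by_message_id.setdefault(message_id, []).append(message)
        return key

    def get_message_by_global_id(self, key):
        return self._by_global_id.get(key)

    def get_messages_by_message_id(self, key):
        return list(self._by_message_id.get(key, []))


store = InMemoryMessageStore()
first = message_from_string(
    "Message-ID: <example-1@example.com>\nDate: Tue, 3 Jul 2007 03:06:00 -0000\n\nbody\n"
)
key = store.store_message(first)
print(key, len(store.get_messages_by_message_id("<example-1@example.com>")))
```

This variant accepts two messages that share a Message-ID but differ in Date, which is the more permissive of the two behaviours the docstring allows.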

As far as generating pages based on the Message-ID or global id, I
agree with Stephen's proposal. A page returned in response to a
message-id request could return the message page or it could return
an index of such messages. It would be up to the archive whether it
would accept duplicate Message-IDs or not, but it would always be
guaranteed that a page returned in response to a global id request
would return one email message.

Urls could be calculated by concatenating the List-Archive and
X-Global-Message-ID headers, e.g.

http://mail.python.org/pipermail/mailman-developers/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI

would be the OP. This could point to the same resource as

http://mail.python.org/pipermail/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI
http://mail.python.org/pipermail/global/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI

and /might/ point to the same resource as:

http://mail.python.org/pipermail/mailman-developers/87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp
http://mail.python.org/pipermail/mids/87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp
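The concatenation of the List-Archive value and the global id can be sketched in a few lines. The function name `archive_url` is hypothetical; stripping optional angle brackets and forcing a single trailing slash are assumptions about how a List-Archive value might be written, not something the proposal specifies.

```python
def archive_url(list_archive, global_id):
    """Join a List-Archive header value and a global id into one url."""
    # Tolerate a value wrapped in angle brackets and a missing trailing slash.
    base = list_archive.strip().strip("<>")
    if not base.endswith("/"):
        base += "/"
    return base + global_id


print(archive_url("http://mail.python.org/pipermail/mailman-developers/",
                  "RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI"))
```

Since both inputs are available to the delivery queue, the url can go into a message footer before the archiver has processed the message.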

...

Or by using the global id, or by rejecting messages with duplicate
message ids.

...

I see searching, indexing, sorting, and providing other human
readable urls into the message store as a function of the archive.
Once you're looking at a link to the actual message, you're going to
be looking at a url that contains the global id, regardless of the
number of levels you have to go through or redirects involved.

Apologies for letting this thread linger so long. I'm very
interested in hearing your thoughts and if there's general
agreement, I'll write it up in the wiki.

-Barry


Stephen J. Turnbull

1:21 p.m.

Barry Warsaw writes:

...

First, I want to avoid talking about file system layout. To me,
that's an implementation detail we needn't worry about right now.

Agreed.

...

How likely is it that two messages with the same message-id and date are /not/ duplicates?

For message id generators that include a time-stamp in the generated id, approximately the same as the probability that two messages with the same message-id are not duplicates, no?

...

Heck, at that point, I'd feel justified in simply automatically rejecting the duplicate and chucking it from the archive.

I'd rather not go there. There may be applications for the archiver that require that all mail received be filed.

Counterproposal: have a "collisions" namespace, and provide an interface for the list owner to decide what to do with them. They could be thrown away, they could be given an alternative global ID somehow and added (eg, the archive page could add a "See probable duplicates too" link), or they could be put into a moderation-like queue for list admins to decide about.
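
One way to picture the "collisions" namespace is the sketch below. The class name `CollisionQueue`, the return strings, and the choice of (Message-ID, Date) as the key are assumptions for illustration only. Nothing is thrown away here; what happens to the queued entries is left to the list owner, as suggested above.

```python
class CollisionQueue:
    """Keep the first arrival, park later arrivals for review (sketch)."""

    def __init__(self):
        self.messages = {}    # (message_id, date) -> first message seen
        self.collisions = {}  # (message_id, date) -> later arrivals, in order

    def add(self, message_id, date, message):
        key = (message_id, date)
        if key not in self.messages:
            self.messages[key] = message
            return "stored"
        # Same key seen before: file it under the collisions namespace
        # instead of rejecting it, so that all received mail is kept.
        self.collisions.setdefault(key, []).append(message)
        return "collision"


queue = CollisionQueue()
first = queue.add("<a@example.com>", "Wed, 04 Jul 2007 16:49:58 +0900", "draft")
second = queue.add("<a@example.com>", "Wed, 04 Jul 2007 16:49:58 +0900", "final")
```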

...

So now, think of the interface to a message store that supports this
addressing scheme. Well it's something like:

I don't understand how the calling application is supposed to deal with a DuplicateMessageError exception since it should not change either the Message-ID or the Date if present.

I see this as a major problem with any proposal to use only author headers in computing the "global id".

...

Or by using the global id, or by rejecting messages with duplicate
message ids.

Er, the MTA has already accepted it. Do you plan to generate a list manager bounce to the poster? This has the unpleasant misfeature that it could be used to bounce spam off the list manager, since the poster needs to see content to determine whether this is a multiple send or actually the "intended version" after a "fat-finger" send; we already know the message-id isn't good enough.

Barry Warsaw

1:49 p.m.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

On Jul 20, 2007, at 9:21 AM, Stephen J. Turnbull wrote:

...

Good point, though clearly not all message-ids have timestamp
information in them. It does help explain why I see 600-odd more
collisions when taking other data into account too. I've modified my
script to sort collisions and dupes into maildir folders, so I'll
take a closer look when that finishes running (it takes a long time
to slog through all 5 mboxes, even on a fairly zippy dual-G5).

...

True. It would ultimately be an archiver policy though.

...

I like this.

...

Mailman would probably log and ignore DuplicateMessageErrors. It
wouldn't be Mailman's responsibility to ensure the message gets
archived, although I concede that as currently defined, you could end
up with list copies that had a global id header that wasn't unique.
OTOH, if the archiver implements a collision resolution policy such
as a 'collisions' namespace, it wouldn't ever raise
DuplicateMessageError.
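
A minimal sketch of "log and ignore", assuming a store object with the `store_message` method from the interface discussed in this thread. The logger name and the `archive` function are hypothetical, not Mailman code.

```python
import logging

log = logging.getLogger("mailman.archiver")  # hypothetical logger name


class DuplicateMessageError(Exception):
    """Raised by a store that refuses a colliding message."""


def archive(store, message):
    """Try to archive; return True if stored, False if it was a duplicate."""
    try:
        store.store_message(message)
    except DuplicateMessageError:
        # Not the list manager's job to force the message in: note it, move on.
        log.warning("duplicate not archived: %s", message.get("Message-ID"))
        return False
    return True
```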

...

Yes, this wouldn't be an MTA bounce, it would be a Mailman bounce.
But it would have to be subject to the same bounce rules as any other
auto-response which could be used as a spam vector, e.g. limit the
number of bounces per time period and don't include the entire
original message in the bounce (as both can be, and are used as spam
vectors).

-Barry

-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqC9fnEjvBPtnXfVAQLkEQQAhdu0BIvpRvTk92m9J/sbHVRSRxBGMqta Cm57WyRJGBxPV3xTE4ghVzXdDyIEvUjKimRTEWbeX60WqROL6FPsmAnwmsYbW3mw 8hqNXj+SpHP+1GIYnYgY9txiM75fHDa5T0VsjpcXAwtjeepHouXAEWbegBUrIzHt EBp5YCMqxv8= =5tjc -----END PGP SIGNATURE-----

Stephen J. Turnbull

5:21 p.m.

Barry Warsaw writes:

...

But that prevents detecting a prematurely sent message, which is presumably a common use case for genuine collisions.

I just don't think bouncing back is going to be very useful; either you don't give the user the information he needs to figure out what happened, or you give the spammers a vector.

Jeff Breidenbach

6:02 a.m.

...

Message ID's are supposed to be unique. This is discussed in RFC 822: 4.6.1 and RFC 1036: 2.1.5, and probably other places. If that's not the case, the mail transfer agent is broken. I think it's better to go ahead and use the message-id, rather than concoct yet another "this time we mean it!" unique identifier. This is a cost/benefit thing; the cost is some real world collisions, the benefit is a conceptually simpler system. Conceptually simpler things are good, especially when implemented all over the place.

Which brings me to suggestion #2, which is to go ahead and write an RFC on how list servers should embed archival links in messages. This sounds like an internet wide interoperability issue as much as something mailman specific. Why not come up with a scheme usable by all list servers? And also describe a specification third party archival services can comply with. Besides, I've always wanted to help write an RFC. If we go that route, it would be good to get input from a range of people - one person I'd suggest is Earl Hood, author of mhonarc.

Thoughts?

Jeff

While I'm almost tempted to ignore a

...

hit rate that low, if you think of an archive holding 1B messages, you still get a lot of duplicates.

OTOH, the rate goes down even lower if you consider the message-id and date headers. (Note, I did not consider messages missing a date header). How likely is it that two messages with the same message-id and date are /not/ duplicates? Heck, at that point, I'd feel justified in simply automatically rejecting the duplicate and chucking it from the archive.

I spent a /little/ time looking at the physical messages that ended up as true collisions. Though by no means did I look at them all, they all looked related. For example, with strategy 2 some messages look like they'd been inadvertently sent before they were completed. I need to see if there's any similarities in MUA behind these, but again, I think we might be able to safely assume that collisions on message-id+date can be ignored.

That leads me to the following proposal, which is just an elaboration on Stephen's. First, all messages live in the same namespace; they are not divided by target mailing list. Each message has two addresses, one is the Message-ID and one is the base32 of the sha1 hash of the Message-ID + Date. As Stephen proposes, Mailman would add these headers if an incoming message is missing them, and tough luck for the non-list copy. The nice thing is that RFC 2822 requires the Date header and states that Message-ID SHOULD be present.

Why the second address? First, it provides as close to a guaranteed unique identifier as we can expect, and second because it produces a nearly human readable format. For example, Stephen's OP would have a second address of

...
...
...
    >>> mid
    '<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>'
    >>> date
    'Wed, 04 Jul 2007 16:49:58 +0900'
    >>> # XXX perhaps strip off angle brackets
    >>> h = hashlib.sha1(mid)
    >>> h.update(date)
    >>> base64.b32encode(h.digest())
    'RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI'

I like base32 instead of base64 because the more limited alphabet should produce less ambiguous strings in certain fonts and I don't think the short b64 strings are short enough to justify the punctuation characters that would result. While RFC 3548 specifies the b32 alphabet as using uppercase characters, I think any service that accepts b32 ids should be case insensitive. A really Postel-y service could even accept '1' for 'I' and '0' for 'O' just to make it more resilient to human communication errors.
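
For anyone trying this on Python 3, where `hashlib.sha1` takes bytes rather than str, a sketch of the same hash plus the forgiving lookup described above might look like this. The function names and the UTF-8 encoding are assumptions. The '1' to 'I' and '0' to 'O' mapping is safe because the base32 alphabet is A-Z plus 2-7, so neither digit can appear in a real id.

```python
import base64
import hashlib


def global_id(message_id, date):
    """base32(sha1(Message-ID + Date)); both headers passed in as str."""
    h = hashlib.sha1(message_id.encode("utf-8"))
    h.update(date.encode("utf-8"))
    return base64.b32encode(h.digest()).decode("ascii")


def normalize_global_id(text):
    """Accept lower case and the common 1/I and 0/O mix-ups."""
    return text.strip().upper().replace("1", "I").replace("0", "O")


gid = global_id("<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>",
                "Wed, 04 Jul 2007 16:49:58 +0900")
```

A SHA1 digest is 20 bytes, so the encoded id is always 32 characters with no padding.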

I'd like to come up with a good name for this second address, which would suggest the name of the X- header we stash this value in. X-B32-Message-ID isn't very sexy. Maybe X-Message-Global-ID, since I think there's a reasonable argument to make that for well-behaved messages, that's exactly what this is.

So now, think of the interface to a message store that supports this addressing scheme. Well it's something like:

    class MessageStore(Interface):
        def store_message(message):
            """Store the message.

            :raises ValueError: when the message is missing either the
                Message-ID header or a Date header.
            :raises DuplicateMessageError: when a message in the store
                already has a matching Message-ID and Date.  An archive
                is free to raise this exception for duplicate
                Message-IDs alone.
            """

        def get_message_by_global_id(key):
            """Locate and return the message from the store that matches
            key.

            :param key: The Global ID of the message to locate.  This is
                the base32 encoded SHA1 hash of the message's Message-ID
                and Date headers.
            :returns: The message object matching the Global ID, or None
                if there is no such match.
            """

        def get_messages_by_message_id(key):
            """Return the set of messages with a matching Message-ID `key`.

            :param key: The Message-ID of the messages to locate.
            :returns: The set of all messages in this store that have
                the given Message-ID.  If none such matches are found,
                the empty set is returned.
            """
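
To make the interface concrete, here is a toy in-memory store that follows it. This is only a sketch under assumptions: the class name, the dict-based storage, and returning the global id from `store_message` are all invented for illustration, and a real store would need persistence and a collision policy.

```python
import base64
import hashlib
from email.message import Message


class DuplicateMessageError(Exception):
    """A message with the same Message-ID and Date is already stored."""


class InMemoryMessageStore:
    def __init__(self):
        self._by_global_id = {}   # global id -> message
        self._by_message_id = {}  # Message-ID -> set of messages

    @staticmethod
    def global_id(message):
        data = (message["Message-ID"] + message["Date"]).encode("utf-8")
        return base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")

    def store_message(self, message):
        if message["Message-ID"] is None or message["Date"] is None:
            raise ValueError("Message-ID and Date headers are required")
        gid = self.global_id(message)
        if gid in self._by_global_id:
            raise DuplicateMessageError(gid)
        self._by_global_id[gid] = message
        self._by_message_id.setdefault(message["Message-ID"], set()).add(message)
        return gid  # convenience only; not part of the proposed interface

    def get_message_by_global_id(self, key):
        return self._by_global_id.get(key)

    def get_messages_by_message_id(self, key):
        return set(self._by_message_id.get(key, ()))


msg = Message()
msg["Message-ID"] = "<87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp>"
msg["Date"] = "Wed, 04 Jul 2007 16:49:58 +0900"
store = InMemoryMessageStore()
gid = store.store_message(msg)
```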

As far as generating pages based on the Message-ID or global id, I agree with Stephen's proposal. A page returned in response to a message-id request could return the message page or it could return an index of such messages. It would be up to the archive whether it would accept duplicate Message-IDs or not, but it would always be guaranteed that a page returned in response to a global id request would return one email message.

Urls could be calculated by concatenating the List-Archive and X-Global-Message-ID headers, e.g.

http://mail.python.org/pipermail/mailman-developers/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI

would be the OP. This could point to the same resource as

http://mail.python.org/pipermail/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI
http://mail.python.org/pipermail/global/RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI

and /might/ point to the same resource as:

http://mail.python.org/pipermail/mailman-developers/87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp
http://mail.python.org/pipermail/mids/87myycy5eh.fsf@uwakimon.sk.tsukuba.ac.jp

...
A minor drawback to my proposal is that if a message gets archived as a singleton for that Message-ID, then a duplicate arrives, previously created references in the archive will of course now return an index rather than the desired message. I.e., there is data corruption. This can be dealt with in several ways; the easiest would be to provide a "if-you-got-here-by-clicking-a-ref-from-this-archive-you're-looking-for-me" link when creating the directory for multiple instances.

Or by using the global id, or by rejecting messages with duplicate message ids.

...
There's also a *very* minor benefit: repeat sends will be immediately recognizable without checking Message-ID.

Footnotes: [1] By partly human-readable I mean containing list-id and date information. The idea would be to have the date come first, so that users would have a shot at identifying which of several messages is most likely, and this would be searchable by eye with simply an ordinary sorted index.

I see searching, indexing, sorting, and providing other human readable urls into the message store as a function of the archive. Once you're looking at a link to the actual message, you're going to be looking at a url that contains the global id, regardless of the number of levels you have to go through or redirects involved.

Apologies for letting this thread linger so long. I'm very interested in hearing your thoughts and if there's general agreement, I'll write it up in the wiki.

-Barry


Stephen J. Turnbull

6:56 a.m.

Jeff Breidenbach writes:

...

...
Notice that of 325146 total messages, 624 of them had no message-id header. Even if you aggregate dup+col, you're still looking at a total duplicate rate of 0.29%.

Message ID's are supposed to be unique.

Fortunately, a rule more honored in the observance than the breach. Nonetheless, it *is* breached. The Postel Principle applies here, IMO.

...

better to go ahead and use the message-id, rather than concoct yet another "this time we mean it!" unique identifier.

That's not the point. We're not going to impose this on senders; that's what Message-ID is for, as you say. If a sender won't provide a proper Message-ID, third parties who get a CC are just out of luck.

I simply think we should be prepared for applications where relying on the sender to supply a UUID is not acceptable; we need to be able to provide one ourselves. Creating UUIDs is a solved problem, after all. So we just specify a header to put it in, and subscribers will be able to use it, per definition of a canonical URL.

Then we say that an archive SHOULD provide access to the resource via Message-ID if available, and define how to construct that URL from the List-Archive and Message-ID headers.

...

Which brings me to suggestion #2, which is go ahead and write an RFC on how list servers should embed archival links in messages.

I think Barry already suggested that? Anyway, +1. But remember, a standards-track RFC should have a working implementation to point to.

John A. Martin

11:10 a.m.

...

st> Jeff Breidenbach writes:
>> > Notice that of 325146 total messages, 624 of them had no
>> > message-id header.  Even if you aggregate dup+col, you're
>> > still looking at a total duplicate rate of 0.29%.
>>
>> Message ID's are supposed to be unique.

st> Fortunately, a rule more honored in the observance than the
st> breach.  Nonetheless, it *is* breached.  The Postel Principle
st> applies here, IMO.

Taking "be conservative in what you do" as being at least as important as "be liberal in what you accept from others", the devil can quote this scripture to support simplicity in this instance, IMHO.

>> better to go ahead and use the mesage-id, rather than concoct
>> yet another "this time we mean it!" unique identifier.

st> That's not the point.  We're not going to impose this on
st> senders;

I read the quote as meaning "this time we mean it really is unique", imposing nothing on senders.

st> that's what Message-ID is for, as you say.  If a sender won't
st> provide a proper Message-ID, third parties who get a CC are
st> just out of luck.

Right. Maybe that will encourage compliance. The complexity of catering to brokenness in this instance may be too high a price to impose on all.

jam

Stephen J. Turnbull

11:55 a.m.

John A. Martin writes:

...

>> better to go ahead and use the mesage-id, rather than concoct
>> yet another "this time we mean it!" unique identifier.

st> That's not the point.  We're not going to impose this on
st> senders;
I read the quote as meaning "this time we mean it really is unique", imposing nothing on senders.

Ah. If so, my reply is "if you want something done right, do it yourself." *All robust databases assign a unique ID to each record.* Why shouldn't a mailing list archive do so?

...

What complexity? Mailman just does

msg['X-List-Archive-Received-ID'] = Email.msgid()

(or however the message ID generator is spelled). After that, it's up to the archiver whether to do anything with it or not. I proposed a way that it could be used; if that's considered too complex, fine. But simply assigning one is not complex or otherwise very costly.
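
In today's Python standard library the generator is spelled `email.utils.make_msgid()`. A sketch, with the wrapper function `tag_for_archive` being a hypothetical name and the header name taken from the line above:

```python
from email.message import Message
from email.utils import make_msgid


def tag_for_archive(msg):
    """Add a list-assigned unique id unless one is already present."""
    if "X-List-Archive-Received-ID" not in msg:
        msg["X-List-Archive-Received-ID"] = make_msgid()
    return msg


first = tag_for_archive(Message())
second = tag_for_archive(Message())
```

The id is assigned once by the list manager and travels with the list copy; whether the archiver uses it is a separate decision.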

Jeff Breidenbach

4:31 p.m.

There are three different parties coming to the table. One is the mail transfer agent of the sender, another is the list server, and the third is the archive server. Ideally, all three will be happy campers.

...

So we just specify a header to put it in, and subscribers will be able to use it, per definition of a canonical URL.

It is the archive server's job to decide what is the "canonical" URL for a message. There's a good chance these archival URLs will be served by an HTTP redirect. So let's not use the word canonical. :)

...

What complexity? Mailman just does

msg['X-List-Archive-Received-ID'] = Email.msgid()

Easy to introduce, harder to deal with. The archival server would now keep track of both the message-id and the x-list-archive-received-id. That's two namespaces that almost do the same thing. It's easier for the archive server to keep track of one name space than two, and - most importantly - conceptually simpler.

...

From the perspective of the assorted list servers, it's easier to do nothing than to do something. So if they can get by with just message-id (which is already implemented) and not have to add x-list-archive-received-id, that's a smoother implementation path. If we base on message-id, archival servers will be able to retroactively add support for all their stored messages, even those that are ten years old. And users holding an old message will be able to figure out that URL without doing any computational gymnastics.

Put another way, there's the possibility to reduce the archive servers' implementation to "search for this message-id", which is something really useful to have anyway, and therefore likely to get wider support.

In addition, Barry was talking about concocting a unique identifier from the Date field and Message-ID. I'm not a big fan of this idea, because the date field comes from the mail user agent and is often wildly corrupt, e.g. coming from 100 years in the future. Very painful if the archive is showing most recent message first. Therefore an archival server is very likely to determine message date from the most recent received header (generally from a trusted mail transfer agent) rather than the date field. From the archive server's perspective, the best thing to do with the date field is throw it away.
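
As an illustration of taking the date from the Received header, here is a sketch in Python. It assumes the most recent Received header is the first one in the message (each hop prepends its own) and that the timestamp follows the last semicolon; the function name `archive_date` is made up.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def archive_date(msg):
    """Return the timestamp of the newest parseable Received header, or None."""
    for header in msg.get_all("Received") or []:
        _, sep, stamp = header.rpartition(";")
        if not sep:
            continue  # no timestamp part in this header
        try:
            return parsedate_to_datetime(stamp.strip())
        except (TypeError, ValueError):
            continue  # unparseable stamp, try the next hop
    return None


raw = (
    "Received: from a.example by b.example; Tue, 24 Jul 2007 09:31:00 -0700\n"
    "Date: Fri, 24 Jul 2107 12:31:00 -0400\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
when = archive_date(message_from_string(raw))
```

Here the sender's Date claims the year 2107, while the value taken from the Received header is the 2007 one.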

So for these reasons, I'd rather stick with message-id and risk some real world collisions, instead of introduce another identifier. If the list server receives a message with no message-id, by all means create one on the spot. To me, this feels like the sweet spot in terms of cost benefit. The main thing that bugs me is message-ids are long, which makes them awkward to embed in a URL in the footer of a message.

Jeff

Dale Newfield

4:43 p.m.

Jeff Breidenbach wrote:

...

Oh--I was assuming the Date to which he was referring was the current timestamp at which mailman was processing the message. I was going to say that this guarantees uniqueness, but I guess there are parallel mailman implementations where more than one machine/processor are all serving the same list, and then two different machines/processors might wind up with identical timestamps while processing two different messages.

-Dale

Gustav H Meyer

9:30 a.m.

Hi,

I think this is the first time that I'm posting here but hopefully not the last. Thanks to everyone involved for an incredible project. I'm not much of a developer but I like practical solutions and will do everything possible to help improve in this area even if it's just to give some feedback.

I'm very excited about this project and can't wait for the next version to come out with full integration between web forum and mailing list. I like this idea very much and it seems that we're going to see it real soon. :)

On 24/07/2007 18:43, Dale Newfield wrote:

...

I also like the idea of seeing the date somewhere in the URL but IMHO we also need to see a unique sequential number. How about the following idea:

http://my.list.server/archivebase/mylist/200707240001/msg00001/
http://my.list.server/archivebase/mylist/200707250001/msg00002/
http://my.list.server/archivebase/mylist/200707250002/msg00003/

and at the same time allow the following:

http://my.list.server/archivebase/mylist/msg00001/
http://my.list.server/archivebase/mylist/msg00002/
http://my.list.server/archivebase/mylist/msg00003/

This way you can see exactly how many messages were sent on a day and how many messages have been sent since the start.

BTW the sequential number does in my view not have to be a decimal value. Anything short and sweet will do as long as you can work it out and at the same time allow for almost unlimited growth.

Just an idea.

Regards, Gustav H Meyer

Terri Oda

5:11 p.m.

On 24-Jul-07, at 12:31 PM, Jeff Breidenbach wrote:

...

Someone already pointed out that the message ID is a bit long for a
URL, so I'm guessing we're going to want some sort of shorter
sequence number for messages for linking purposes.

Regardless of whether we *need* to generate our own unique ID, I'm
leaning towards the thought that we're going to *want* to generate
our own for usability reasons. In a perfect world, I think we'd have
a sequence number so I could visit
http://example.com/mailman/archives/listname/204.html and know that
205.html would be the next message to that list, but any short unique
id would do if sequence numbers are too much of a pain.

It seems silly to generate nice short links but then use message-id.
If we can generate nice short links, we might as well use 'em
throughout, unless you really think the default use of the archive
will be to search it by messageid (which I sincerely doubt, from my
user experiences).

Terri

Jeff Breidenbach

6:03 p.m.

...

I agree there's a lot of usability benefits from short URLs, but perhaps this is the job of the archive server, and not the list server. Mharc (an archive server) is a great example here. Mharc's canonical message format is pretty human friendly.

http://ww.mhonarc.org/archive/html/mharc-users/2002-08/msg00000.html

Unfortunately, there's no trivial way for the list server to know that human friendly URL when the message is sent out. Fortunately, Mharc also happily handles messages by message-id, which the list server does know about.

http://www.mhonarc.org/archive/cgi-bin/mesg.cgi?a=mharc-users&i=200208010532.g715W0e31774@gator.earlhood.com

Had I been the implementer, I'd probably have made mharc do an HTTP 302 redirect from the longer URL to the shorter URL. But that's beside the point. The point is we have an existing, working, happy archival server, and it would be really nice if list servers (such as mailman) were compatible. And by compatible, I mean offering the capability of embedding an archival URL in the footers of messages.

-Jeff

Barry Warsaw

1:10 p.m.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

On Jul 24, 2007, at 2:03 PM, Jeff Breidenbach wrote:

...

I agree, I just don't think message-ids are user friendly enough to
be this canonical url. Especially in this context, which is exactly
where urls are thrown in users' faces. An archiving service is
exactly the right place for redirecting human readable urls to the
archiver's canonical url (by, I agree, 302).

-Barry

-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqdLznEjvBPtnXfVAQJtxgQAiLp7TjnLoOLnpoxfli2gBo6fdU6ZIFb0 SKiuRgLAoTSdnJymYWOww2U/vTJ3HqR2dZNFCfGeVHgzoHpiX87WiZDJ4Sx1Jec8 7BpIO1ZokGI2NhHiSscYC5k4iCzce17lVGkyVzfYlFysmFKsFjcDIpV8wQFleeG9 TneLaMXT2eY= =1tKI -----END PGP SIGNATURE-----

Stephen J. Turnbull

4:40 p.m.

Barry Warsaw writes:

...

I'm confused (to be precise, you're confusing me). If human readable URLs are exactly right for redirection to the canonical URL, why does the canonical URL need to be user friendly?

A quick remark: the git SCM uses BASE16 SHA1s for object names, but allows you to abbreviate them to the unique prefix. A friendly archive could do the same for your BASE32 ids.
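
The prefix idea is easy to sketch. Assuming the archive can enumerate its known ids (the `known_ids` argument and the function name are hypothetical), a lookup that only accepts an unambiguous prefix could be:

```python
def resolve_prefix(prefix, known_ids):
    """Return the one id starting with prefix, or None if zero or several match."""
    prefix = prefix.upper()
    matches = [gid for gid in known_ids if gid.startswith(prefix)]
    if len(matches) == 1:
        return matches[0]
    return None


ids = ["RXTJ357KFOTJP3NFJA6KMO65X7VQOHJI", "RXAB357KFOTJP3NFJA6KMO65X7VQOHJI"]
hit = resolve_prefix("rxt", ids)
```

An ambiguous prefix such as "RX" returns None here; a friendlier archive might answer with an index of the candidates instead.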

Without going much into implementation, here's how I would write the conformance section for our RFC. The point is that I don't see any need to discuss user-friendliness or the implementation of UUIDs for the RFC! This means that getting those right from the start is not that important.

0. Conformance

0.1 List managers

A conforming list manager MUST provide the List-Archive header
field if the post is being archived.

A conforming list manager MAY provide the List-Archive-UUID header
field.  If so, the value MUST be guaranteed unique, and it MUST be
present in the post as provided to the archiver.  The contents of
this header need not be distinct from the contents of the
Message-ID header, as long as the uniqueness guarantee is
maintained.

0.2 Archives

A conforming archive MUST reserve the namespaces "message-id/" and
"list-post-id/" relative to its base URL for the uses described
below.

A conforming archive MUST support retrieval by Message-ID, using
the namespace "message-id/$(MESSAGE-ID)" relative to its base URL.
The archive specified in the List-Archive header field MUST
support access using the value of that field as its base URL.

A conforming archive SHOULD support retrieval by UUID, using the
namespace "list-post-id/$(LIST-ARCHIVE-UUID)" relative to its base
URL.  If the scheme is "http" or "https", a conforming archive
that does not support retrieval by UUID SHOULD return status 501
NOT IMPLEMENTED with an entity explaining that retrieval by UUID
is not implemented.

A conforming archive MAY support "friendlyurls" for use where
space is constrained (eg, in a post's footer).  A conforming
archive may support any other URIs it wants to, too.<wink>  A
third party SHOULD be able to regenerate a friendlyurl from the
original message contents.

0.3 Software

Conforming archive software SHOULD provide interfaces for
generating UUIDs and friendlyurls, if retrieval is supported.
Conforming list managers SHOULD use these interfaces.

Some comments:

The interfaces for generated URLs should be provided as command line utilities as well as callable functions.

Although the conformance level for friendlyurl support is "may", I expect that essentially all archives will support friendlyurls.

The namespace for UUIDs and friendlyurls should probably be more restricted than "any valid URI".

"List manager" denotes any source of archival content (eg, you could imagine a user storing their outbox in a archive, so that the "list manager" would actually be the user's MUA). The namespaces suggested above are good enough, I think, but there may be better ones.

Instead of 501 NOT IMPLEMENTED, I considered 410 GONE, but that implies a request to delete the reference. Since this is implemented as a header in the post, the archive could be augmented to support it later.

In the phrase "guaranteed unique", "guaranteed" means "to the level provided by uuidgen or standard Message-ID generators".

Generation of friendlyurls or unique ids based on message body content is probably a bad idea.
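
For illustration, the two reserved namespaces from section 0.2 could be turned into URLs as below. This is a sketch, not part of the draft text: the function name, the dict it returns, and the percent-quoting rules are assumptions.

```python
from urllib.parse import quote


def retrieval_urls(list_archive, message_id, uuid=None):
    """Build message-id/ and list-post-id/ URLs relative to List-Archive."""
    base = list_archive.strip("<>").rstrip("/") + "/"
    urls = {
        "message-id": base + "message-id/"
        + quote(message_id.strip("<>"), safe="@"),
    }
    if uuid is not None:
        # Only present when the list manager supplied List-Archive-UUID.
        urls["list-post-id"] = base + "list-post-id/" + quote(uuid, safe="")
    return urls


urls = retrieval_urls(
    "<http://www.mhonarc.org/archive/html/mharc-users/>",
    "<200208010532.g715W0e31774@gator.earlhood.com>",
)
```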

Barry Warsaw

1:06 p.m.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

On Jul 24, 2007, at 1:11 PM, Terri Oda wrote:

...

Yes, definitely. What do you think of the base32 examples I have on
the wiki page?

...

We'd want sequence numbers in the urls if we think people will hand
edit them, say in a browser location bar. I'm not sure that's a
common enough use case.

Pipermail currently uses sequence numbers but there are big problems
with that. First, the mbox'ing algorithm wasn't always correct so
while sequence numbers were accurate when generating the html
archives on the fly, they broke horribly when you try to regenerate
them from an mbox file. It's also why we have tools like cleanarch
which tries to unbreak earlier mboxing bugs by crufty heuristics.
This /might/ be solved by ditching mboxes for maildir or some other
canonical raw archiving format (not a bad idea in its own right), but
manual surgery on the raw archives could still break it. Sometimes
site admins just /have/ to remove messages, disrupting the sequencing.

-Barry

-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqdK2XEjvBPtnXfVAQKfDQP/ToPZ3t7+uIyMrsThOr+PVQ7aKVT/BQ7F OgKqFSDSma4ZofQOkPgr4ZFRT1yKRURWas7jI2zQ8ADPAOKCYh0Udgq6XjpOI8mI 7/pODazVkbwzT9Oo06pGwpzaONK4eZjt1y9IDb9VkniUcAyve5EQ+5+KaG3rbo4M wsrCnHLkvSE= =/z/f -----END PGP SIGNATURE-----

Stephen J. Turnbull

4:56 p.m.

Barry Warsaw writes:

...

Yes, definitely. What do you think of the base32 examples I have on
the wiki page?

They're somewhat better than Message-IDs for readability, but they're not user-friendly.

...

On Jul 24, 2007, at 1:11 PM, Terri Oda wrote:

...
It seems silly to generate nice short links but then use message-id.

The use case for the message-id is not people. It's software, which doesn't much care about "nice short". But the developers debugging and maintaining the software will thank us for the ease of verifying that the URL goes to the right place.

Barry Warsaw

7:11 p.m.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

On Jul 24, 2007, at 12:31 PM, Jeff Breidenbach wrote:

...

True, but an archiver already has to handle collisions on the
Message-ID so in a sense, you have to maintain multiple paths to the
same message, don't you?

So I like my proposal because it imposes nothing additional on the
MUA or MTA, a tiny bit more on the MLM, and some extra work (though I
think not much) on the archiving agent. What you gain from my
proposal over a pure Message-ID approach is guaranteed uniqueness
given the list copy, and human friendlier urls.

...

All these are still true with my proposal, except with the
observation as Stephen points out that given a URL based on
sender-provided headers, you must be prepared to deal with collisions,
so sometimes your resources will return lists. The advantage of adding
a bit of MLM-provided information is that given the list copy you can
guarantee uniqueness, and given the off-list copy you can get to a
resource that contains a link to the message you want.

...

Throw it away or hide it? The former would be a problem, but not the
latter. Does your archiver keep a canonical copy of the message as
you received it? If so, then you preserve the original Date header
enough for the calculation to occur, even if you hide the Date
header, or display a Received header date when you render it to
HTML. That doesn't matter of course.

But I should point out that I'm not married to including the Date
header in the hash. I like it because it appears to reduce
collisions which I care about. But I still like using the base32
sha1 hash instead of the raw Message-ID because I think it's easier
for humans to use, read, speak, and copy. Of course this doesn't
mean that you need to disable your search-by-Message-ID feature!

...

Another advantage for the URL scheme I propose. You know you're
going to end up with URLs of

    len(host-prefix) + 32 + 1 + #digits-in-seqno

where

    32 == len(base32(sha1digest(data)))
    1 == the / divider
    #digits-in-seqno == e.g. len(str(seqno))

You should be able to keep things in the 60-70 character range,
including the host name. That doesn't seem too bad.
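
As a quick check of that arithmetic, here is a sketch that computes the length for a given prefix and sequence number. The function name and the sample prefix are only examples.

```python
import base64
import hashlib


def url_length(host_prefix, seqno):
    """len(host-prefix) + 32 + 1 + number of digits in seqno."""
    # Any input gives the same id length: 20 digest bytes -> 32 base32 chars.
    id_len = len(base64.b32encode(hashlib.sha1(b"example").digest()))
    return len(host_prefix) + id_len + 1 + len(str(seqno))


length = url_length("http://mail.python.org/pipermail/", 12345)
```

With that 33-character prefix and a five-digit sequence number the total is 33 + 32 + 1 + 5 = 71 characters.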

-Barry

-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqZO4HEjvBPtnXfVAQIYGwP/VZPCiQrg9CTeMThApNTh7xUismbW0AiT 1N6a8DusXDBrqiLDQd+v2/R5KOV+TnwDNlIcl5FfFatHxWJ0bGy850kT/nhrHdKU UrW0hR8PWSMIRN5Bqx9bL9cvaMigAoyX+njAfiDgl0yy7arbAm66GH1HNH3c1XGT 1/qaGckINUg= =4uwH -----END PGP SIGNATURE-----

Jeff Breidenbach

4:47 a.m.

...

What you gain from my proposal over a pure Message-ID approach is guaranteed uniqueness given the list copy

Guarantee is a pretty strong word. A malicious person could post two messages with the same message-id, same date, but different bodies. Sometimes the channel between the MLM and the archive server will be SMTP, and spurious messages can be injected. Finally, from the archive server's perspective, some of the MLMs might make mistakes - just like from the MLM's perspective, some of the MTAs might make mistakes in setting message-id. So I don't think the proposed SHA1(date, message-id) scheme buys a hard guarantee of uniqueness. Every component has to protect itself, but none can solve the world's problems.

So that moves us to how many collisions are reduced in practice. I have a question about the numbers Barry mined from the python lists. Are the collisions really that high? One should not count messages without a message-id, because the MLM can and should create one in that case.

One should also not count collisions of messages going to different lists. Here's why. Let's say message M is cross posted to lists L1 and L2. Even though it is the same message, there are now two different contexts. (For example, people visiting M at archive L1 should get a completely different experience when they hit "next message" than people visiting M at archive L2.)

So I'd be curious what the collision numbers come to with these two factors taken into account. The other takeaway is list name really should be part of the URL to get proper context. The earlier example from Mharc does this.

...

and human friendlier urls.

That's a very compelling point.

SHA1 can't be computed inside someone's head or simply cut-n-pasted together for old messages, but I think the usability benefits of short URLs (short enough that they can comfortably fit inside message bodies) outweigh this drawback. By the way, is SHA-1 still in favor? My impression was it was fading away after the Shandong University team partially cracked it.

...

Throw it away or hide [Date]? The former would be a problem, but not the latter.

Thrown away. My favorite archival service is based on mhonarc, and raw mail goes into offline cold storage. Of course this can be changed for the future messages with some pain, but there's no reasonable way for myself (or any other mhonarc users in the same predicament) to retrofit against Date based URLs. For the record, here's what mhonarc embeds in each HTML page it produces because these were considered the important headers. In this message sent from Australia, the date shows a timezone of UTC -0700, because it was pulled from the received header.

So my main request is to double check the numbers, see if using "Date" really buys as much as one thinks. I'll keep digesting the other aspects of the wiki page.

Barry Warsaw

1:34 p.m.


On Jul 25, 2007, at 12:47 AM, Jeff Breidenbach wrote:

...

...
What you gain from my proposal over a pure Message-ID approach is guaranteed uniqueness given the list copy

Guarantee is a pretty strong word. A malicious person could post two messages with the same message-id, same date, but different bodies.

No question, if the archive service and the list server are not
intimately connected, the communication channel between the two can
be subverted. There are ways that channel's trust could be enhanced
though, for example by the list server signing its headers in a DKIM-like fashion.

But in situations where the two are co-located, you can trust these
headers even without that enhancement.

...

So that moves us to how many collisions are reduced in practice. I have a question about the numbers Barry mined from the python lists. Are the collisions really that high? One should not count messages without a message-id, because the MLM can and should create one in that case.

I've uploaded the script I used to here:

http://wiki.list.org/download/attachments/786633/scan.py?version=1

It's probably not perfect, and certainly the python.org mbox's may
not be representative enough of the real world. Please grab the
script, tweak it and run it over your own raw archives; it should be
easily modified to handle any of the mailbox formats supported by
Python 2.5's mailbox module.

If you improve the script or find numbers that lead to different
conclusions, now's the time to know!

...

...
and human friendlier urls.

That's a very compelling point.

SHA1 can't be computed inside someone's head or simply cut-n-pasted together for old messages, but I think the usability benefits of
short URLs (short enough that they can comfortably fit inside message
bodies) outweigh this drawback. By the way, is SHA-1 still in favor? My impression was it was fading away after the Shandong University team partially cracked it.

We're not concerned with the cryptographic security claims of SHA1.
I don't see any economically beneficial attack on the archives
against SHA1 here. I think SHA1 is reasonably universally available,
and marginally better than MD5, so it's probably good enough for this
application.

You're right that no one is going to do SHA1 in their heads, and if
they could, they're probably working for some TLA in a secret gubmit
basement lab somewhere. The point of course is that a /program/
could easily apply the algorithm to a very minimal existing message
and come up with the same canonical url. This enables all kinds of
cool applications based on REST-y principles or whatever. The fact
that the algorithm leads to short(ish), largely unambiguous (to
humans), readable urls is an important benefit -- probably /the/ most
important benefit.

...

...
Throw it away or hide [Date]? The former would be a problem, but not the latter.

Thrown away.

Really? Wow. I'd have thought every archiving service would want to
keep a record of the raw message it received on the wire. That would
allow it to regenerate the html archive if necessary, provide useful
forensics, and allow for exactly the kind of data mining we're doing
here. I can't see /any/ reason for not saving the raw messages in
their entirety, especially for a public list. Maybe for a private
one, where your data retention policies require you delete things
after a certain amount of time, but even there, I can't see why you'd
want to trim raw messages rather than just chucking them entirely.

...

My favorite archival service is based on mhonarc, and raw mail goes into offline cold storage.

What's the advantage of that? Isn't disk space cheap as dirt?
Probably cheaper if you've bought any topsoil recently :). Still,
the raw messages are still available right? So if there was enough
value in calculating the canonical urls so that the archive service
could be seen as an interoperability good citizen, then it could be
done.

I'll just reiterate that I'm not married to including the Date header
in the algorithm. Until proven otherwise by more research, I think
it's a good idea to use because 1) it's required by RFC 2822 and 2)
it seems to reduce collisions. I think the algorithm I propose would
work just as well with Message-IDs alone, although there's more of a
chance that the non-sequence numbered url will return multiple matches.

-Barry


Jeff Breidenbach

6:23 a.m.

...

If you improve the script or find numbers that lead to different conclusions, now's the time to know!

Live and learn!

So I just looked at 2 million raw messages from 2007, spread over a few thousand mailing lists (all data is from mail-archive.com). My first question was - when comparing only with messages from the same list - how many times do I see a repeated message-id? The answer was ... drumroll please ... 260 thousand. What the hell?

Time for a closer look. In some cases, the archiver was getting two copies of every message. For example, the MLM (mailman) was sending out a message to subscriber A and subscriber B, and both paths eventually lead to the archiver.

In another case, the MLM (YahooGroups) spammed 20 copies of the same message to every subscriber, and modified the body of each one. YahooGroups tends to create HTML mail and stick ads, possibly spyware, and who knows what other crap in message footers.

There are probably other categories I haven't noticed yet; 260k messages is a lot of checking. So you'd think the archives would be a complete mess. But they aren't and I had no idea anything was remotely amiss under the hood. That's because mhonarc only archives one message per message-id. So those 19 repeats from YahooGroups get thrown away. This is actually a pretty robust strategy when you think about it; it keeps lots of annoyances out of archives and everyone who gets smited deserves it: accidental duplicates, malicious duplicates, broken mail transfer agents. Reasonable people can disagree, but I like it.
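
The "one message per message-id" strategy described above can be sketched in a few lines of Python. This is only an illustration of the idea, not mhonarc's actual code; the function name and the `(message_id, body)` tuple layout are hypothetical.

```python
def keep_first_per_message_id(messages):
    """Keep only the first message seen for each Message-ID.

    `messages` is an iterable of (message_id, body) tuples; this
    layout is an assumption made for the example.
    """
    seen = set()
    kept = []
    for message_id, body in messages:
        if message_id in seen:
            # A later copy with the same Message-ID is dropped, even if
            # its body differs (as with the modified YahooGroups copies).
            continue
        seen.add(message_id)
        kept.append((message_id, body))
    return kept


incoming = [
    ("<a@example.org>", "original body"),
    ("<a@example.org>", "original body plus an ad footer"),
    ("<b@example.org>", "another message"),
]
print(keep_first_per_message_id(incoming))
```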

So I'm amending my request. If mailman and pipermail++ want to keep a verbatim record of everything passing through the MLM, fine. But please make it also possible to interoperate with archivers that use the looser mhonarc strategy, e.g. allow the interoperability URL to collide when message-ids collide. Currently Stephen's proposal allows this, Barry's does not.

Just to make things really concrete, here's an example from that YahooGroups collision I was describing. The 20 messages spammed to subscribers would all have an interoperability URL something like this (but perhaps not quite so enormously long) embedded in the message, in both headers and possibly a footer.

http://www.mail-archive.com/search?l=estika%40yahoogroups.com&q=3578.125.161.129.196.1175036508.CBNWebMail%40webmail1.cbn.net.id

Clicking on it, the user goes to the archive server. For this particular archiver, an HTTP 302 redirect takes the user to another URL which happens to be more human friendly. But the details of what alternate URLs are available - if any - is really up to the archive server.

http://www.mail-archive.com/estika@yahoogroups.com/msg01341.html

I think that's about it. I do kind of like Stephen's suggestion of allowing the archiver to supply a formula for the interoperability URL; if that's the case I'd say the RFC 2369 headers could be fair game for use in the calculation. That allows cross posted messages to easily link to their correct archive - note how I used the contents of List-Post when creating the interoperability URL above.
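
A minimal Python sketch of how such a URL could be assembled from the List-Post and Message-ID headers follows. The `l` and `q` parameter names and the search endpoint come from the example URL above; the helper names and the exact header parsing are assumptions for illustration, not mail-archive.com's implementation.

```python
from urllib.parse import urlencode


def list_address(list_post: str) -> str:
    """Extract the posting address from an RFC 2369 List-Post value."""
    value = list_post.strip()
    if value.startswith("<") and value.endswith(">"):
        value = value[1:-1]
    if value.lower().startswith("mailto:"):
        value = value[len("mailto:"):]
    return value


def interoperability_url(search_base: str, list_post: str, message_id: str) -> str:
    """Build a search-style URL from the list address and the Message-ID."""
    msg_id = message_id.strip()
    if msg_id.startswith("<") and msg_id.endswith(">"):
        msg_id = msg_id[1:-1]  # drop the angle brackets around the msg-id
    query = urlencode({"l": list_address(list_post), "q": msg_id})
    return search_base + "?" + query


print(interoperability_url(
    "http://www.mail-archive.com/search",
    "<mailto:estika@yahoogroups.com>",
    "<3578.125.161.129.196.1175036508.CBNWebMail@webmail1.cbn.net.id>",
))
```

The archive server is then free to answer that URL with a redirect to whatever friendlier URL it prefers.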

Jeff

Dale Newfield

7:37 a.m.

Jeff Breidenbach wrote:

...

I think the question you were originally going to ask got sidetracked. If we assume that all these "multiple paths from list to archive" duplicates not only share a Message-ID but also a Date (they were the same message originally, so they should!), then both schemes (messageid, and messageid+date) would decide that all (but one of) these messages are redundant.

What we really want to know is how many (non-empty) Message-ID collisions are there that *don't* share a Date? This is the number of messages that only-messageid loses, and that the composite identifier method would not lose.
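
The quantity asked for here can be sketched in Python as the number of distinct (Message-ID, Date) pairs minus the number of distinct Message-IDs, ignoring messages with an empty Message-ID. The function names are hypothetical, and reading an mbox file through the standard `mailbox` module is just one possible source of the pairs.

```python
import mailbox


def lost_by_message_id_only(pairs):
    """Count messages a Message-ID-only key drops but ID+Date keeps.

    `pairs` is an iterable of (message_id, date) strings. Messages with
    an empty Message-ID are skipped, as suggested in the thread.
    """
    ids = set()
    composite = set()
    for message_id, date in pairs:
        if not message_id:
            continue
        ids.add(message_id)
        composite.add((message_id, date))
    return len(composite) - len(ids)


def pairs_from_mbox(path):
    """Yield (Message-ID, Date) for each message in an mbox file."""
    for msg in mailbox.mbox(path, create=False):
        yield (str(msg.get("Message-ID", "")).strip(),
               str(msg.get("Date", "")).strip())


sample = [
    ("<a@example.org>", "date 1"),
    ("<a@example.org>", "date 1"),  # same ID and Date: redundant in both schemes
    ("<a@example.org>", "date 2"),  # same ID, different Date: lost by ID-only
    ("", "date 3"),                 # no Message-ID: not counted
]
print(lost_by_message_id_only(sample))
```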

-Dale

Jeff Breidenbach

August 2007

2:17 a.m.

...

It took longer than expected, but I now have numbers from looking at 2,151,896 messages spread over a few thousand lists. The appended script was run over a set of MH format raw messages.

704 messages fall into this category. Of these, 596 come from a single (malfunctioning and duplicate spewing) list server. I have not yet examined the remaining 208 messages, but I'll bet anything many also have duplicate message bodies. Or are spam. So for this data set, we have an upper bound of 0.01% messages in this category, possibly significantly less.

Jeff

#!/bin/bash
#
# Look for messages that
#
#   Do collide with message-id
#   Don't collide with message-id + date

DIR=/home/archive/Mail

C1=0
C2=0

get_ineresting_messages() {
    cd $DIR/$1
    for j in $(ls -U); do
        MSG_ID=$(cat $j | 822field message-id)
        MSG_DATE=$(cat $j | 822field date)
        if [ "$MSG_ID" != "" ]; then
            echo $MSG_DATE "|" $MSG_ID
        fi
    done |
    sort |
    uniq --separator='|' --skip-fields=1 --all-repeated |
    uniq --uniq
}

for i in $(ls $DIR | grep @); do
    DUP=$(get_ineresting_messages $i)
    DUP_CNT=$(echo -n "$DUP" | wc -l)
    MSG_CNT=$(cd $DIR/$i && ls -U | wc -w)
    C1=$(( C1 + MSG_CNT ))
    C2=$(( C2 + DUP_CNT ))
    if [ $DUP_CNT != 0 ]; then
        echo
        echo "=== collisions/messages: $C2/$C1 $i"
        echo "$DUP"
    else
        echo -n . 1>&2
    fi
done

...

Jeff Breidenbach

2:20 a.m.

...

Correction.

... remaining 108 ... 0.005% ...

Jeff Breidenbach

4:44 a.m.

...

I took a look at a larger dataset, 5.85 million messages from several thousand lists. Of the messages that share message-id but not date, most come from a small number of web-based services.

875 come from forums.slimdevices.com
378 come from lists.openplans.org
265 come from nabble.com
164 come from egroups.com
135 come from yahoo.com
166 come from elsewhere

That's 0.03% if you count all the messages. It is 0.008% if you discard the top three offenders, all of which I have contacted. I didn't try contacting Yahoo/eGroups because in my past experience, talking to a brick wall is easier. I have not analyzed how many of these messages are spam or have duplicate bodies, which further discounts the percentages.
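
For anyone who wants to check the arithmetic, here is a small Python sketch using only the counts quoted above. It assumes "5.85 million" means 5,850,000 messages and that the "top three offenders" are the three largest sources in the list.

```python
total_messages = 5_850_000  # "5.85 million messages" (rounded figure)

collisions_by_source = {
    "forums.slimdevices.com": 875,
    "lists.openplans.org": 378,
    "nabble.com": 265,
    "egroups.com": 164,
    "yahoo.com": 135,
    "elsewhere": 166,
}

all_collisions = sum(collisions_by_source.values())

# Drop the three largest sources (the "top three offenders").
top_three = sorted(collisions_by_source.values(), reverse=True)[:3]
remaining = all_collisions - sum(top_three)

pct_all = 100 * all_collisions / total_messages
pct_remaining = 100 * remaining / total_messages

print(f"{all_collisions} collisions = {pct_all:.3f}% of all messages")
print(f"{remaining} collisions = {pct_remaining:.3f}% without the top three")
```

This works out to 1983 and 465 messages respectively, which matches the roughly 0.03% and 0.008% figures.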

Hope this data helps.

Dale Newfield

5:04 a.m.

Jeff Breidenbach wrote:

...

5.85 million messages

...

That's 0.03% if you count all the messages. It is 0.008% if you discard the top three offenders, all of which I have contacted.

I'd say that's a strong argument for just using the Message-ID and simplifying this tremendously...

...Barry, do you disagree?

(It can still be a base32-encoded SHA hash to make it less user hostile.) http://wiki.list.org/display/DEV/Stable+URLs

-Dale

Barry Warsaw

October 2007

2:47 a.m.


On Aug 8, 2007, at 1:04 AM, Dale Newfield wrote:

...

No, I'm convinced. Apologies for taking so long to respond. The
code in the Mailman 3.0 branch has been updated to use only the
Message-ID. I still think the base32-encoded sha1 hash is a good
user-friendlier option, and of course archivers should accept
either.

One question: should the angle brackets on the Message-ID be part of
the hash or not? I think they should, or IOW, the entire value of
the Message-ID header is taken as the hash, though they should be
stripped off if using the Message-ID in any kind of archive query.
I'm open to suggestions though... comments?

...

The wiki is down at the moment (I have an issue opened on the support
tracker about that). When it comes up, I'll update the page.

Thanks everyone for a very good thread, and especially for Jeff for
doing the analysis on real data.

-Barry


Jeff Breidenbach

6:48 a.m.

Question: what about crossposted messages?

Let's say a message gets sent to a list called mailman-developers with a CC to a list called pet-bunnies. Hypothetically, of course. Presumably, the person who got the message from pet-bunnies should probably end up at the pet-bunnies archive, where the message can be viewed in proper context; right before the processed carrots flamewar and after the manifesto on proper hopping technique. To make that work, I think we need some way to - at least optionally - allow one or more of the RFC 2369 headers to influence the archival URL. Reading the wiki, I guess that's where List-Archive comes into play?

My other question is about the angle brackets. Barry, why are you inclined to include them in calculations? It's kind of arbitrary, but quoting RFC 2822, end of section 3.6.4:

Semantically, the angle bracket characters are not part of the msg-id; the msg-id is what is contained between the two angle bracket characters.

Ian Eiloart

9:53 a.m.

--On 2 October 2007 22:47:35 -0400 Barry Warsaw <barry@list.org> wrote:

...

Mathematically, the two solutions are equivalent for valid headers, aren't they? OK, the hashes will be different, but only in a trivial sense.

Technically, I imagine, it's going to be easier to handle bogus headers if you just hash the entire header. For example, what do you do if some piece of crapware gives you a message with a header missing the angle brackets? Or that adds something outside angle brackets? Or that includes a right-angle bracket in the message-id itself?

You don't have to think about any of those situations if you either (A) reject the message or (B) encode the entire header.
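
The two options can be contrasted in a short Python sketch. Hashing the whole header value never has to decide what a malformed header means, while hashing only the msg-id between the angle brackets needs a rule for rejecting bad input. The function names and the strict regular expression are assumptions for illustration, not Mailman's code.

```python
import base64
import hashlib
import re


def _b32_sha1(text: str) -> str:
    """Return the base32-encoded SHA-1 digest of `text`."""
    return base64.b32encode(hashlib.sha1(text.encode("utf-8")).digest()).decode("ascii")


def hash_whole_header(value: str) -> str:
    """Option B: hash the entire header value, brackets and all."""
    return _b32_sha1(value.strip())


# Exactly one "<...>" with no stray angle brackets inside or outside.
_STRICT_MSG_ID = re.compile(r"<([^<>]+)>")


def hash_inner_msg_id(value: str) -> str:
    """Hash only the msg-id between the brackets; reject anything else."""
    match = _STRICT_MSG_ID.fullmatch(value.strip())
    if match is None:
        raise ValueError("malformed Message-ID header: %r" % value)
    return _b32_sha1(match.group(1))


for header in ["<ok@example.org>", "missing-brackets@example.org",
               "junk <id@example.org>", "<bad>id@example.org>"]:
    try:
        inner = hash_inner_msg_id(header)
    except ValueError:
        inner = "rejected"
    print(header, hash_whole_header(header), inner)
```

For well-formed headers the two hashes differ only in their input, so the choice mostly matters for how malformed headers are treated.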

-- Ian Eiloart IT Services, University of Sussex x3148

Jason Fesler

July 2007

2:29 p.m.

...

Guarantee is a pretty strong word. A malicious person could post two messages with the same message-id, same date, but different bodies.

This is my concern too. Especially since this is known information; it is trivial to be malicious. Whatever was done, I think would *have* to deal with 'dupes', in some form or another.

Stephen J. Turnbull

3:04 a.m.

Jeff Breidenbach writes:

...

...
So we just specify a header to put it in, and subscribers will be able to use it, per definition of a canonical URL.

It is the archive server's job to decide what is the "canonical" URL for a message. There's a good chance these archival URLs will be served by an HTTP redirect. So let's not use the word canonical. :)

If it's not going to be "canonical" (I forget if there's a standard for that word :), what is the point in writing an RFC?

...

...
What complexity? Mailman just does

msg['X-List-Archive-Received-ID'] = Email.msgid()

Easy to introduce, harder to deal with. The archival server would now keep track of both the message-id and the x-list-archive-received-id. That's two namespaces that almost do the same thing.

The implementations are similar, and there is "nearly" a one-to-one correspondence. But the semantics are very different. Message-ID is untrustworthy, the internal ID is trustworthy.

...

So for these reasons, I'd rather stick with message-id and risk some real world collisions, instead of introduce another identifier.

Go ahead and stick with message-id if *you* like, but please don't tell *me* what risks I have to accept.

There needs to be a way to *enforce* uniqueness, and it *must* be specified by the RFC in order for archive implementations to be interoperable. Note that word "specify"; I do not insist that this level of robustness be *required*. But if we don't specify it now, people who want such robustness will have to do all this work again, and possibly will end up with something that some servers conforming to "your" RFC will not conform to.

It is possible that most archivers will simply use the message ID, and do something brutal in the rare case of a collision. That's fine. But an archiver that wants to provide a canonical URL which is guaranteed to uniquely and losslessly identify a post in its archive should have a standard way to do that.

...

The main thing that bugs me is message-ids are long, which makes them awkward to embed in a URL in the footer of a message.

The footer URL is of no concern in this discussion. There is not going to be a requirement that footer URLs be "canonical", not if I have any say in the matter. The "canonical" URL will be in (or be constructed from) the message header.

Barry Warsaw

1:17 p.m.


On Jul 24, 2007, at 11:04 PM, Stephen J. Turnbull wrote:

...

I completely agree. Maybe "interoperable" is the right word to use.
Or "user friendly interoperable archive url" which is really what
we're trying to define here (IMO).

...

Yep.

...

Yep.

...

Agreed in the sense that the RFC 2822 headers must contain all the
information necessary to construct the canonical url (or must contain
the canonical url). A list server /can/ decorate the message with
the url in other ways, but that certainly isn't necessary.

You might even imagine a mail reader extension that read the
appropriate List-* headers and added a button "View In Archive" which
sent the canonical url to your web browser. Once that happens, the
archive service is free to redirect to its heart's content. I submit
though that any good archive service (and certainly Pipermail++ if I
can help it) will ensure that those urls are stable forever,
otherwise people will stop relying on it.

-Barry


Barry Warsaw

6:53 p.m.


On Jul 24, 2007, at 2:56 AM, Stephen J. Turnbull wrote:

...

I think there's two approaches we could argue for. One is for the
mailing list manager to craft a UUID out of whole cloth and stick
that in a header. Then any downstream archiver would be obliged to
use that header value as the canonical address of the message, with
an alternative path to the message via the Message-ID (possibly
returning a list of matching messages when there are collisions).

The second approach, and the one that I favor, is to use the Message-ID
(and the Date) header on the original message as the UUID,
properly handling corner cases like duplicate headers or missing
headers. This UUID serves as the basis for the address to the
message resource just like above.

I like the second approach better because in the case where you start
with an off-list copy of the message, you have a decent enough chance
of getting to the archived message, or at least to a resource
containing a link to the message. The first alternative would
require access to the list copy.

Imagine if every archiver supported my proposal, knowing just the
Message-ID and Date header, you could get to that message from almost
anywhere, just by using the UUID as a relative URL rooted at say
http://www.mail-archive.com, http://groups.google.com, http://mail.python.org/pipermail, or whatever. That would be pretty neat.

-Barry


Barry Warsaw

6:44 p.m.


On Jul 24, 2007, at 2:02 AM, Jeff Breidenbach wrote:

...

I've always thought that an RFC-like spec that describes how a
generic mailing list manager would interoperate with a generic
archiving service is the way to go. I've written up a somewhat more
formal spec of what I've currently implemented in MM3 here:

http://wiki.list.org/display/DEV/Stable+URLs

If this looks good, I'd be happy to approach some of the related
communities to try to get buy-in.

-Barry


John Dennis

4:09 p.m.

On Tue, 2007-07-03 at 20:05 -0400, Barry Warsaw wrote:

...

A little over a year ago I went on a search to find the best open source archiver and at that time I came up with Lurker (http://lurker.sourceforge.net) Since then I believe Lurker has seen a major new revision. I also believe Lurker is the archiver used by Debian.

So if you want to leverage existing open source archiving, or at least look at an example of what would be necessary to allow easy external archiving integration with Mailman, you might want to look at Lurker.

John Dennis <jdennis@redhat.com>

Terri Oda

5:02 p.m.

On 5-Jul-07, at 12:09 PM, John Dennis wrote:

...

I was hoping someone would post that link! Lurker was best of breed
last time I was looking, and I'd definitely like to see what we can
leverage there.

Terri

Barry Warsaw

12:39 p.m.


On Jul 5, 2007, at 12:09 PM, John Dennis wrote:

...

I've looked at a few lurker archivers and I wasn't blown away by its
user interface. That's apparently highly configurable though.

Lurker's GPL2 so that's fine. I'd be quite hesitant about shipping
Mailman with Lurker because it's something we don't control and it's
not Python. But I would be totally open to working with the Lurker
developers on creating an easy bridge between the two systems.
Perhaps this dovetails with Jeff's suggestion of easier integration
with external archiving systems.

Does anybody have contacts with the Lurker community that could cross-post a new thread to get the discussion going?

(The same goes for any other archiver out there too.)

-Barry


Nigel Metheringham

1:17 p.m.

On 20 Jul 2007, at 13:39, Barry Warsaw wrote:

...

I've looked at a few lurker archivers and I wasn't blown away by its user interface. That's apparently highly configurable though.

I'd be inclined to agree wrt user interface. Documentation regarding this, and anything else to do with lurker, appears somewhat scarce - speaking as someone who has just migrated the exim.org lists to using lurker archiving. [previously we used mailman with the MHonArc/pipermail hybrid]

I am considering starting a set of pages within our wiki about use of lurker (we tend to cover almost everything else about mail so why not that).

...

Lurker's GPL2 so that's fine. I'd be quite hesitant about shipping Mailman with Lurker because it's something we don't control and it's not Python. But I would be totally open to working with the Lurker developers on creating an easy bridge between the two systems. Perhaps this dovetails with Jeff's suggestion of easier integration with external archiving systems.

Integration with externals feels like a good way to go.

...

Does anybody have contacts with the Lurker community that could cross-post a new thread to get the discussion going?

The ML appears... lacking in vigor..

BTW lurker gives all messages an ID which is 3 parts separated by periods. The first part is a date field - ie 20070720, the second part is the receive time, UTC, as 6 digits, and the final part is some form of hex id. The nice part is if you quote just the first (or first 2) parts of message ID you get messages around that time...
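
To make the three-part layout concrete, here is a hedged Python sketch that parses an ID of the shape described above (date, 6-digit UTC receive time, hex part) and selects IDs by prefix. The sample IDs and function names are made up for illustration, and nothing here is taken from lurker's source.

```python
from datetime import datetime, timezone


def parse_lurker_style_id(msg_id: str):
    """Split 'YYYYMMDD.HHMMSS.hex' into a UTC datetime and the hex part."""
    date_part, time_part, hex_part = msg_id.split(".")
    received = datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")
    received = received.replace(tzinfo=timezone.utc)
    int(hex_part, 16)  # raises ValueError if the last part is not hex
    return received, hex_part


def ids_with_prefix(ids, prefix: str):
    """Return the IDs starting with a date (or date.time) prefix."""
    return [i for i in ids if i.startswith(prefix)]


# Hypothetical IDs in the described format.
ids = [
    "20070720.131700.0a1b2c3d",
    "20070720.143800.deadbeef",
    "20070721.090000.12345678",
]
print(parse_lurker_style_id(ids[0]))
print(ids_with_prefix(ids, "20070720"))
```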

Nigel.

-- [ Nigel Metheringham Nigel.Metheringham@InTechnology.co.uk ] [ - Comments in this message are my own and not ITO opinion/policy - ]

Barry Warsaw

2:26 p.m.


On Jul 20, 2007, at 9:17 AM, Nigel Metheringham wrote:

...

I noticed that! There's no documentation link on the site. I also
saw your question regarding getting a message out of lurker given its
message-id. When I checked yesterday I didn't see a response.

...

That would be cool. Feel free to add a link to your pages on the
Mailman wiki, perhaps here:

http://wiki.list.org/display/DOC/Home

...

Obviously Mailman can't know the second and third parts so it can't
use them in its list copies. I dislike using YYYYMMDD because of the
high number of collisions.

I should make clear that what I'm really proposing is not specific to
Mailman or any particular archiver. It's really an interface to a
generic message store. We succeed by convincing other mailing list
software and archivers to adopt the same standard so that they can
interoperate seamlessly. We can perhaps have the first
implementations of this de facto standard (any latent RFC shepherds
out there? :). We get everyone else to adopt it when we take over
the world.

-Barry


Nigel Metheringham

2:38 p.m.

On 20 Jul 2007, at 15:26, Barry Warsaw wrote:

...

It's used as part of a UID, but has the nice feature of allowing easy queries as to other messages at that time.

If the archiver is local you also have the information for part 2 of the UID - lurker takes it from the From_ line.

Nigel.

[ Nigel Metheringham Nigel.Metheringham@InTechnology.co.uk ] [ - Comments in this message are my own and not ITO opinion/policy - ]

Barry Warsaw

2:52 p.m.


Hi Nigel,

On Jul 20, 2007, at 10:38 AM, Nigel Metheringham wrote:

...

That should definitely be a way to traverse to the message, but it's
not the message's global id (a.k.a. canonical address relative to the
base url of the message store). An archiver could provide other ways
to traverse to the message, such as:

/barry@python.org/ to see all messages by me
/barry@python.org/mailman-developers/20070720 to see all messages by me today to this mailing list
/Subject?Improving%20the%20archives&sort=thread to find all the messages in this thread regardless of when they were posted

etc.

...

Mailman gets the From_ line before passing off to the archiver. But
that's interesting, does lurker /require/ the From_ line?

-Barry


Nigel Metheringham

2:59 p.m.

On 20 Jul 2007, at 15:52, Barry Warsaw wrote:

...

Mailman gets the From_ line before passing off to the archiver.
But that's interesting, does lurker /require/ the From_ line?

Well lurker handles Maildir - no From_ but the same info is in the
filename, and it can take messages on stdin without a From_ - at
which point I guess it's either faking it (from the headers) or making
things up.

Nigel.

-- [ Nigel Metheringham Nigel.Metheringham@InTechnology.co.uk ] [ - Comments in this message are my own and not ITO opinion/policy - ]

Barry Warsaw

3:16 p.m.


On Jul 20, 2007, at 10:59 AM, Nigel Metheringham wrote:

...

Cool. I wonder if lurker is compatible with Python 2.5's
mailbox.Maildir implementation and whether the two could share the
maildirs. Thanks for the information!

-Barry


A.M. Kuchling

9:16 p.m.

On Fri, Jul 20, 2007 at 11:16:19AM -0400, Barry Warsaw wrote:

...

It had better be -- Maildir has a published specification. If there's an incompatibility, that would be a bug in either mailbox.py or lurker.

--amk

Terri Oda

4:33 p.m.

On 20-Jul-07, at 8:39 AM, Barry Warsaw wrote:

...

I've looked at a few lurker archivers and I wasn't blown away by its user interface. That's apparently highly configurable though.

I've been doing a lot of thinking about interface, and I'm coming to
the conclusion that something more like a web bulletin board is
probably the way to go, given that people use them all the time
without much trouble and with a fairly minimal amount of whining. ;)
I'm trying to use interfaces to things like comment systems (which
are often threaded -- picture the slashdot stuff, maybe?) and popular
boards like phpbb (which isn't threaded beyond separate topics) as
guides to how people usually deal with conversations on the web.

It'd actually be fairly easy, at that point, to just put a posting
interface into the archives (yes, you'd have to be logged in, and
yes, this means your password becomes that bit more valuable because
someone having it can pose as you to the list... but they could do
that by spoofing your email address so I'm not too concerned). But
then people who don't like email or just want to pop by and check the
list quickly could actually use mailman like a web board, which is
something I'm pretty sure would get used (I know my users have asked
for it in the past).

I've been drafting simple prototype interfaces in my head, trying to
keep potential architectures in mind. I'm hoping I'll have time this
week to code up some HTML and see how well they actually work when
they're not just inside my head. :)

Terri

Dale Newfield

9:17 p.m.

Terri Oda wrote:

...

For public lists, the answer may lie in external tools like nabble.com or mailinglistarchive.com

Of course, that doesn't help for lists wishing to keep their content private.

-Dale

Barry Warsaw

6:41 p.m.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

On Jul 22, 2007, at 12:33 PM, Terri Oda wrote:

...

I like this for several reasons. I've long wanted a bridge between
the traditional mailing list and a forum because to me they're
related along a spectrum of emotional investment.

What I mean is this. For the subjects and projects I care deeply
about, I join the mailing list. I want to be intimately involved in
the day-to-day collaboration that being subscribed gives me. I care
enough about that that I'm willing to put up with the pain that comes
along with mailing lists, such as the overhead for subscribing,
deleting topics I don't care about, the occasional spam, the overhead
of going on vacation or leaving the list, etc.

But there are even more topics or projects that I have only a
fleeting interest in. Say I find a bug in some X program, or wake up
and decide to learn how to use setuptools, or find that some recent
update broke my Linux server. In all those cases, I might want to
start a thread of discussion or ask a question, and be very involved
in that thread for a week or two. Then, my interest wanes, or I get
my question answered, or other projects pique my interest. Mailing
lists are pretty bad at managing those kinds of fleeting involvement,
but forums are quite nice. There's usually fairly low overhead (and
probably even less if OpenID and such were in widespread adoption)
for joining, and when I lose interest the forum doesn't fill up my
inbox. OTOH, forums seem good for short 'instant' messages, but not
so good (IMO) for free ranging, detailed discussions. So there's a
spectrum.

...

Heck, /I'd/ use it, so what more justification do we need? :)

...

I'd love to see the prototypes once you've committed them to HTML.
The one important thing is that the individual postings will need the
equivalent of a stable archive URL (i.e. permalink) that could be
passed around, added to web pages, etc.

-Barry

-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqZH43EjvBPtnXfVAQLrzQP8CG5ALhX+Wk91I+jri20R60C7cqtCzQby V9MD8FlhC/7LbRW3QXwJnwWSpXCnBYhShxmRMn2maEeIXqPUEBl3QOcUYkHxeRZG zV6sKE1J1EZfbUTY7CM3lcnOZKHB1n07PGslcxQsJHEmnbuHbR7bm+2AV2CknzZj 8Y/9XxPjX5Q= =IRq2 -----END PGP SIGNATURE-----

Paul Wise

5:06 a.m.

On 7/3/07, Terri Oda <terri@zone12.com> wrote:

...

At lists.indymedia.org, we use a patch that provides these:

stable URLs based on a generated message id
URLs to the archived message in the message headers
message hiding

http://lists.indymedia.org/patches/imc-10-mmid_hide_posts.patch

It poses a bit of a migration issue since all the existing mboxes may or may not have the mmid header in them. We worked around that by having a special place for the old archives.
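To make the idea concrete, here is a minimal sketch of one way to get stable URLs and an archive link in the headers. It is not the indymedia patch: it assumes the identifier is derived by hashing the Message-ID header, and the base URL, the X-Archive-URL header name and the function names are all hypothetical.

```python
import base64
import hashlib
from email.message import EmailMessage

# Hypothetical archive location, for illustration only.
BASE_URL = "https://lists.example.org/archive"


def stable_id(msg):
    """Derive an identifier that depends only on the Message-ID header."""
    raw = msg["Message-ID"]
    if raw is None:
        raise ValueError("message has no Message-ID header")
    message_id = raw.strip().strip("<>")
    digest = hashlib.sha1(message_id.encode("utf-8")).digest()
    # 20 digest bytes encode to exactly 32 base32 characters, no padding.
    return base64.b32encode(digest).decode("ascii")


def add_archive_header(msg, list_name):
    """Record the message's archive URL in a (hypothetical) header."""
    url = f"{BASE_URL}/{list_name}/{stable_id(msg)}"
    del msg["X-Archive-URL"]  # no-op if absent; avoids duplicate headers
    msg["X-Archive-URL"] = url
    return url


# The same Message-ID always maps to the same URL, so rebuilding the
# archive from the mbox cannot renumber or break existing links.
first = EmailMessage()
first["Message-ID"] = "<abc123@example.com>"
second = EmailMessage()
second["Message-ID"] = "<abc123@example.com>"
url_first = add_archive_header(first, "mylist")
url_second = add_archive_header(second, "mylist")
```

Because the identifier can be recomputed from a header every message already carries, older mbox entries without a stored id would not need a separate location under this scheme, although messages with missing or duplicated Message-IDs would still need special handling.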

We've been meaning to move to lurker for years, but haven't had the human resources and also there were some showstoppers:

public/private lists - lurker couldn't do that properly when we looked
lack of date-based index to the archives
general navigation issues; stuff like linking between current thread and nearby ones
mailto links (has now been fixed)
the migration nightmare

My personal opinion is that pipermail should be removed and mailman should not contain a default archiver since there are plenty of good archivers already (lurker, mhonarc etc). Adding wrappers around them would be simpler than reimplementing them.

-- bye, pabs

https://docs.indymedia.org/view/Main/PaulWise

Barry Warsaw

12:43 p.m.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

On Jul 8, 2007, at 1:06 AM, Paul Wise wrote:

...

My hesitation about this has always been the turnkey question.
Pipermail has its problems but it /does/ allow small sites to get
going very quickly with a full(-ish) solution.

It may be that most people get their Mailman installation from their
distro or hosting service and this is no longer as important. In
that case, I still wouldn't chuck Pipermail, but I would try to see
if we can adopt Jeff's goal of making the archive selection pluggable
and easily selectable by list admins.
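A minimal sketch of what a pluggable, per-list archiver selection could look like is below. Nothing here is existing Mailman code: the registry, the class names, the archive() method and the external command are all hypothetical, and the "external" archiver only records what it would hand off rather than running anything.

```python
from email.message import EmailMessage

# Hypothetical registry mapping an archiver name to its class.
ARCHIVERS = {}


def register(name):
    """Class decorator that makes an archiver selectable by name."""
    def decorator(cls):
        ARCHIVERS[name] = cls
        return cls
    return decorator


@register("pipermail")
class BuiltinArchiver:
    """Stand-in for the bundled archiver: keeps messages in memory."""

    def __init__(self):
        self.stored = []

    def archive(self, list_name, msg):
        self.stored.append((list_name, msg["Message-ID"]))


@register("external")
class ExternalArchiver:
    """Stand-in for a wrapper around an outside archiver program."""

    def __init__(self, command=("external-archiver", "--stdin")):
        self.command = command  # hypothetical command line
        self.handed_off = []

    def archive(self, list_name, msg):
        # A real wrapper would pipe msg.as_bytes() to self.command.
        self.handed_off.append((self.command, list_name, msg["Message-ID"]))


def get_archiver(name):
    """Return a new archiver instance for a list admin's chosen name."""
    try:
        return ARCHIVERS[name]()
    except KeyError:
        raise ValueError(f"unknown archiver: {name!r}") from None


# A list configured to use the bundled archiver.
post = EmailMessage()
post["Message-ID"] = "<demo-2@example.com>"
archiver = get_archiver("pipermail")
archiver.archive("mylist", post)
```

With something along these lines, the turnkey default stays in place for small sites while a list admin could switch to another registered archiver by changing one setting.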

-Barry

-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (Darwin)

iQCVAwUBRqCt4HEjvBPtnXfVAQJHQwP+P4KAQaA7uEeISQjFyb3zoMvOWwgoW3zH taWsnVAhVmAF/hJBWDn7JtXwWiLw7ngCtGHp3MBKGBKzBjJP7ZizEMNfziaB+OoO LOyF7sYB+KhKVi+Il7XnHYIjh6DSD8kullP+G/UNtuIsFnNs+aTntndfMKJG2Zct E7M0F1Ok8FE= =xXQJ -----END PGP SIGNATURE-----
