[Mailman-Developers] Improving the archives

Jeff Breidenbach jeff at jab.org
Thu Jul 26 08:23:55 CEST 2007


> If you improve the script or find numbers that lead to different
> conclusions, now's the time to know!

Live and learn!

So I just looked at 2 million raw messages from 2007, spread over
a few thousand mailing lists (all data is from mail-archive.com). My
first question was - when comparing only with messages from the
same list - how many times do I see a repeated message-id? The
answer was ... drumroll please ... 260 thousand. What the hell?
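
For anyone who wants to repeat the count on their own data, here is a
minimal sketch in Python. It is not the script I ran: the function
name, the (list name, raw text) input shape, and the choice to count
every copy after the first as one repeat are all assumptions made for
illustration.

```python
from collections import Counter
from email import message_from_string


def count_repeats(raw_messages):
    """Count repeated Message-IDs, comparing only within the same list.

    raw_messages: iterable of (list_name, raw_text) pairs (assumed shape).
    Every copy after the first one counts as one repeat (assumption).
    """
    seen = Counter()
    for list_name, raw in raw_messages:
        msg = message_from_string(raw)
        msgid = (msg.get("Message-ID") or "").strip()
        if msgid:  # messages without a Message-ID are ignored here
            seen[(list_name, msgid)] += 1
    return sum(n - 1 for n in seen.values() if n > 1)
```

Keying on the (list, message-id) pair keeps a cross-posted message from
counting as a repeat just because it shows up on two different lists.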

Time for a closer look. In some cases, the archiver was getting two
copies of every message. For example, the MLM (mailman) was
sending out a message to subscriber A and subscriber B, and both
paths eventually led to the archiver.

In another case, the MLM (YahooGroups) spammed 20 copies of the
same message to every subscriber, and modified the body of each one.
YahooGroups tends to create HTML mail and stick ads, possibly spyware,
and who knows what other crap in message footers.

There are probably other categories I haven't noticed yet; 260k
messages is a lot of checking. So you'd think the archives would be a
complete mess. But they aren't, and I had no idea anything was remotely
amiss under the hood. That's because mhonarc only archives one message
per message-id. So those 19 repeats from YahooGroups get thrown away.
This is actually a pretty robust strategy when you think about it: it
keeps lots of annoyances out of archives, and everyone who gets smited
deserves it (accidental duplicates, malicious duplicates, broken mail
transfer agents). Reasonable people can disagree, but I like it.
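
The one-message-per-message-id strategy can be sketched like this. It
is an illustration of the idea, not mhonarc's actual code; the function
name is made up, and keeping the first copy that arrives (rather than
the last) is an assumption.

```python
def keep_first_per_message_id(messages):
    """Archive at most one message per Message-ID.

    messages: iterable of (message_id, payload) pairs in arrival order.
    Returns (archived, dropped): a dict mapping message_id to the first
    payload seen, and the number of later copies that were discarded.
    """
    archived = {}
    dropped = 0
    for message_id, payload in messages:
        if message_id in archived:
            dropped += 1  # later copy with the same id: discard it
            continue
        archived[message_id] = payload
    return archived, dropped
```

Fed the 20 YahooGroups copies, each with a different body but the same
message-id, this keeps one and drops the other 19.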

So I'm amending my request. If mailman and pipermail++ want to
keep a verbatim record of everything passing through the MLM, fine.
But please make it also possible to interoperate with archivers that
use the looser mhonarc strategy, e.g. allow the interoperability URL
to collide when message-ids collide. Currently Stephen's proposal
allows this, Barry's does not.

Just to make things really concrete, here's an example from that
YahooGroups collision I was describing. The 20 messages spammed to
subscribers would all have an interoperability URL something like this
(but perhaps not quite so enormously long) embedded in the
message, in the headers and possibly a footer.

http://www.mail-archive.com/search?l=estika%40yahoogroups.com&q=3578.125.161.129.196.1175036508.CBNWebMail%40webmail1.cbn.net.id

Clicking on it, the user goes to the archive server. For this particular
archiver, an HTTP 302 redirect takes the user to another URL which
happens to be more human friendly. But the details of what alternate
URLs are available - if any - are really up to the archive server.

http://www.mail-archive.com/estika@yahoogroups.com/msg01341.html

I think that's about it. I do kind of like Stephen's suggestion of
allowing the archiver to supply a formula for the interoperability URL;
if that's the case I'd say the RFC2369 headers could be fair game
for use in the calculation. That allows cross posted messages to
easily link to their correct archive - note how I used the contents of
List-Post when creating the interoperability URL above.
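
To show what such a formula might look like, here is a sketch that
reproduces the example URL above from two header values. The function
name is hypothetical, and the List-Post and Message-ID values in the
usage note are inferred by decoding that URL, not copied from the
original message; treat the whole thing as one possible formula, not
the definitive one.

```python
from urllib.parse import quote


def interoperability_url(list_post, message_id,
                         base="http://www.mail-archive.com/search"):
    """Build a URL from the List-Post and Message-ID header values."""
    # List-Post usually looks like "<mailto:list@example.org>".
    address = list_post.strip().strip("<>")
    if address.lower().startswith("mailto:"):
        address = address[len("mailto:"):]
    address = address.split("?", 1)[0]  # drop any "?subject=..." part
    msgid = message_id.strip().strip("<>")
    # safe="" makes quote() percent-encode "@" (and "/") as well.
    return f"{base}?l={quote(address, safe='')}&q={quote(msgid, safe='')}"
```

Calling it with "<mailto:estika@yahoogroups.com>" and
"<3578.125.161.129.196.1175036508.CBNWebMail@webmail1.cbn.net.id>"
gives back the long search URL shown earlier. Since only the list
address and the message-id go in, all 20 YahooGroups copies collide on
the same URL, which is the behavior I'm asking for.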

Jeff

