[Mailman-Developers] Braindump on new archiver

Thomas Wouters thomas@xs4all.net
Sat, 16 Jun 2001 01:15:45 +0200


I've been thinking about re-doing the whole archiver thing in Mailman for a
while now, and I need to start writing these things down before I forget
about them. What follows is basically a braindump, for your amusement; feel
free to comment on it.

First off, I have a few handicaps. Some are severe. The first one is
that I have very little time, which is one of the reasons I'm writing this
braindump: I don't have time to implement it right now :P Another is that,
for a living, I build and maintain large to very large, high-performance
internet servers (mostly Apache webservers with bells 'n whistles.) This
means I am usually too performance oriented. And lastly: I like the
'Maildir' mailbox format very much. This is a handicap because,
unfortunately, not all mail programs support it. 'Pine', specifically, is
boycotting the Maildir format (don't ask me why, I gave up on them.)

The good thing about the Maildir format is that it's very NFS-safe. All of
our customers (except a handful who refuse to upgrade from 'elm' to 'mutt')
have a maildir mailbox, and we have 6 SMTP delivery boxes, 5 POP retrieval
boxes, one IMAP retrieval box (used only for our webmail server, which is
why it's only one right now) and several shell servers that have all kinds
of UNIX mail clients running. The mailspool itself is a dedicated NFS server
(a NetApp Filer.) We have a patched version of Pine that works with maildir
whether it likes it or not :)

Maildir works by making every message a file on its own. A mailbox is a
directory with three subdirectories, 'cur', 'new' and 'tmp'. Messages in
'new' are unread, messages in 'cur' are read, and messages in 'tmp' are in
transit. Message metadata and state are maintained using the name of the
message file itself. Delivery is done by creating a new file in
mailbox/tmp/, writing all data to it, and when it's done, moving it to
'new'. Writing data to a file is not trustworthy over NFS, but renaming and
moving files are.
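
To make that delivery step concrete, here is a minimal Python sketch. It is
not code from Mailman or from any mail program; the function name
maildir_deliver() and the exact file-name layout are made up for the
example, and it assumes tmp/ and new/ live on the same filesystem.

```python
import itertools
import os
import socket
import time

# Per-process counter so two deliveries in the same second get distinct names.
_counter = itertools.count()


def maildir_deliver(mailbox_dir, data):
    """Write `data` (bytes) under tmp/, then rename it into new/."""
    for sub in ("tmp", "new", "cur"):
        os.makedirs(os.path.join(mailbox_dir, sub), exist_ok=True)

    # Hypothetical naming scheme: time, pid plus counter, and host name.
    host = socket.gethostname().replace("/", "_").replace(":", "_")
    name = "%d.%d_%d.%s" % (time.time(), os.getpid(), next(_counter), host)
    tmp_path = os.path.join(mailbox_dir, "tmp", name)
    new_path = os.path.join(mailbox_dir, "new", name)

    # O_EXCL makes the create fail instead of overwriting an existing file.
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

    # Readers only look in new/ and cur/, so they never see a partial file.
    os.rename(tmp_path, new_path)
    return new_path
```

Calling maildir_deliver("/some/mailbox", raw_bytes) returns the path of the
message in new/; nothing here takes a lock.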

What I'm thinking for a Mailman archiver (or just any email-webarchiver, in
fact) is something like the following:

- Mail delivery can happen on any host. The Mailman process, after doing its
normal pre-delivery stuff, 'delivers' the message by dropping it in a
Maildir folder (rather than appending it to an 'mbox' mailbox) _and_
cgi-escaping/header-cleaning the message and dropping it in a
monthly/weekly/daily archive directory. The names are generated from unique
data, like the time of day, the machine name and the process-id, possibly
hashed for obscurity. No locking is necessary. The URL to that _specific_
message is known as soon as Mailman calculates that name, and so it can be
inserted into the message that will be sent out to the list. (A common
feature request :)

- Non-text attachments could be saved as separate files, and replaced by
URLs to them in the archives. This should be possible to do both based on
type and on size, of course.

- The web archives are indexed once every [X] minutes, by looking at all the
new messages and processing them. Processing them consists of creating
indexes based on threads, subjects, author, etc, exactly like now. State
data could be stored in a .db (berkeley db, marshal, pickle, whatever) and
need not even be protected by a lock, if done properly: When started, the
indexing process should compare the mtime of the .db file with the current
time, as well as the last added new message. If the .db file is too new, it
should not run. If the .db file is old, and the newest message is newer, it
should touch the .db file (leaving it exactly as it is, just updating mtime)
and start working, keeping in mind to touch the file every now and then. The
new .db file should be written to a temporary name, and rename()'d when it's
done.

- The individual archive message URLs would actually call a CGI script that
looks up state data from the .db file, grabs the cgi-escaped file, prepends
and appends a standard header and footer containing 'next in thread', 'next
in subject' and the like generated from the .db file, and feeds that to the
browser. If state data is absent, but the message itself is there, the
standard header and footer are added with a note saying that the message is
too new to have links to other messages -- but at least the message will be
visible.
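
The page assembly in that last item could look roughly like the Python
sketch below. It is only an illustration: render_message() is a made-up
name, and the shape of the state data (a dict mapping a message name to a
dict of link label to target message name) is my assumption, not a decided
format. The message file is assumed to have been escaped at delivery time.

```python
import html
import os


def render_message(archive_dir, state, msg_id):
    """Return an HTML page for msg_id, or None if it is not archived."""
    # Refuse anything that is not a plain file name (no path traversal).
    if not msg_id or os.path.basename(msg_id) != msg_id:
        return None
    try:
        with open(os.path.join(archive_dir, msg_id), encoding="utf-8") as f:
            body = f.read()  # already escaped when it was delivered
    except OSError:
        return None

    links = state.get(msg_id)
    if links is None:
        # Message exists but the indexer has not seen it yet.
        nav = "<p>This message is too new to have links to other messages.</p>"
    else:
        nav = "<p>%s</p>" % " | ".join(
            '<a href="%s">%s</a>'
            % (html.escape(target, quote=True), html.escape(label))
            for label, target in sorted(links.items())
        )
    # The same navigation block serves as both header and footer.
    return "<html><body>%s<pre>%s</pre>%s</body></html>" % (nav, body, nav)
```

A real CGI script would load `state` from the .db file and print the result
after a Content-Type header; that part is left out here.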

Mail delivery and the Web-interface can be run on multitudes of servers, all
sharing data over NFS, and still be able to communicate in a sane way,
without waiting for each other's locks forever. The only thing that can go
wrong is for a message to be archived and sent out, and then read and looked
up in the archives before the actual NFS data (the content of the message)
is visible on the web-interface machine(s). Given the average speed of SMTP,
I do not consider this a serious problem :)
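
What lets the delivery hosts work without locks is that each one can pick
a file name no other host will pick, and derive the URL from it right away.
A small Python sketch of that idea follows; the function names, the use of
SHA-1 for the "hashed for obscurity" part, the monthly directory layout and
the example URL are all assumptions for illustration. A process archiving
several messages within one clock tick would need an extra counter in the
name.

```python
import hashlib
import os
import socket
import time


def archive_name(now=None, host=None, pid=None):
    """Build a message name from time, host name and process id, hashed."""
    now = time.time() if now is None else now
    host = socket.gethostname() if host is None else host
    pid = os.getpid() if pid is None else pid
    raw = "%.6f.%s.%d" % (now, host, pid)
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def archive_url(base_url, listname, now=None, host=None, pid=None):
    """Return the URL of a message in a (hypothetical) monthly archive."""
    now = time.time() if now is None else now
    period = time.strftime("%Y-%m", time.gmtime(now))
    name = archive_name(now, host, pid)
    return "%s/%s/%s/%s.html" % (base_url.rstrip("/"), listname, period, name)
```

Since the URL depends only on values the delivering process already has, it
can be inserted into the outgoing message before the archive copy is even
visible to the web servers.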

One thing should run on a single server, however, and that's the
archive-indexing process. Given that news-feeding and mailing out passwords
should also be done on a single server, I don't see that as a problem either.
And all of the above should work just as well on a single machine doing
everything on a local disk.
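
The mtime check that stands in for a lock on the indexer could be sketched
like this in Python. The function names and the five-minute threshold are
invented for the example, and pickle is just one of the storage options
mentioned above. Note that this is a heuristic rather than a real lock: two
indexers started at the same instant could both decide to run, which is one
more reason to keep the indexer on a single server.

```python
import os
import pickle
import time


def should_run(db_path, newest_msg_mtime, min_age=300, now=None):
    """Decide whether the indexer should start, using only mtimes."""
    now = time.time() if now is None else now
    try:
        db_mtime = os.stat(db_path).st_mtime
    except FileNotFoundError:
        return True  # no index yet, so build one
    if now - db_mtime < min_age:
        return False  # .db too new: a run is active or just finished
    return newest_msg_mtime > db_mtime  # only run if there is new mail


def claim(db_path):
    """Touch the .db file (content untouched) to signal a run in progress."""
    if os.path.exists(db_path):
        os.utime(db_path, None)


def write_state(db_path, state):
    """Write the new state under a temporary name, then rename it in place."""
    tmp_path = "%s.tmp.%d" % (db_path, os.getpid())
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, db_path)  # readers see the old or the new file
```

A long indexing run would call claim() every now and then so the .db file
keeps looking "too new" to any other indexer that wakes up.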

While under the 'the filesystem as the database' spell (hell, I'm still
under that one) I also considered using the filesystem to index messages
based on thread, subject, author, etc: a directory for each category, in
which symlinks point to messages. To go from message 'X' to the next message
in the thread, check where the symlink 'next-threads/X' points. The same for
previous-thread, next-author, etc. I'm not sure if it will be manageable without
adding a lockfile, nor how efficient it would be to generate overviews from
such data.
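
For what it's worth, the symlink idea is easy to try out. The Python sketch
below is only an experiment, assuming a POSIX filesystem with symlinks; the
function names and the 'messages' directory are made up, and only the
'next-threads/X' layout comes from the description above.

```python
import os


def link_next(archive_root, category, msg_name, next_name):
    """Point <category>/<msg_name> at the next message in that category."""
    cat_dir = os.path.join(archive_root, category)
    os.makedirs(cat_dir, exist_ok=True)
    link = os.path.join(cat_dir, msg_name)
    tmp = "%s.tmp.%d" % (link, os.getpid())
    # Create the symlink under a temporary name, then rename it over the
    # old one, so a reader never finds the link missing halfway through.
    os.symlink(os.path.join("..", "messages", next_name), tmp)
    os.replace(tmp, link)


def follow(archive_root, category, msg_name):
    """Return the message name the symlink points at, or None if absent."""
    try:
        target = os.readlink(os.path.join(archive_root, category, msg_name))
    except OSError:
        return None
    return os.path.basename(target)
```

So follow(root, 'next-threads', 'X') answers "what comes after X in the
thread" with a single readlink(). Generating a full overview would still
mean walking the chain one link at a time, which is where the efficiency
question comes in.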

Another thing for consideration is the 'downloadable mailbox' link. Should
it download the maildir mailbox, or generate an mbox one on demand? The
maildir mailbox is *very* useful for incremental updates, since all that
takes is checking which files you've already read. But, like I said, it's
not as widely supported as the crappy ol' mbox format.
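
Generating the mbox on demand need not be much work. Here is a minimal
sketch using the 'mailbox' module from the standard library of current
Python versions; maildir_to_mbox() is a made-up name, and it assumes
mbox_path does not exist yet (an existing mbox would be appended to).

```python
import mailbox


def maildir_to_mbox(maildir_path, mbox_path):
    """Copy every message in a Maildir into an mbox file; return the count."""
    src = mailbox.Maildir(maildir_path, create=False)
    dst = mailbox.mbox(mbox_path, create=True)
    count = 0
    dst.lock()
    try:
        # Sorting the keys gives a stable order; with time-based file
        # names that is also roughly the order of arrival.
        for key in sorted(src.keys()):
            dst.add(src[key])
            count += 1
        dst.flush()
    finally:
        dst.unlock()
        dst.close()
        src.close()
    return count
```

The download link could then either run this per request or serve a cached
copy that is regenerated whenever the indexer runs.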

End-of-rant--we-will-now-return-to-your-regular-scheduled-static-ly y'rs,
 ;)
-- 
Thomas Wouters <thomas@xs4all.net>

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!