Braindump on new archiver

I've been thinking about re-doing the whole archiver thing in Mailman for a while now, and I need to start writing these things down before I forget about them. What follows is basically a braindump, for your amusement; feel free to comment.
First off, I have a couple of handicaps. Some are severe. The first one is that I have very little time, which is one of the reasons I'm writing this braindump: I don't have time to implement it right now :P The other is that, for a living, I build and maintain large to very large, high-performance internet servers (mostly Apache webservers with bells 'n whistles.) This means I am usually too performance oriented. And lastly: I like the 'Maildir' mailbox format very much. This is a handicap because, unfortunately, not all mail programs support it. 'Pine', specifically, is boycotting the Maildir format (don't ask me why, I gave up on them.)
The good thing about the Maildir format is that it's very NFS-safe. All of our customers (except a handful who refuse to upgrade from 'elm' to 'mutt') have a maildir mailbox, and we have 6 SMTP delivery boxes, 5 POP retrieval boxes, one IMAP retrieval box (used only for our webmail server, which is why it's only one right now) and several shell servers that have all kinds of UNIX mail clients running. The mailspool itself is a dedicated NFS server (a NetApp Filer.) We have a patched version of Pine that works with maildir whether it likes it or not :)
Maildir works by making every message a file on its own. A mailbox is a directory with three subdirectories, 'cur', 'new' and 'tmp'. Messages in 'new' are unread, messages in 'cur' are read, and messages in 'tmp' are in transit. Message-metadata and state is maintained using the name of the message-file itself. Delivery is done by creating a new file in mailbox/tmp/, writing all data to it, and when it's done, moving it to 'new'. Writing data to a file is not trustworthy, over NFS, but renaming and moving files are.
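The delivery step described above (write into 'tmp', then rename into 'new') can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `maildir_deliver` and the exact file-naming scheme are hypothetical, not taken from Mailman or from any particular MTA.

```python
import itertools
import os
import socket
import time

# Per-process counter so two deliveries in the same second get distinct names.
_counter = itertools.count()


def maildir_deliver(mailbox_dir, data):
    """Write `data` (bytes) into mailbox_dir/tmp, then rename it into new/."""
    for sub in ("tmp", "new", "cur"):
        os.makedirs(os.path.join(mailbox_dir, sub), exist_ok=True)
    # Hypothetical naming scheme: time, pid, counter and host name.
    host = socket.gethostname().replace("/", "_")
    name = "%d.%d_%d.%s" % (time.time(), os.getpid(), next(_counter), host)
    tmp_path = os.path.join(mailbox_dir, "tmp", name)
    new_path = os.path.join(mailbox_dir, "new", name)
    # "x" mode refuses to overwrite an existing file with the same name.
    with open(tmp_path, "xb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    # The rename is the only step readers can observe, so they never see
    # a half-written message.
    os.rename(tmp_path, new_path)
    return new_path
```

A reader that only ever looks in 'new' and 'cur' needs no lock, because a message appears there only once it is complete.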
What I'm thinking for a Mailman archiver (or just any email-webarchiver, in fact) is something like the following:
Mail delivery can happen on any host. The Mailman process, after doing its normal pre-delivery stuff, 'delivers' the message by dropping it in a Maildir folder (rather than appending it to an 'mbox' mailbox) _and_ cgi-escaping/header-cleaning the message and dropping it in a monthly/weekly/daily archive directory. The names are generated from unique data, like the time of day, the machine name and the process-id, possibly hashed for obscurity. No locking is necessary. The URL to that _specific_ message is known as soon as Mailman calculates that name, and so it can be inserted into the message that will be sent out to the list. (A common feature request :)
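As a rough sketch of the naming idea: the name is derived from the time, the machine name and the process-id, hashed, and the URL follows directly from it. Everything concrete here is an assumption for illustration: the helper names `archive_name` and `archive_url`, the choice of SHA-1, the monthly directory layout and the example host are all hypothetical.

```python
import hashlib
import os
import socket
import time


def archive_name(now=None, host=None, pid=None):
    """Build an obscured, practically unique name from time, host and pid."""
    now = time.time() if now is None else now
    host = socket.gethostname() if host is None else host
    pid = os.getpid() if pid is None else pid
    raw = "%.6f.%s.%d" % (now, host, pid)
    # Hashing is only for obscurity; uniqueness comes from the raw inputs.
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


def archive_url(base_url, listname, name, now=None):
    """Return the URL of a message in a (hypothetical) monthly archive layout."""
    month = time.strftime("%Y-%m", time.gmtime(now))
    return "%s/%s/%s/%s.html" % (base_url.rstrip("/"), listname, month, name)


if __name__ == "__main__":
    name = archive_name()
    print(archive_url("http://lists.example.org/archive", "mailman-dev", name))
```

Because the name needs no shared counter or lock, any delivery host can compute it and put the URL into the outgoing message before the archive copy is even written. A process that archives more than one message per clock tick would need an extra counter in the raw string.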
Non-text attachments could be saved as separate files, and replaced by URLs to them in the archives. This should be possible to do both based on type and on size, of course.
The web archives are indexed once every [X] minutes, by looking at all the new messages and processing them. Processing them consists of creating indexes based on threads, subjects, author, etc, exactly like now. State data could be stored in a .db (berkeley db, marshal, pickle, whatever) and need not even be protected by a lock, if done properly: When started, the indexing process should compare the mtime of the .db file with the current time, as well as with the mtime of the most recently added message. If the .db file is too new, it should not run. If the .db file is old, and the newest message is newer, it should touch the .db file (leaving it exactly as it is, just updating mtime) and start working, keeping in mind to touch the file every now and then. The new .db file should be written to a temporary name, and rename()'d when it's done.
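The lock-free indexing protocol above might look something like the following. This is a sketch under assumptions, not Mailman code: the function names, the `MIN_INTERVAL` threshold for "too new" and the use of pickle for the state file are all hypothetical choices.

```python
import os
import pickle
import time

MIN_INTERVAL = 300  # hypothetical: seconds before the .db counts as "old"


def should_run(db_path, newest_message_mtime, now=None, min_interval=MIN_INTERVAL):
    """Decide whether an indexing run should start, using only mtimes."""
    now = time.time() if now is None else now
    try:
        db_mtime = os.stat(db_path).st_mtime
    except FileNotFoundError:
        return True  # no index yet, so build one
    if now - db_mtime < min_interval:
        return False  # too new: a recent run finished or is still touching it
    return newest_message_mtime > db_mtime


def touch(db_path):
    """Update the mtime only, to signal that an indexing run is in progress."""
    if os.path.exists(db_path):
        os.utime(db_path, None)


def write_index(db_path, state):
    """Write the new state under a temporary name, then rename it into place."""
    tmp_path = "%s.tmp.%d" % (db_path, os.getpid())
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    os.rename(tmp_path, db_path)
```

An indexer would call `should_run()`, then `touch()` at the start and periodically while working, and `write_index()` at the end. Readers always see either the old complete file or the new complete file, never a partial one.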
The individual archive message URLs would actually call a CGI script that looks up state data from the .db file, grabs the cgi-escaped file, prepends and appends a standard header and footer containing 'next in thread', 'next in subject' and the like generated from the .db file, and feeds that to the browser. If state data is absent, but the message itself is there, the standard header and footer are added with a note saying that the message is too new to have links to other messages -- but at least the message will be visible.
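The core of such a CGI script could be as small as the function below. It is a sketch only: `render_message`, the shape of the `index` dictionary and the link labels are assumptions made for illustration, and the message body is taken to be already escaped at delivery time, as described above.

```python
import html


def render_message(name, escaped_body, index):
    """Wrap an already-escaped message body in a navigation header and footer.

    `index` is a hypothetical dict loaded from the .db file, mapping a
    message name to entries such as {"next_thread": "<other name>"}.
    """
    entry = index.get(name)
    if entry is None:
        # The message exists but has not been indexed yet.
        nav = "<p>This message is too new to have links to other messages.</p>"
    else:
        links = []
        for label, key in (("Next in thread", "next_thread"),
                           ("Next in subject", "next_subject")):
            target = entry.get(key)
            if target:
                href = html.escape(target, quote=True)
                links.append('<a href="%s.html">%s</a>' % (href, label))
        nav = "<p>%s</p>" % " | ".join(links)
    return "%s\n<pre>%s</pre>\n%s" % (nav, escaped_body, nav)
```

Since the body on disk never changes after delivery, only the small header and footer depend on the index, which is what lets a not-yet-indexed message still be shown.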
Mail delivery and the Web-interface can be run on multitudes of servers, all sharing data over NFS, and still be able to communicate in a sane way, without waiting for each other's locks forever. The only thing that can go wrong is for a message to be archived and sent out, and then read and looked up in the archives before the actual NFS data (the content of the message) is visible on the web-interface machine(s). Given the average speed of SMTP, I do not consider this a serious problem :)
One thing should run on a single server, however, and that's the archive-indexing process. Given that news-feeding and mailing out passwords should also be done on a single server, I don't see that as a problem either. And all of the above should work just as well on a single machine doing everything on a local disk.
While under the 'the filesystem as the database' spell (hell, I'm still under that one) I also considered using the filesystem to index messages based on thread, subject, author, etc: a directory for each category, in which symlinks point to messages. To go from message 'X' to the next message in the thread, check where the symlink 'next-threads/X' points. The same for previous-thread, next-author, etc. I'm not sure if it will be manageable without adding a lockfile, nor how efficient it would be to generate overviews from such data.
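For what it's worth, the symlink variant is easy to prototype. The sketch below assumes a hypothetical layout with a 'messages' directory next to the 'next-threads' style index directories; the function names are likewise made up. It says nothing about the open questions of locking and overview generation.

```python
import os


def link_next(index_root, category, name, next_name):
    """Point <index_root>/next-<category>/<name> at the following message."""
    index_dir = os.path.join(index_root, "next-" + category)
    os.makedirs(index_dir, exist_ok=True)
    tmp_link = os.path.join(index_dir, ".%s.%d" % (name, os.getpid()))
    # Create the link under a temporary name, then rename it into place,
    # so an existing link is replaced in a single step.
    os.symlink(os.path.join("..", "messages", next_name), tmp_link)
    os.rename(tmp_link, os.path.join(index_dir, name))


def next_in(index_root, category, name):
    """Return the name of the next message in `category`, or None."""
    try:
        target = os.readlink(os.path.join(index_root, "next-" + category, name))
    except FileNotFoundError:
        return None
    return os.path.basename(target)
```

Following one link is a single readlink() call; building a whole thread overview would mean walking the chain one link at a time, which is where the efficiency concern comes in.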
Another thing for consideration is the 'downloadable mailbox' link. Should it download the maildir mailbox, or generate an mbox one on demand? The maildir mailbox is *very* useful for incremental updates, since all that takes is checking which files you've already read. But, like I said, it's not as widely supported as the crappy ol' mbox format.
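Generating the mbox on demand is not much work with the `mailbox` module from the Python standard library (which postdates this post). A minimal sketch, with a hypothetical function name and no caching or locking:

```python
import mailbox


def maildir_to_mbox(maildir_path, mbox_path):
    """Copy every message from a Maildir into a new mbox file.

    Returns the number of messages copied.
    """
    src = mailbox.Maildir(maildir_path, create=False)
    dst = mailbox.mbox(mbox_path, create=True)
    count = 0
    try:
        # Maildir keys are the unique parts of the file names; sorting them
        # gives a stable order (roughly by delivery time when the names
        # begin with a timestamp).
        for key in sorted(src.keys()):
            dst.add(src[key])
            count += 1
        dst.flush()
    finally:
        dst.close()
        src.close()
    return count
```

An incremental download from the maildir side would instead just list 'new' and 'cur' and fetch the files the client has not seen yet.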
End-of-rant--we-will-now-return-to-your-regular-scheduled-static-ly y'rs, ;)
Thomas Wouters <thomas@xs4all.net>
Hi! I'm a .signature virus! copy me into your .signature file to help me spread!

On Sat, 16 Jun 2001, Thomas Wouters wrote:
> Maildir works by making every message a file on its own. A mailbox is a directory with three subdirectories, 'cur', 'new' and 'tmp'. Messages in 'new' are unread, messages in 'cur' are read, and messages in 'tmp' are in transit.
What about file system limitations to the maximum number of files in a directory?
I've not been a student at CMU for a long time, now, but while their mail system also does a file per message it is on a special FS--AFS.
-Dale
Participants (2): Dale Newfield, Thomas Wouters