Requirements for a new archiver
Hi,
For those of you who don't know, I am currently working on an archive component for Mailman as part of my degree. The interface to the archive will be based on the ideas in Ka-Ping Yee's paper on his Zest prototype. Over the past two weeks I have been looking at requirements and have the following, in no specific order. Due to the time constraints on my project (I am to spend only 200 hours in total on it, including writing reports, presentations, etc.) there is a limit to the amount I can do.
Functional Requirements
The archive component should:
- store email discussions.
- integrate with Mailman.
- provide a web-based interface to those email-discussions.
- provide an interface that threads discussions by their content. (ZEST)
- provide an interface that threads discussions by e-mail replies.
- allow for full-text searching of the archives.
- allow for filtering by date, author, and/or topic.
- be MIME aware.
- allow archives to be set as public or private.
- allow posts to be added, deleted, and modified through the web interface.
- allow archives to be locked to prevent modification.
- allow postings to be emailed.
- allow postings to be referenced externally.
Non-Functional Requirements
- Maintainable
- Secure
- Scalable
The minimum I am planning on doing is the first five functional requirements, constrained by the first two non-functional requirements.
There are two reasons I am posting this.
Is there anything obvious that I have missed?
Which of the functional requirements, 6 to 13, do you feel are the most important? (As part of my report I have to analyse the requirements captured)
Any feedback is very much appreciated. Thanks in advance.
Iain
On Mon, 27 Oct 2003, Iain Bapty wrote:
- allow for full-text searching of the archives.
- allow for filtering by date, author, and/or topic.
The minimum I am planning on doing is the first 5 functional requirements
Which of the functional requirements, 6 to 13, do you feel are the most important? (As part of my report I have to analyse the requirements captured)
I find any archiver without at least 6 and likely 7 to be unusable, and an incredible waste of the user's time.
-Dale
On Mon, Oct 27, 2003 at 12:00:57PM +0000, Iain Bapty wrote:
Any feedback is very much appreciated. Thanks in advance.
My requirement list is at http://www.amk.ca/ng-arch/ArchiverRequirements . The ng-arch code is incomplete, but if it would be helpful (and you're allowed to use it), let me know and I can send you a copy.
BTW, feel free to use either the ng-arch Wiki or mailing list for purposes related to your project; both are pretty quiet, and your project is certainly on-topic for both. If you use the Wiki, just don't make extensive edits to any of my pages; create your own new pages instead, in case I dust off the project.
--amk
Iain said:
- store email discussions.
Iain,
To me, this is the single most important part. How do you intend to store the messages?
Maybe others don't give a fig but I think that if archived messages were to be stored in an easy-to-access database then life would be good. All of the wonderful things that people want to do with message data would be easy, which is why I'm looking at using the Mail::Box Perl package to either read Pipermail mbox files or parse the messages from stdin via a dummy subscriber and alias. Either way, the goal is to get the message parts into a widely implemented and simple-to-build-web-apps-with DB (my choice is MySQL).
I was thinking about using MHonarc to enhance the archive experience but it doesn't work with MySQL directly so Mail::Box just might be what the doctor ordered.
Maybe this direction is outside of what your scope is, but I'd still be interested in how you intend to store messages.
Thanks, Kevin
On Mon, 2003-10-27 at 15:06, Kevin McCann wrote:
To me, this is the single most important part. How do you intend to store the messages?
Maybe others don't give a fig but I think that if archived messages were to be stored in an easy-to-access database then life would be good.
I agree, although I don't know if I'd store everything in MySQL.
There are a couple of ways I could see slicing things. You could store one message per file a la MH, with some elaboration to avoid inode exhaustion. Or you could store everything in an mbox file with a file offset index. Or perhaps store everything to an nntp server (Twisted would make a nice platform for this <wink>).
What would then be in the database would be records providing easy lookup by message-id (at least) into the on-disk message store.
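The mbox-plus-offset-index option Barry sketches could look roughly like this (a minimal Python illustration; the function names and the naive 'From '-line scan are my own assumptions, and real code would also have to deal with From-quoting in message bodies):

```python
def build_offset_index(mbox_path):
    """Scan an mbox file once, recording the byte offset of each
    message's 'From ' separator line, keyed by its Message-ID.
    Naive: assumes 'From ' only ever starts a separator line."""
    index = {}
    current_offset = None
    with open(mbox_path, 'rb') as f:
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            if line.startswith(b'From '):
                current_offset = pos
            elif (line.lower().startswith(b'message-id:')
                  and current_offset is not None):
                msgid = line.split(b':', 1)[1].strip().decode('ascii', 'replace')
                index.setdefault(msgid, current_offset)
    return index

def fetch_message(mbox_path, index, msgid):
    """Seek straight to a message and read until the next 'From ' line."""
    with open(mbox_path, 'rb') as f:
        f.seek(index[msgid])
        chunks = [f.readline()]          # the 'From ' separator itself
        for line in iter(f.readline, b''):
            if line.startswith(b'From '):
                break
            chunks.append(line)
    return b''.join(chunks)
```

The database row for each message would then need to carry little more than (message-id, mbox path, offset), which is exactly the lookup table described above.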
Also, I really want the next generation archiver to do everything through cgi (or equivalent programmatic interface). The ability to massage the messages on the way out to me outweighs the benefits of vending messages directly from the file system.
-Barry
On Mon, 2003-10-27 at 15:12, Barry Warsaw wrote:
On Mon, 2003-10-27 at 15:06, Kevin McCann wrote:
To me, this is the single most important part. How do you intend to store the messages?
Maybe others don't give a fig but I think that if archived messages were to be stored in an easy-to-access database then life would be good.
I agree, although I don't know if I'd store everything in MySQL.
I'd love to have these database fields in a messages table at my disposal:
id (unique to system, not message-id)
listname
subject
date
from
body
message-id
references
mime_headers
This would make it very easy to build useful and flexible web apps. The need is there. I can smell it. ;-) Bottom line, the easier you make access to all of the little bits of a message that are important in one way or another, the more widespread development will be. And the faster we'll see really, really cool mailing list-focused web apps that foster communication, collaboration and community building, all for the betterment of mankind.
:-)
- Kevin
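As a rough illustration of Kevin's field list as a schema (column types are guesses, two columns are renamed because `from` and `references` are SQL reserved words, and sqlite3 is used here purely so the sketch is self-contained; the suggestion in the thread is MySQL):

```python
import sqlite3

# Illustrative schema for the proposed messages table; names follow
# Kevin's list, types and indexes are assumptions.
SCHEMA = """
CREATE TABLE messages (
    id           INTEGER PRIMARY KEY,  -- system-unique, not the Message-ID
    listname     TEXT NOT NULL,
    subject      TEXT,
    date         TEXT,
    sender       TEXT,                 -- 'from' is a reserved word
    body         TEXT,
    message_id   TEXT UNIQUE,
    refs         TEXT,                 -- the References header
    mime_headers TEXT
);
CREATE INDEX messages_list_date ON messages (listname, date);
"""

def open_archive_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

A web app would then get at "the little bits of a message" with plain SQL, e.g. `SELECT subject, sender FROM messages WHERE listname = ? ORDER BY date`.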
On Mon, 2003-10-27 at 16:37, Kevin McCann wrote:
I'd love to have these database fields in a messages table at my disposal:
id (unique to system, not message-id)
How do we calculate this? It probably ought to be globally unique, or at least locally unique to a Mailman installation. (Then again, what happens if you move a list?) It probably also shouldn't have any usable semantics -- i.e. it should be just an identifier. Maybe just a counter, such as "124.mailman-developers.python.org".
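A counter-style identifier like the one Barry mentions could be generated along these lines (illustrative only; a real implementation would have to persist the counter across runs):

```python
import itertools

def make_id_factory(listname, host, start=1):
    """Generate opaque, locally-unique archive ids in the style
    '124.mailman-developers.python.org'.  The counter lives only in
    memory here, so this is a sketch, not production code."""
    counter = itertools.count(start)
    def next_id():
        return "%d.%s.%s" % (next(counter), listname, host)
    return next_id
```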
listname subject date from body
This is the part I'm uncertain about. Is it better to store the body in the table, or on disk, with an index pointer in the table? I was speaking with Andrew Koenig about something similar, and he said he had a very fast algorithm for finding a message in an mbox file given its message id.
message-id
Which reminds me, I still want to revisit the "does Mailman have the right to mess with the Message-ID" issue.
references mime_headers
Why not all the headers?
-Barry
Barry Warsaw wrote:
On Mon, 2003-10-27 at 15:06, Kevin McCann wrote:
To me, this is the single most important part. How do you intend to store the messages?
Undecided; I am only just starting the development stage now (overlapping with the end of my requirements work). This is a decision I will have to make over the next two weeks, and as I am relatively inexperienced I shall be asking a lot of questions and doing lots of research. I included it as a requirement, even though it is an obvious one, so that I can relate my design directly back to each requirement.
Maybe others don't give a fig but I think that if archived messages were to be stored in an easy-to-access database then life would be good.
I agree, although I don't know if I'd store everything in MySQL.
I have to explore as many of the options as time permits for my report, although I like the idea of being able to do an SQL-style query based on header information stored as separate fields.
There are a couple of ways I could see slicing things. You could store one message per file a la MH, with some elaboration to avoid inode exhaustion. Or you could store everything in an mbox file with a file offset index. Or perhaps store everything to an nntp server (Twisted would make a nice platform for this <wink>).
Twisted eh? I will have to look into that.
Also, I really want the next generation archiver to do everything through cgi (or equivalent programmatic interface). The ability to massage the messages on the way out to me outweighs the benefits of vending messages directly from the file system.
This is where my ignorance shows; could you elaborate a bit on this part, please? By this, do you mean you want all queries to be set up and executed by a user through the web interface? Why can't messages be massaged from the file system?
Thanks
Iain
On Mon, 2003-10-27 at 16:44, Iain Bapty wrote:
Twisted eh? I will have to look into that.
Indeed. I'm using it in my Mailman3 experiments, and I think while Twisted is a big package, it gives us a lot of bang for the buck.
Also, I really want the next generation archiver to do everything through cgi (or equivalent programmatic interface). The ability to massage the messages on the way out to me outweighs the benefits of vending messages directly from the file system.
This is where my ignorance shows; could you elaborate a bit on this part, please? By this, do you mean you want all queries to be set up and executed by a user through the web interface? Why can't messages be massaged from the file system?
In MM2 we made the conscious decision that public archives should be vended from the file system. That's why when you read the archives of this list through http://mail.python.org/pipermail/mailman-developers, an Alias directive maps that directly to a file on the file system. We were primarily concerned with the overhead of firing up a Python interpreter, extra processes, etc. for every archive hit. Note that private archives go through a cgi so they can enforce access rules. I think this was the right decision for the time.
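Schematically, the kind of Alias mapping Barry describes looks like this in an Apache configuration (the filesystem path is illustrative, not the actual python.org layout):

```apache
# Public archives vended straight off the file system -- no Python
# interpreter, no extra process, per hit.
Alias /pipermail/ "/var/lib/mailman/archives/public/"
```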
Chuq made some convincing arguments that even public archive access should go through a script. By generating the viewed archive message on the fly, from its native source, we'd have all kinds of control over the presentation. Such as: changing the address obfuscation rules on the fly, the ability to retract or re-publish archive messages on the fly, more advanced threading options, no artificial date divisions, the ability to change the look and feel easily, etc. With proper caching machinery and the use of more modern programmatic fulfillment of web requests (e.g. mod_python, twisted, etc.), this should be efficient enough.
-Barry
and I'm working on an update of that based on some new ideas I have. stay tuned. (but don't hold your breath, not these days...)
FWIW, I vote for storing it in a database. By using MyISAM files and splitting on listname/time, you can build lots of smaller files and use merge tables to dynamically throw them together as needed, without building really bloody huge tables. a nice compromise, but you get all sorts of fun stuff that way, easy dynamic indexing, some usable search engine stuff, etc....
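Chuq's split-then-merge scheme might look something like this in MySQL-4-era DDL (table and column names are invented for illustration; MERGE tables require identically defined MyISAM tables):

```sql
-- One MyISAM table per list per month keeps each underlying file small.
CREATE TABLE msgs_mmdev_2003_10 (
    id      INT NOT NULL,
    posted  DATETIME,
    subject VARCHAR(255),
    body    MEDIUMTEXT,
    KEY (id), KEY (posted)
) TYPE=MyISAM;

CREATE TABLE msgs_mmdev_2003_11 (
    id      INT NOT NULL,
    posted  DATETIME,
    subject VARCHAR(255),
    body    MEDIUMTEXT,
    KEY (id), KEY (posted)
) TYPE=MyISAM;

-- The MERGE table throws the monthly tables together dynamically;
-- extend the UNION list as new months are created.
CREATE TABLE msgs_mmdev_all (
    id      INT NOT NULL,
    posted  DATETIME,
    subject VARCHAR(255),
    body    MEDIUMTEXT,
    KEY (id), KEY (posted)
) TYPE=MERGE UNION=(msgs_mmdev_2003_10, msgs_mmdev_2003_11)
  INSERT_METHOD=LAST;
```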
On Mon, 2003-10-27 at 18:33, Chuq Von Rospach wrote:
and I'm working on an update of that based on some new ideas I have. stay tuned. (but don't hold your breath, not these days...)
I can imagine, what with G5's, Windows iTunes and Panther. :)
FWIW, I vote for storing it in a database. By using MyISAM files and splitting on listname/time, you can build lots of smaller files and use merge tables to dynamically throw them together as needed, without building really bloody huge tables. a nice compromise, but you get all sorts of fun stuff that way, easy dynamic indexing, some usable search engine stuff, etc...
MyISAM tables aren't transactional. Would we care? Probably not for this application, but for my Mailman 3 experiments, I'm storing list and user data in transactional BerkeleyDB tables because I definitely think we want that extra safety.
-Barry
On Oct 27, 2003, at 3:47 PM, Barry Warsaw wrote:
MyISAM tables aren't transactional. Would we care? Probably not for this application, but for my Mailman 3 experiments, I'm storing list and user data in transactional BerkeleyDB tables because I definitely think we want that extra safety.
very unlikely for archives. And with mySQL 4, you can use one of the newer formats with row locking and transactions. they do intermingle nicely.
Chuq said:
very unlikely for archives. And with mySQL 4, you can use one of the newer formats with row locking and transactions. they do intermingle nicely.
Yes. MySQL can handle transactions just fine. For more info:
http://www.mysql.com/doc/en/ANSI_diff_Transactions.html
- Kevin
exhaustion. Or you could store everything in an mbox file with a file offset index. Or perhaps store everything to an nntp server (Twisted would make a nice platform for this <wink>). ... Also, I really want the next generation archiver to do everything through cgi (or equivalent programmatic interface). The ability to massage the messages on the way out to me outweighs the benefits of vending messages directly from the file system.
Well, since you bring this up.... I've been giving this some thought over the last few weeks, since this latest fit of discussions about archivers cropped up. I've written up some code to address the problem to my satisfaction, along with a quick draft manifesto to explain myself. It's too long to inline here, but I put a copy on the web:
http://home.uchicago.edu/~dgc/sw/mmimap/
Meanwhile, to cut to the chase: I decided IMAP is the way to handle this, and I've implemented what I need to provide it for both public and private lists. There are scripts to extract authentication material from Mailman, and an IMAP proxy daemon that performs authentication and sets up an environment to hand off to UW-IMAP.
I've tested on our production server with a restricted set of users. No complaints, and all the testers approve of the approach. Our server needs an upgrade before it's powerful enough to do IMAP for 2000 lists (67,000 subscribers), but it's tentatively the way we plan to go. We probably won't enable HTML archival after the upgrade. We already have a webmail product in place, but if we didn't we could just plug that in on the list server to provide the HTTP access.
I realize that IMAP isn't ideal for all sites or lists, but I think it should work well for our purposes, where lists are mostly institutional, and not so public that they need to be Googled.
I'm hoping to get these materials better integrated and documented soon, maybe once I'm back from LISA. But in case anyone is interested in working with them, I've put them up on the web, linked from the above URL. If this were to be a standard solution rather than a local hack, it would probably need some refactoring for other IMAP daemons, for newer MM authenticators, etc. I'm sure I haven't done the best that can be done, and I'd certainly rather see IMAP access to archives be a standard component of (or interface to) list server software, but it's a pleasing start.
-- -D. dgc@uchicago.edu University of Chicago > NSIT > VDN > ENSS > ENSA > You are here . . . . . . . always line up dots
On Mon, 2003-10-27 at 19:02, David Champion wrote:
I'm hoping to get these materials better integrated and documented soon, maybe once I'm back from LISA. But in case anyone is interested in working with them, I've put them up on the web, linked from the above URL. If this were to be a standard solution rather than a local hack, it would probably need some refactoring for other IMAP daemons, for newer MM authenticators, etc. I'm sure I haven't done the best that can be done, and I'd certainly rather see IMAP access to archives be a standard component of (or interface to) list server software, but it's a pleasing start.
One of the reasons why I'm so interested in Twisted for MM3 is so we can provide both IMAP and NNTP access to the message store, almost for free.
Which does point to an alternative direction -- maybe we don't need any direct connection to an html archive. Maybe the archiver should just be a separate process that reads messages from the NNTP interface a MM3 might export. Just blue-skying here.
-Barry
- On 2003.10.27, in <1067300194.1066.39.camel@anthem>,
- "Barry Warsaw" <barry@python.org> wrote:
Which does point to an alternative direction -- maybe we don't need any direct connection to an html archive. Maybe the archiver should just be a separate process that reads messages from the NNTP interface a MM3 might export. Just blue-skying here.
That's pretty much the ideological basis for what I have done. We have message-delivery protocols, and tools that know about messages; why keep trying to reinvent them over HTTP? My ideal list manager would export IMAP and/or NNTP interfaces, or would have a channel for providing messages and authentication to something else that exposes IMAP or NNTP (which is the route I took). Nobody needs web access: what they need is access via a web browser. With browsers that understand NNTP and IMAP prevalent, and with a wide selection of web-mail and web-news gateways for the cases where that doesn't work, this is sufficient.
I favor IMAP over NNTP for this:
- it appeals more to the way regular people think about lists: it's their mail, only it's on a server. Most people aren't much aware of or concerned with the similarities between news/NNTP and mail/IMAP.
- many people have IMAP software. Fewer have or understand how to use NNTP software.
- my server has mostly private lists, and I'm unsatisfied with the state of NNTP authentication compared to IMAP authentication. I want this primarily for archives that people need to authenticate to, not lists whose archives should be exposed to the public.
But integrating with both is even better.
On Oct 28, 2003, at 12:30 PM, David Champion wrote:
message-delivery protocols, and tools that know about messages; why keep trying to reinvent them over HTTP?
because once you leave the niche of dealing with your fellow geeks, that's what users are going to want: browser access. NNTP is simply a non-issue any more, and IMAP is fine, but they know how to go to a URL; don't assume they can reconfigure their mailer.
Not saying don't do this, but if you write geek tools for geeks, you'll lose the rest of your audience, the non-technical users.
Nobody needs web access: what they need is access via a web browser. With browsers that understand NNTP and IMAP prevalent, and with a wide selection of web-mail and web-news gateways for the cases where that doesn't work, this is sufficient.
is it? it seems to me to (frankly) be a real hack with bad navigation, at least the stuff I've seen. I'd be happy to be proven wrong.
- many people have IMAP software.
and in many cases, it's set up by someone else, and they have no clue how to tweak it on their own, or interest.
And for intermittent or one-time access to an archive? won't bother. And how does it get into google so they know to look at it in the first place?
I'm not really thrilled with this avenue. sorry.
- On 2003.10.28, in <20485453-0988-11D8-A02B-0003934516A8@plaidworks.com>,
- "Chuq Von Rospach" <chuqui@plaidworks.com> wrote:
because once you leave the niche of dealing with your fellow geeks, that's what users are going to want. browser access. NNTP is simply a
It's the "non-geeks" I'm trying to help: I support 25,000 of them, and I really don't worry much about the "geeks". They know how to do for themselves.
non-issue any more, and IMAP is fine, but they know how to go to a URL; don't assume they can reconfigure their mailer.
Where I work -- and I know that it's not like this everywhere, but I have to assume we're not the only place like this -- we configure users' mailers for them initially. (So we can configure in access to our list server(s).) We have a telephone support line that regularly works people through mailer issues. Here, reconfiguring a mailer is not a hard problem, compared to getting usable HTML archives in a supportable server configuration.
Not saying don't do this, but if you write geek tools for geeks, you'll lose the rest of your audience, the non-technical users.
Agreed, but I don't think I'm proposing "geek tools". I'm trying to establish a shared pathway for getting into a message archive that lets geeks use their tools, and non-geeks use theirs, equally.
Nobody needs web access: what they need is access via a web browser. With browsers that understand NNTP and IMAP prevalent, and with a wide selection of web-mail and web-news gateways for the cases where that doesn't work, this is sufficient.
is it? it seems to me to (frankly) be a real hack with bad navigation, at least the stuff I've seen. I'd be happy to be proven wrong.
What seems to have bad navigation? I'm not sure what component you mean. I would say that webmail programs generally are awful, but I know that 40% of my users love using them. I think it's also relevant that every web-based list archive I've ever used is atrocious for navigation; their only selling points seem to be ease of referral and indexing. (And yes, I agree that these are important elements.)
But granted, this is an overzealous assessment. I should say: IMAP and NNTP access are sufficient for certain environments of which I believe mine is an example.
And for intermittent or one-time access to an archive? won't bother. And how does it get into google so they know to look at it in the first place?
Again, I'm not talking (for the most part) about public-access lists. I'm talking about private communities consisting mostly of people within a common real-world context. Perhaps I should have made that more clear. This happens a lot: I seriously doubt that most mailing lists, even most Mailman mailing lists, are public.
I'm not really thrilled with this avenue. sorry.
Don't be sorry. I want to google certain lists as much as the next person, and I know that this model doesn't work as well as HTML archives in that respect (though I will note as a sidebar that Google happily indexes NNTP servers). I'm not trying to kill the web archive before its time, and nothing I've proposed obviates having one. All I've described is a parallel mode of access that I believe is more appropriate and more useful in some settings.
We're already plugging external archivers into Mailman now, and nothing in this suggestion prevents us from continuing to do that. The only potential change, I would say, is that in one design, archivers would pull from NNTP, IMAP, or a message store, rather than actively being fed articles. I don't particularly advocate that, lacking a better understanding of the internals of the list server. I take no issue with leaving in a means of delivering messages to archivers; I just would like to see it become one of several message access channels, preferably all with some shared interface to the core processor.
-- -D. dgc@uchicago.edu University of Chicago > NSIT > VDN > ENSS > ENSA > You are here . . . . . . . always line up dots
That's pretty much the ideological basis for what I have done. We have message-delivery protocols, and tools that know about messages; why keep trying to reinvent them over HTTP?
There is a huge demand for web applications that use mailing list data. Mailing list archives in easily accessible databases will lead to killer community-building apps that *build* on the mailing list archives but offer other resources.
NNTP access is fine, go ahead. And IMAP all you want. But I really hope that the Mailman development community does not dismiss the *very strong desire* for flexible web scripting access to the goods. As far as I'm concerned, this is the only thing that's really holding Mailman back from being the tour de force product that it could be.
I feel like I'm beating a dead horse, and I apologize if I'm being a pain-in-the-ass with this, but I think it's important.
- Kevin
On Tuesday, 28 October 2003 at 22:20, Kevin McCann wrote:
That's pretty much the ideological basis for what I have done. We have message-delivery protocols, and tools that know about messages; why keep trying to reinvent them over HTTP?
There is a huge demand for web applications that use mailing list data. Mailing list archives in easily accessible databases will lead to killer community-building apps that *build* on the mailing list archives but offer other resources.
NNTP access is fine, go ahead. And IMAP all you want. But I really hope that the Mailman development community does not dismiss the *very strong desire* for flexible web scripting access to the goods. As far as I'm
Pardon my ignorance, but what do you mean by "flexible web scripting access"? Could you elaborate further? I am currently involved in a project which consists of adding cross-lingual capabilities to a mailing list manager[1], which to a great extent has to do with the *content* of the e-mails posted to the list.
[1] CroMaLiM: A Crosslingual Mailing List Manager: http://www.sasaska.net/cromalim/index.html
Thanks a lot for your time!
/Rafa
concerned, this is the only thing that's really holding Mailman back from being the tour de force product that it could be.
I feel like I'm beating a dead horse, and I apologize if I'm being a pain-in-the-ass with this, but I think it's important.
- Kevin
-- Rafael Cordones Marcos rcm@sasaska.net http://www.sasaska.net
At 2:30 PM -0600 2003/10/28, David Champion wrote:
Nobody needs web access: what they need is
access via a web browser. With browsers that understand NNTP and IMAP prevalent, and with a wide selection of web-mail and web-news gateways for the cases where that doesn't work, this is sufficient.
You can't assume a homogeneous client mix, one where a single program does everything. There are way more phone users than there are computer users, and the number of mobile phones in a growing number of countries exceeds the number of fixed lines. Mobile access to the web will be the next killer app. However, while most of those phones may have some sort of browser, e-mail support would come from a separate program, and USENET news clients would be non-existent.
You cannot assume a homogeneous client mix. Moreover, you can't assume broad support for less common protocols like IMAP or NNTP.
- many people have IMAP software. Fewer have or understand how to use NNTP software.
Many more people have access to some sort of web browser than to some sort of IMAP client. If you're going to do lowest-common-denominator, then IMAP loses.
-- Brad Knowles, <brad.knowles@skynet.be>
"They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++)>: a C++(+++)$ UMBSHI++++$ P+>++ L+ !E-(---) W+++(--) N+ !w--- O- M++ V PS++(+++) PE- Y+(++) PGP>+++ t+(+++) 5++(+++) X++(+++) R+(+++) tv+(+++) b+(++++) DI+(++++) D+(++) G+(++++) e++>++++ h--- r---(+++)* z(+++)
At 3:12 PM -0500 2003/10/27, Barry Warsaw wrote:
What would then be in the database would be records providing easy lookup by message-id (at least) into the on-disk message store.
Putting meta-data into the database would work. Then use that
index information to actually access the files. I recommended the same in my invited talk at <http://www.shub-internet.org/brad/papers/dihses/>.
Of course, if you're going to use a USENET interface, you should
use Diablo as the back-end. ;-)
At 3:06 PM -0500 2003/10/27, Kevin McCann wrote:
I was thinking about using MHonarc to enhance the archive experience but it doesn't work with MySQL directly so Mail::Box just might be what the doctor ordered.
No database handles "BLOB" (Binary Large OBject) storage well.
Even high-end databases have problems in this area. IMO, this is a bad idea.
Better would be to use a mailbox format that handles simultaneous
multiple access reasonably well. You can use c-client and mbx format, or MH format, or something else reasonably decent.
On Wed, 2003-10-29 at 10:13, Brad Knowles wrote:
At 3:06 PM -0500 2003/10/27, Kevin McCann wrote:
I was thinking about using MHonarc to enhance the archive experience but it doesn't work with MySQL directly so Mail::Box just might be what the doctor ordered.
No database handles "BLOB" (Binary Large OBject) storage well. Even high-end databases have problems in this area. IMO, this is a bad idea.
Agreed. I was thinking more along the lines of storing the message body as is, which, yes, might sometimes be base64-encoded. Content headers, boundary string, etc. could also be stored so as to make decoding (by a web app) a cinch. You could go further and create attachment files and point to them in a URL or file field. But keep the message intact, as it was received. That way, if you want to get into after-the-fact message delivery (a manual resend, or maybe a member missed a message and wants it in his/her inbox), it's not a chore.
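Kevin's keep-the-body-intact idea could be sketched like this in Python (the helper and the field names are my own invention; the point is simply that the raw message is preserved while the decoding hints are pulled out alongside it):

```python
import email

def archive_fields(raw_bytes):
    """Split a raw message into pieces to store: the body exactly as
    received (possibly still base64-encoded), plus the content headers
    a web app would need to decode it later."""
    msg = email.message_from_bytes(raw_bytes)
    return {
        'raw': raw_bytes,                       # intact, for resends
        'content_type': msg.get_content_type(),
        'content_transfer_encoding': msg.get('Content-Transfer-Encoding', '7bit'),
        'boundary': msg.get_boundary(),         # None if not multipart
        'subject': msg.get('Subject', ''),
    }
```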
The Messages_ table that Lyris uses in its database is a good starting point if one wants to do the same kind of thing. I can dig up the specs if there is interest.
- Kevin
Brad Knowles <brad.knowles@skynet.be> writes:
At 3:06 PM -0500 2003/10/27, Kevin McCann wrote:
I was thinking about using MHonarc to enhance the archive experience but it doesn't work with MySQL directly so Mail::Box just might be what the doctor ordered.
No database handles "BLOB" (Binary Large OBject) storage well. Even high-end databases have problems in this area. IMO, this is a bad idea.
Better would be to use a mailbox format that handles simultaneous multiple access reasonably well. You can use c-client and mbx format, or MH format, or something else reasonably decent.
Hmm... Maildirs. With just a bit of minor trickery, the unique filename created to receive a message as it arrives at Mailman could be put into the saved RFC 822 header (much like MTAs place a queue id), or into the message trailer if you must. That name could perhaps be preserved as the message is moved/copied from one directory to another, thereby providing a unique index that can be included in the message Mailman puts on the wire.
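John's suggestion could be sketched roughly like this in Python; the `X-Archive-Id` header name and the simplified time.pid.host filename format are illustrative assumptions, not anything Mailman or any MTA actually does:

```python
# Hypothetical sketch: generate a maildir-style unique name and record it
# in a custom header so the message on the wire carries its own index.
import os
import socket
import time
from email.message import Message

def maildir_unique_name():
    """Simplified maildir-style unique filename: seconds.pid.hostname."""
    return "%d.%d.%s" % (int(time.time()), os.getpid(), socket.gethostname())

def tag_message(msg):
    """Stamp the unique name into the message, much like an MTA queue id."""
    name = maildir_unique_name()
    msg["X-Archive-Id"] = name  # hypothetical header name
    return name
```

The same name would then be used as the file's name in the maildir, so the stored file and the delivered message share one index key.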
jam
At 1:28 PM -0500 2003/10/29, John A. Martin wrote:
Hmm... Maildirs.
Not.
From <http://www.washington.edu/imap/documentation/formats.txt.html>:
mh: This is supported for compatibility with the past. This is the format used by the old mh program.

mh is very inefficient; the entire directory must be read and each file stat()'d, and in order to determine the size of a message, the entire file must be read and newline conversion performed.

mh is deficient in that it does not support any permanent flags or keywords; and has no means to store UIDs (because the mh "compress" command renames all the files, that's why).
[ ... deletia ... ]
The Maildir format used by qmail has all of the performance disadvantages of mh noted above, with the additional problem that the files are renamed in order to change their status, so you end up having to rescan the directory frequently to learn the current names (particularly in a shared mailbox scenario). It doesn't scale, and it represents a support nightmare;
[ ... deletia ... ]
So what does this all mean?
A database (such as used by Exchange) is really a much better
approach if you want to move away from flat files. mx and especially Cyrus take a tentative step in that direction; mx failed mostly because it didn't go anywhere near far enough. Cyrus goes much further, and scores remarkable benefits from doing so.
However, a well-designed pure database without the overhead of
separate files would do even better.
Of course, we all know about the database problems of Exchange,
and how Exchange admins have to frequently shut everything down and clean their databases, how often they crash, how often they completely trash all e-mail for all their users, etc....
I submit that the reason for this is the combination of crappy
Microsoft-style programming and the fact that no database handles BLOBs well. Even top-notch programmers have real problems with these kinds of implementations -- I am intimately familiar with the database implementation methods used in the AOL mail system, and suffice it to say that this is a really, really hairy nightmare that you do *NOT* want.
That said, storing meta-data in a real database and then using
external filesystem techniques for actually accessing the data, should give you the best of both worlds -- the speed of access of the database, and the reliability and well-understood access and backup mechanisms of filesystems.
-- Brad Knowles, <brad.knowles@skynet.be>
"They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++)>: a C++(+++)$ UMBSHI++++$ P+>++ L+ !E-(---) W+++(--) N+ !w--- O- M++ V PS++(+++) PE- Y+(++) PGP>+++ t+(+++) 5++(+++) X++(+++) R+(+++) tv+(+++) b+(++++) DI+(++++) D+(++) G+(++++) e++>++++ h--- r---(+++)* z(+++)
On Oct 29, 2003, at 10:45 AM, Brad Knowles wrote:
That said, storing meta-data in a real database and then using external filesystem techniques for actually accessing the data, should give you the best of both worlds -- the speed of access of the database, and the reliability and well-understood access and backup mechanisms of filesystems.
Hint: look at what INN did when they implemented cycbufs.
Effectively, you create 1-N files, or create files as needed. Each file is N bytes long, pre-allocated on file creation. When you store messages, they're written into the file sequentially (or any other way you want. If you want to get into best fit allocations and turn this into a malloc() style heap, be my guest).
Metadata to access the info is then a filename, and an lseek() pointer into the file, and # of bytes to read, plus your normal identifying info. It's fast, it's efficient use of file pointers, it avoids the worst aspects of the unix file system, and I'm amazed nobody ever thinks to use it for other purposes (or that it took that long for usenet people to discover it, I suggested a simpler variant of it back in the 80s and was told inodes are our friends...)
you can even do expiration/purge/etc if you want, by moving stuff around and changing the pointers.
I've even thought of using it as the backing store for a picture library. With a nice relational database and a series of these "data boxes", I think you can store data in the best and fastest possible way...
At 11:38 AM -0800 2003/10/29, Chuq Von Rospach wrote:
Hint: look at what INN did when they implemented cycbufs.
I did. See <http://www.shub-internet.org/brad/papers/dihses/>.
-- Brad Knowles, <brad.knowles@skynet.be>
"They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++)>: a C++(+++)$ UMBSHI++++$ P+>++ L+ !E-(---) W+++(--) N+ !w--- O- M++ V PS++(+++) PE- Y+(++) PGP>+++ t+(+++) 5++(+++) X++(+++) R+(+++) tv+(+++) b+(++++) DI+(++++) D+(++) G+(++++) e++>++++ h--- r---(+++)* z(+++)
On Wed, 2003-10-29 at 14:38, Chuq Von Rospach wrote:
Hint: look at what INN did when they implemented cycbufs.
Effectively, you create 1-N files, or create files as needed. Each file is N bytes long, pre-allocated on file creation. When you store messages, they're written into the file sequentially (or any other way you want. If you want to get into best fit allocations and turn this into a malloc() style heap, be my guest).
Metadata to access the info is then a filename, and an lseek() pointer into the file, and # of bytes to read, plus your normal identifying info. It's fast, it's efficient use of file pointers, it avoids the worst aspects of the unix file system, and I'm amazed nobody ever thinks to use it for other purposes (or that it took that long for usenet people to discover it, I suggested a simpler variant of it back in the 80s and was told inodes are our friends...)
I'm not sure if Andrew Koenig is on this list, but he described an algorithm he developed to quickly find messages in an mbox file. If he's here, maybe he can talk about it.
I really don't like mbox files, primarily because they require munging From lines in the body of the message. MMDF would be better, but I think ideal from a philosophical point of view would be one-message-per-file if it can be done efficiently cross-platform. Maybe file system experts here can provide pointers or advice on exactly which file and operating systems make this approach feasible, even for huge message counts.
you can even do expiration/purge/etc if you want, by moving stuff around and changing the pointers.
I've even thought of using it as the backing store for a picture library. With a nice relational database and a series of these "data boxes", I think you can store data in the best and fastest possible way...
It's a very interesting idea.
-Barry
At 10:47 PM -0500 2003/10/29, Barry Warsaw wrote:
I'm not sure if Andrew Koenig is on this list, but he described an algorithm he developed to quickly find messages in an mbox file. If he's here, maybe he can talk about it.
7th edition mbox files are a pain. There are other mailbox file
formats that are much better and easier to parse (UW-IMAP .mbx being one).
I really don't like mbox files, primarily because they require munging From lines in the body of the message. MMDF would be better, but I think ideal from a philosophical point of view would be one-message-per-file if it can be done efficiently cross-platform.
Therein lies the problem. Some filesystems make this more
feasible than others, at least on larger scale systems.
Maybe file system experts here can provide pointers or advice on exactly which file and operating systems make this approach feasible, even for huge message counts.
SGI's XFS on Irix does a pretty good job, with hashed directory structures and an extent-based journaling filesystem. Regrettably, I don't think that all of these features are fully supported under the Linux version of XFS, and that work has basically ground to a halt with the lay-offs of all the key SGI people who had been working on XFS. Veritas VxFS also does a good job in this area.
Other than SGI XFS for Irix and Veritas VxFS, I don't know of any
good solutions to this problem at the filesystem level.
Kirk McKusick and Eric Allman agree with you that this is a
proper filesystem problem that should be solved at the filesystem level (at least, that's what they've said to me when I brought this issue up to them), and they feel you should not attempt to solve filesystem problems with "tricks" like INN timecaf/timehash cycbufs.
However, while that's nice in theory, that doesn't necessarily
help us here in the real world.
-- Brad Knowles, <brad.knowles@skynet.be>
"They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++)>: a C++(+++)$ UMBSHI++++$ P+>++ L+ !E-(---) W+++(--) N+ !w--- O- M++ V PS++(+++) PE- Y+(++) PGP>+++ t+(+++) 5++(+++) X++(+++) R+(+++) tv+(+++) b+(++++) DI+(++++) D+(++) G+(++++) e++>++++ h--- r---(+++)* z(+++)
On Oct 29, 2003, at 8:00 PM, Brad Knowles wrote:
However, while that's nice in theory, that doesn't necessarily help us here in the real world.
And since Barry's underlying philosophy is to minimize the "number of things Mailman depends on", that sort of lets out depending on them having an OS with a high-performance journaling filesystem, no? (giggle)
On Thu, Oct 30, 2003 at 05:00:48AM +0100, Brad Knowles wrote:
SGI's XFS on Irix does a pretty good job, with hashed directory structures and an extent-based journaling filesystem. Regrettably, I don't think that all of these features are fully supported under the Linux version of XFS, and that work has basically ground to a halt with the lay-offs of all the key SGI people who had been working on XFS. Veritas VxFS also does a good job in this area.
[ A cursory google search indicates that hashed dirs, extents, and journalling are all in linux xfs. I can't imagine an unsupported feature making its way into the filesystem that SGI is putting on its latest and greatest systems, but if you know about this, please share ]
In the case of a one-file-per-message approach, my experience with vxfs is that it creates a rather slow filesystem when you get to the point of having a few hundred thousand small files (lots of wasted space in the extents and, I believe, though I may be wrong, lots of metadata lookups through multiple layers of indirection slowing things down).
However, reiserfs was built to handle a mix of lots of small files, à la maildir or mh spools.
I'm not too current on BSD goings-on, but I'd bet that ffs2 has something to offer in this arena, too, since it looks like it almost does extent-based allocation now.
Kirk McKusick and Eric Allman agree with you that this is a proper filesystem problem that should be solved at the filesystem level (at least, that's what they've said to me when I brought this issue up to them), and they feel you should not attempt to solve filesystem problems with "tricks" like INN timecaf/timehash cycbufs.
Err... then to relate this to a prior post, why not just use maildirs on filesystems that are engineered to handle that sort of thing?
However, while that's nice in theory, that doesn't necessarily help us here in the real world.
Unless you are using a filesystem that works for this, right? Like xfs, vxfs, reiserfs, and probably ffs2. I believe that linux's ext3 has support for hashing directories (or soon will - I don't precisely know as I've been focusing on other things)
-Peter
-- The 5 year plan: In five years we'll make up another plan. Or just re-use this one.
On Oct 29, 2003, at 8:35 PM, Peter C. Norton wrote:
I'm not too current on BSD goings-on, but I'd bet that ffs2 has something to offer in this arena, too, since it looks like it almost does extent-based allocation now.
And windows? And older hardware? Solaris 8? Hell, solaris 6 and 7?
You going to depend on people only running year-old-or-less hardware and OS?
At 8:35 PM -0800 2003/10/29, Peter C. Norton wrote:
[ A cursory google search indicates that hashed dirs, extents, and journalling are all in linux xfs. I can't imagine an unsupported feature making its way into the filesystem that SGI is putting on its latest and greatest systems, but if you know about this, please share ]
My understanding is that the port of XFS to Linux was only about
70% done at the time the critical software engineers were laid off by SGI, and that no further work in this area has been done. Maybe the features are supposedly there but incomplete.
However, reiserfs was built to handle a mix of lots of small files, à la maildir or mh spools.
I'm sorry, I don't trust ReiserFS at all. I'd trust XFS if it were on Irix, or IBM's JFS, but not ReiserFS. Hell, on a Linux system, I'd use ext2fs before I'd use Reiser.
I'm not too current on BSD goings-on, but I'd bet that ffs2 has something to offer in this arena, too, since it looks like it almost does extent-based allocation now.
No, not yet. There are improvements in the areas of handling
synchronous meta-data updates, background fsck, etc... but nothing like extent-based filesystems or integrated hashed directory schemes, etc....
Err... then to relate this to a prior post, why not just use maildirs on filesystems that are engineered to handle that sort of thing?
Because we can't guarantee that everyone (or anyone) would be
willing/able to use the selected filesystems that we have blessed? You think requiring everyone to install PostgreSQL would be bad, do you really want to try to force them all to use ReiserFS on Linux as their only supported option?
Unless you are using a filesystem that works for this, right? Like xfs, vxfs, reiserfs, and probably ffs2. I believe that linux's ext3 has support for hashing directories (or soon will - I don't precisely know as I've been focusing on other things)
My understanding is that ext3fs is dead. The work that Stephen
Tweedie had been doing stopped long ago, and even then it was only a minor tweak over ext2fs. I don't believe that this work has been picked up again or extended to include other features.
-- Brad Knowles, <brad.knowles@skynet.be>
"They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++)>: a C++(+++)$ UMBSHI++++$ P+>++ L+ !E-(---) W+++(--) N+ !w--- O- M++ V PS++(+++) PE- Y+(++) PGP>+++ t+(+++) 5++(+++) X++(+++) R+(+++) tv+(+++) b+(++++) DI+(++++) D+(++) G+(++++) e++>++++ h--- r---(+++)* z(+++)
On Wed, Oct 29, 2003 at 07:45:53PM +0100, Brad Knowles wrote:
At 1:28 PM -0500 2003/10/29, John A. Martin wrote:
Hmm... Maildirs.
Not.
From <http://www.washington.edu/imap/documentation/formats.txt.html>:
[deletia]
I don't know why a reasonable person would cite documentation pertaining to UW-IMAP, a server that has been a standards, security and performance bummer.
Why not cite http://www.courier-mta.org/mbox-vs-maildir/?
<quote>
Painting "just about" every filesystem in existence with the same brush, and assuming that every filesystem works pretty much in the same way, is very misleading. Many contemporary high performance filesystem are designed explicitly for parallel access. For example, consider the SGI XFS filesystem:
The free space and inodes within each AG are managed independently
and in parallel so multiple processes can allocate free space
throughout the file system simultaneously.[2]
It took me about 6 months to write the first revision of the maildir-based Courier-IMAP server. The absence of maildir support in the UW-IMAP server is the reason I wrote it. Many people have found that it needed less memory, and was faster than UW-IMAP. Many people observed that upgrading to Courier-IMAP lowered their overall system load, and increased performance. Large mail clusters with a network-based, fault-tolerant, scalable architecture frequently have problems deploying mbox-based mailboxes, due to many documented problems with file locking (file locking is required for mbox-based mailboxes) with network-based filesystems.[3] As referenced in [3], maildirs have no issues with NFS (the most common type of a network-based filesystem) since maildirs do not use locking.
After looking around for some time, I did not find any independent benchmarks that directly measured the relative performance of mboxes and maildirs. Therefore I decided to run some actual benchmarks myself. I defined the test conditions according to UW-IMAP server's documentation. I created a test environment that stacked the deck in favor of mboxes. This was done in accordance with the claimed shortcomings of maildirs as stated in UW-IMAP server's documentation, in order to accurately measure the magnitude of the claimed problems.
</quote>
and at the end:
<quote>
The final conclusion is that -- except in some specific instances -- using maildirs will be just as fast -- and in sometimes much faster -- than mbox files, while placing less of a load on the rest of the mail system. The claims in the UW-IMAP server's documentation regarding maildir performance can be supported only in certain, specific, very narrowly-defined conditions. There is no simple answer on which mail storage format is better. A lot depends on many variables that vary widely in different situations. Besides the raw benchmarks shown above, other factors include the mail server software being used, what kind of storage is being used, and the available network bandwidth. The final answer depends on all of the above.
</quote>
[flame-bait deleted]
A database (such as used by Exchange) is really a much better
approach if you want to move away from flat files. mx and especially Cyrus take a tentative step in that direction; mx failed mostly because it didn't go anywhere near far enough. Cyrus goes much further, and scores remarkable benefits from doing so.
However, a well-designed pure database without the overhead of
separate files would do even better.
It always confounds me that people will go for database voodoo and deride filesystems when a filesystem is a highly specialised database in and of itself. Putting things that are in a filesystem into a database offers the power and flexibility of querying, but certainly should not be done for the sake of speed (assuming the filesystem-based implementation meets whatever other requirements are present).
Of course, we all know about the database problems of Exchange, and how Exchange admins have to frequently shut everything down and clean their databases, how often they crash, how often they completely trash all e-mail for all their users, etc....
Which is a good lesson about databases: because of their flexibility, they cannot be QA'd to cope with all of their uses without being put into production, losing data, and being subsequently fixed. Filesystems, which have a more narrowly-defined scope, tend to suffer this less. That's why database logs that live on filesystems are used for data recovery when a database eats itself.
I submit that the reason for this is the combination of crappy Microsoft-style programming and the fact that no database handles BLOBs well. Even top-notch programmers have real problems with these kinds of implementations -- I am intimately familiar with the database implementation methods used in the AOL mail system, and suffice it to say that this is a really, really hairy nightmare that you do *NOT* want.
Databases aren't meant to be storage for abstract binary data. They're meant to be a searchable index of data of types they understand.
Assuming I had a clean slate to start a database project for a mail store, personally I'd much rather prototype it in something like postgresql where I could add data types to deal with email. I could then make header types, text types, MIME type classes, etc. Then I could test to see if it was a good idea to implement it.
That said, storing meta-data in a real database and then using external filesystem techniques for actually accessing the data, should give you the best of both worlds -- the speed of access of the database, and the reliability and well-understood access and backup mechanisms of filesystems.
I think using a standard sql database for doing mail operations is asking for trouble. Standard databases don't know how to parse rfc822/2822 headers and that means that you've got to either write a whole lot of stored procedures in a clunky query language (or java!?!?!) and then maintain it, or you've got to do it all in the imap/pop3/whatever server which means a whole lot of yammering traffic between the database and the I/P/W server all the time, which == slow.
-Peter
At 11:54 AM -0800 2003/10/29, Peter C. Norton wrote:
It always confounds me that people will go for database voodoo and deride filesystems when a filesystem is a highly specialised database in and of itself.
I am aware of that. I was aware of that when I first gave my
invited talk entitled "Design and Implementation of Highly Scalable E-mail Systems", which you can find at <http://www.shub-internet.org/brad/papers/dihses/>.
Note that Eric Allman (author of the original Ingres database,
among many other things) and Kirk McKusick (author of the Berkeley Fast File System) were in the audience. I did not embarrass myself.
Databases aren't meant to be storage for abstract binary data. They're meant to be a searchable index of data of types they understand.
Correct. And despite all claims to the contrary from the
vendors, no database properly "understands" binary large objects, nor do they give you another datatype they do actually understand that would be suitable for the storage of e-mail message bodies.
Assuming I had a clean slate to start a database project for a mail store, personally I'd much rather prototype it in something like postgresql where I could add data types to deal with email. I could then make header types, text types, MIME type classes, etc. Then I could test to see if it was a good idea to implement it.
IMO, that would be an exercise in futility. We've been down this
road a million times before. We don't need to go down it again to know that the result is not likely to be successful, especially when we have alternatives that are proven to work well -- we store the message meta-data in the database, and then the message bodies in a separate message store akin to INN timecaf/timehash "heaps" (see <http://www.shub-internet.org/brad/papers/dihses/lisa2000/sld090.htm>).
I think using a standard sql database for doing mail operations is asking for trouble. Standard databases don't know how to parse rfc822/2822 headers and that means that you've got to either write a whole lot of stored procedures in a clunky query language (or java!?!?!) and then maintain it, or you've got to do it all in the imap/pop3/whatever server which means a whole lot of yammering traffic between the database and the I/P/W server all the time, which == slow.
You don't ask the database to understand or parse RFC2822 headers
or messages. That's up to your application. You just store data using the formats known to the database, and the message bodies according to the methods above.
-- Brad Knowles, <brad.knowles@skynet.be>
"They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++)>: a C++(+++)$ UMBSHI++++$ P+>++ L+ !E-(---) W+++(--) N+ !w--- O- M++ V PS++(+++) PE- Y+(++) PGP>+++ t+(+++) 5++(+++) X++(+++) R+(+++) tv+(+++) b+(++++) DI+(++++) D+(++) G+(++++) e++>++++ h--- r---(+++)* z(+++)
On Wed, Oct 29, 2003 at 09:25:53PM +0100, Brad Knowles wrote:
Assuming I had a clean slate to start a database project for a mail store, personally I'd much rather prototype it in something like postgresql where I could add data types to deal with email. I could then make header types, text types, MIME type classes, etc. Then I could test to see if it was a good idea to implement it.
IMO, that would be an exercise in futility. We've been down this road a million times before. We don't need to go down it again to know that the result is not likely to be successful, especially when we have alternatives that are proven to work well -- we store the message meta-data in the database, and then the message bodies in a separate message store akin to INN timecaf/timehash "heaps" (see <http://www.shub-internet.org/brad/papers/dihses/lisa2000/sld090.htm>).
It seems like you're only partially agreeing/disagreeing with me (optimist/pessimist). Disagreeing: you're saying that using datatypes in the database which are appropriate to the kind of data being stored (mail messages) is an exercise in futility. But, agreeing: that storing these in a database in another way is OK. I don't get why you'd just want to store these as text when you have databases that can be made more suitable to the problem.
I think using a standard sql database for doing mail operations is asking for trouble. Standard databases don't know how to parse rfc822/2822 headers and that means that you've got to either write a whole lot of stored procedures in a clunky query language (or java!?!?!) and then maintain it, or you've got to do it all in the imap/pop3/whatever server which means a whole lot of yammering traffic between the database and the I/P/W server all the time, which == slow.
You don't ask the database to understand or parse RFC2822 headers or messages. That's up to your application. You just store data using the formats known to the database, and the message bodies according to the methods above.
So all the parsing happens in the database client side. Which is slow.
-Peter
At 12:37 PM -0800 2003/10/29, Peter C. Norton wrote:
It seems like you're only partially agreeing/disagreeing with me (optimist/pessimist). Disagreeing: you're saying that using datatypes in the database which are appropriate to the kind of data being stored (mail messages) is an exercise in futility.
Not quite. I believe that there are no databases in existence
which have data types that are actually appropriate for the storage of message bodies.
But, agreeing: that
storing these in a database in another way is OK.
Not quite. Store meta-data, yes. The entire message, no.
Store things like who the message is from, who the message is
addressed to, the date, the message-id as it was found in the headers, etc.... Basically, store just about everything in the message headers that a client would be likely to ask about. That's all well and good.
But when it comes to storing the message body itself, it should
be stored in wire format (i.e., precisely as it came in), in the filesystem. Then pointers to the location in the filesystem should be put into the database.
One key factor here is that all of the information in the
database should be able to be re-created from the message bodies alone, if there should happen to be a catastrophic system crash.
The sole purpose of the database is to speed up access to the
messages and the message content -- indeed, to speed it up enough so that randomly accessing most any piece of information about any message from any sender to any recipient in any mailbox should become something feasible to contemplate.
The sole purpose of the database is to make the difficult and
slow (on the large scale) quick and easy, and to make the things that would be totally impossible (on any reasonable scale) at least something that can now be considered.
I don't get why
you'd just want to store these as text when you have databases that can be made more suitable to the problem.
I don't believe that there are any databases in existence that
"... can be made more suitable to the problem."
So all the parsing happens in the database client side. Which is slow.
Yup. I don't see any way around that.
-- Brad Knowles, <brad.knowles@skynet.be>
"They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++)>: a C++(+++)$ UMBSHI++++$ P+>++ L+ !E-(---) W+++(--) N+ !w--- O- M++ V PS++(+++) PE- Y+(++) PGP>+++ t+(+++) 5++(+++) X++(+++) R+(+++) tv+(+++) b+(++++) DI+(++++) D+(++) G+(++++) e++>++++ h--- r---(+++)* z(+++)
On Wed, Oct 29, 2003 at 10:14:52PM +0100, Brad Knowles wrote:
I don't believe that there are any databases in existence that "... can be made more suitable to the problem."
In theory you can add data types to postgresql. Not that I've done it myself, but it's been done.
-Peter
On Wed, 2003-10-29 at 16:59, Peter C. Norton wrote:
In theory you can add data types to postgresql. Not that I've done it myself, but it's been done.
I wouldn't want to build a system that required PostgreSQL. Maybe we can hide all the gore behind an interface, maybe not.
After all, we're using Python here (that's not going to change) so speed is all relative. Let's analyze what we're going to use this message store for too. We should be able to come up with a fast-enough solution for most sites. Really huge sites are probably going to cook their own dog food and won't even look at Mailman. There should be enough flexibility in the framework to allow the sites in the middle to scale Mailman up with some extra effort.
Note: I don't think the message store gets in the picture for the qrunners. We can probably improve performance here, but that's an entirely different problem than the long-term message store we're talking about.
-Barry
On Wed, 2003-10-29 at 16:14, Brad Knowles wrote:
One key factor here is that all of the information in the database should be able to be re-created from the message bodies alone, if there should happen to be a catastrophic system crash.
Just to be dense, let me ask for clarification: by "message body" you mean the entire original message, as received on the wire, not just the message payload (i.e. sans RFC 2822 headers). If so, I agree completely.
But I also think the decoded message should be stored on the file system somehow as well. I.e., decode attachments and store them as separate files too.
-Barry
At 11:18 PM -0500 2003/10/29, Barry Warsaw wrote:
Just to be dense, let me ask for clarification: by "message body" you mean the entire original message, as received on the wire, not just the message payload (i.e. sans RFC 2822 headers). If so, I agree completely.
Yes, you are correct. At issue is that there might be some
headers which some users might wish to search on (or maybe just see) which might not be put into one or more of the fields, and you don't want to take the risk of losing those by assuming that you can always re-generate all the headers from what you've stored inside the database.
But I also think the decoded message should be stored on the file system somehow as well. I.e., decode attachments and store them as separate files too.
My experience is that this is a bad idea. However, if the
implementation is fully modularized at the API level, then we can always rip out the mailman solution and instead put in something that actually works.
-- Brad Knowles, <brad.knowles@skynet.be>
"They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++)>: a C++(+++)$ UMBSHI++++$ P+>++ L+ !E-(---) W+++(--) N+ !w--- O- M++ V PS++(+++) PE- Y+(++) PGP>+++ t+(+++) 5++(+++) X++(+++) R+(+++) tv+(+++) b+(++++) DI+(++++) D+(++) G+(++++) e++>++++ h--- r---(+++)* z(+++)
On Wed, 2003-10-29 at 13:45, Brad Knowles wrote:
That said, storing meta-data in a real database and then using external filesystem techniques for actually accessing the data, should give you the best of both worlds -- the speed of access of the database, and the reliability and well-understood access and backup mechanisms of filesystems.
I'm strongly in favor of this kind of approach. I don't know what the best on-disk storage format is (although cycbuf sounds interesting), but I'm pretty sure we want the raw messages stored as plain files on the file system.
We may even want both the encoded and decoded messages stored on the file system -- at the very least, we should have attachments decoded and stored in separate files. Then we want metadata about the messages stored in a database. We should be able to regenerate or update the metadata by trolling over the raw message storage, and we should be able to vend messages from the message store via any number of protocols.
The message store should be a central component of Mailman, but it should be defined by an interface in case we decide to change the implementation of the message store.
-Barry
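The layered design Barry sketches (raw messages as flat files, metadata in a database, with the database always regenerable from the files) could look something like the following. This is a minimal sketch, not Mailman code; the class name, schema, and file layout are all assumptions.

```python
import email
import hashlib
import os
import sqlite3

class MessageStore:
    """Sketch: raw RFC 2822 messages as files on disk, metadata in
    SQLite.  The database can always be rebuilt from the files alone."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)
        self.db = sqlite3.connect(os.path.join(root, "metadata.db"))
        self.db.execute("CREATE TABLE IF NOT EXISTS messages"
                        " (digest TEXT PRIMARY KEY, msgid TEXT,"
                        "  subject TEXT, author TEXT, date TEXT)")

    def add(self, raw_bytes):
        # The file name is derived from the message itself, so the
        # same message always lands in the same place.
        digest = hashlib.sha1(raw_bytes).hexdigest()
        with open(os.path.join(self.root, digest + ".eml"), "wb") as f:
            f.write(raw_bytes)
        self._index(digest, raw_bytes)
        return digest

    def _index(self, digest, raw_bytes):
        msg = email.message_from_bytes(raw_bytes)
        self.db.execute("INSERT OR REPLACE INTO messages VALUES (?,?,?,?,?)",
                        (digest, msg["Message-ID"], msg["Subject"],
                         msg["From"], msg["Date"]))
        self.db.commit()

    def rebuild(self):
        # "Trolling over the raw message storage" to regenerate metadata.
        self.db.execute("DELETE FROM messages")
        for name in os.listdir(self.root):
            if name.endswith(".eml"):
                with open(os.path.join(self.root, name), "rb") as f:
                    self._index(name[:-4], f.read())
```

Because `rebuild()` only reads the `.eml` files, the metadata layer can be dropped and re-derived after a crash, which is exactly the property Brad asked for.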
At 10:43 PM -0500 2003/10/29, Barry Warsaw wrote:
We may even want both the encoded and decoded messages stored on the file system -- at the very least, we should have attachments decoded and stored in separate files.
I'm not at all sure that you want to go down this route. One of the biggest headaches within the AOL mail system is the handling of attachment storage, what happens when the attachments get out of sync, etc. Same with Eudora as a local MUA.
I think I'd be inclined to store the message in wire format just the once, and deal with transformation on the fly. At least you would never have to worry about the attachments getting out of sync with the message bodies, or someone getting attachments they weren't supposed to see, or not getting attachments they should have, etc.
-- Brad Knowles, <brad.knowles@skynet.be>
"They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++)>: a C++(+++)$ UMBSHI++++$ P+>++ L+ !E-(---) W+++(--) N+ !w--- O- M++ V PS++(+++) PE- Y+(++) PGP>+++ t+(+++) 5++(+++) X++(+++) R+(+++) tv+(+++) b+(++++) DI+(++++) D+(++) G+(++++) e++>++++ h--- r---(+++)* z(+++)
On Mon, Oct 27, 2003 at 12:00:57PM +0000, Iain Bapty wrote:
Which of the functional requirements, 6 to 13, do you feel are the most important?
I think they are all quite basic, but personally 10 and 12 are 'should-haves' in my opinion.
Also have a look at the "SMART Archiver" project, http://sourceforge.net/projects/smartarchiver/
A replacement for the standard GNU Mailman archiver that supports attachments, searching, date selection, message editing and more; requires a database such as PostgreSQL (MySQL support is coming in a future version).
... Is there anything obvious that I have missed?
Other requirements you might consider:
db support for Zope, http://zope.org
support for MIME attachments (PDF, Word, etc.)
to be able to fix threading issues through the web (people starting a new subject by replying to a previous post; fixing threading for mail readers that don't support proper "In-Reply-To" threading).
Unix mbox output (based on the db). That would make it easy to upgrade, or to change to a different archiver.
support for a 'view complete thread' (this would really be nice!)
Python-based, because that would make it fit better with Mailman and make it easier to install
easy to use api for adding new messages (e.g. use it as an archiver for wiki discussions, such as http://zwiki.org/GeneralDiscussion )
be able to override message-view class (e.g. so that wiki's can add wiki linking or similar features to the messages)
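The In-Reply-To/References threading PieterB mentions could be sketched roughly as below. This is a simplification for illustration, not the full jwz threading algorithm, and the function name is my own.

```python
import email

def build_threads(messages):
    """Sketch of reply threading: attach each message to the parent
    named by its In-Reply-To header, falling back to the last entry
    in References.  Messages with no known parent become thread roots."""
    by_id = {m["Message-ID"]: m for m in messages if m["Message-ID"]}
    roots, children = [], {}
    for m in messages:
        parent = m["In-Reply-To"]
        if not parent and m["References"]:
            # References lists ancestors oldest-first; take the nearest.
            parent = m["References"].split()[-1]
        if parent in by_id and by_id[parent] is not m:
            children.setdefault(parent, []).append(m)
        else:
            roots.append(m)
    return roots, children
```

"Fixing threading through the web" would then amount to letting an admin override the computed parent for a given message, e.g. by storing a corrected parent id in the archiver's database.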
Regards,
PieterB
-- If your next pot of chili tastes better, it probably is because of something left out, rather than added.
On Mon, 2003-10-27 at 17:08, PieterB wrote:
Also have a look at the "SMART Archiver" project, http://sourceforge.net/projects/smartarchiver/
A replacement for the standard GNU Mailman archiver that supports attachments, searching, date selection, message editing and more; requires a database such as PostgreSQL (MySQL support is coming in a future version).
I didn't know about that one!
FWIW, I think all this competition in replacement archives is a good thing. What I really want though, is a standard interface/API/protocol between Mailman and the archives. Here's why:
When Mailman decorates a message for copying to the list, I want to be able to include a link to the archived message in the footer. The problem is that there is little or no connection between the process doing the decoration and the process doing the archiving, and in fact the message may be posted to the list long before the archiver gets a crack at it.
So I don't want to have to ask the archiver for that url. I want Mailman to be able to calculate it from something unique in the message, and have the archiver agree on the algorithm, so that it (or some other translation layer) can do the mapping back to the archived article. Or, Mailman should be able to calculate a unique id for the article and stuff that in a header for the archiver to index on.
-Barry
On Mon, Oct 27, 2003 at 05:28:50PM -0500, Barry Warsaw wrote:
Also have a look at the "SMART Archiver" project, http://sourceforge.net/projects/smartarchiver/ I didn't know about that one!
It's a similar university project at the Eindhoven University of Technology. The project has just been finished and I assume all sources are or will be available. I saw the author upload the code to sf.net, and our host gewis.nl will probably host a demo environment in a couple of weeks.
About coupling the archiver/mailinglist:
So I don't want to have to ask the archiver for that url. I want Mailman to be able to calculate it from something unique in the message, and have the archiver agree on the algorithm, so that it (or some other translation layer) can do the mapping back to the archived article. Or, Mailman should be able to calculate a unique id for the article and stuff that in a header for the archiver to index on.
Zwiki has implemented such functionality based on the time that the message is received/sent. E.g. a mailout for a web post at http://zwiki.org/GeneralDiscussion looks like this in the e-mail (look at the generated signature, with a hyperlink to the message anchor):
There is a discussion on the mailman-developers list on the requirements of an archiver: See: http://news.gmane.org/gmane.mail.mailman.devel or my post at: http://article.gmane.org/gmane.mail.mailman.devel/14954
forwarded from http://zwiki.org/GeneralDiscussion#msg20031027142214-0800@zwiki.org
Of course, in this case the msgid doesn't have to be shared between the archiver and the mailing list, because Zwiki does both in one application.
Regards,
Pieter
cc: mailman-developers lists, zwiki GeneralDiscussion
-- When a broken appliance is demonstrated for the repairman, it will work perfectly.
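The time-based anchor Pieter describes could be derived from the message's Date header along these lines. The exact Zwiki format is an assumption reconstructed from the single example URL in the forwarded link (msg<YYYYMMDDHHMMSS><zone>@<domain>), and the function name is my own.

```python
from email.utils import parsedate_to_datetime

def zwiki_style_anchor(date_header, domain="zwiki.org"):
    """Sketch: build a Zwiki-like message anchor from an RFC 2822
    Date header, e.g. msg20031027142214-0800@zwiki.org."""
    dt = parsedate_to_datetime(date_header)
    # %z keeps the original timezone offset, so both the sender and
    # the archiver compute the same anchor from the same header.
    return "msg%s@%s" % (dt.strftime("%Y%m%d%H%M%S%z"), domain)
```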
On Mon, 2003-10-27 at 17:46, PieterB wrote:
It's a similar university project at the Eindhoven University of Technology. The project has just been finished and I assume all sources are or will be available. I saw the author upload the code to sf.net, and our host gewis.nl will probably host a demo environment in a couple of weeks.
Cool!
About coupling the archiver/mailinglist:
So I don't want to have to ask the archiver for that url. I want Mailman to be able to calculate it from something unique in the message, and have the archiver agree on the algorithm, so that it (or some other translation layer) can do the mapping back to the archived article. Or, Mailman should be able to calculate a unique id for the article and stuff that in a header for the archiver to index on.
Zwiki has implemented such functionality based on the time that the message is received/sent. E.g. a mailout for a web post at http://zwiki.org/GeneralDiscussion looks like this in the e-mail (look at the generated signature, with a hyperlink to the message anchor):
There is a discussion on the mailman-developers list on the requirements of an archiver: See: http://news.gmane.org/gmane.mail.mailman.devel or my post at: http://article.gmane.org/gmane.mail.mailman.devel/14954
forwarded from http://zwiki.org/GeneralDiscussion#msg20031027142214-0800@zwiki.org
Of course, in this case the msgid doesn't have to be shared between the archiver and the mailing list, because Zwiki does both in one application.
That's not bad (probably better than the sha hexdigest I'm usually so fond of :), but yep we need to agree on a way to pass that information to the archiver. Mailman does add a unique header specifying the time of arrival, but I suggest a special X- header that Mailman can insert and the archiver can read.
Anybody know of any prior art here?
-Barry
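Barry's two ingredients (a sha hexdigest of something unique, carried in a special X- header) could be combined as in the sketch below. The X-Archive-ID header name is an assumption, not an existing Mailman feature; the point is only that both the decorating side and the archiver run the same deterministic function and therefore agree on the id.

```python
import email
import hashlib

def stamp_archive_id(msg):
    """Sketch: compute a stable article id (a SHA-1 hexdigest of the
    Message-ID) and record it in a custom header the archiver can
    index on.  Header name X-Archive-ID is hypothetical."""
    unique = msg["Message-ID"] or msg.as_string()
    digest = hashlib.sha1(unique.encode("utf-8")).hexdigest()
    del msg["X-Archive-ID"]          # replace any earlier value
    msg["X-Archive-ID"] = digest
    return digest
```

Mailman could then build the footer URL from the returned digest before the post ever reaches the archiver.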
Barry Warsaw wrote:
trim
FWIW, I think all this competition in replacement archives is a good thing. What I really want though, is a standard interface/API/protocol between Mailman and the archives. Here's why:
trim
Yes!!! An API lets us have choices.
What does it take to get the standard interface/API/protocol?
Gary
PieterB wrote:
I think they are all quite basic. I think personally 10 and 12 are 'should-haves' in my opinion.
I may have underestimated the time it will take me to implement them, but I am quite inexperienced in projects of this nature. The closest I have done is some ASP.NET and VB.NET front ends to MS SQL databases as part of my summer job. If I get to the stage where I can implement more than those requirements and I have the time, then I may.
Also have a look at the "SMART Archiver" project, http://sourceforge.net/projects/smartarchiver/
A replacement for the standard GNU Mailman archiver that supports attachments, searching, date selection, message editing and more; requires a database such as PostgreSQL (MySQL support is coming in a future version).
I will evaluate this as part of my candidate re-use components analysis in my report. I was hoping that I would be the first to actually put together a new archiver for Mailman, oh well.
Other requirements you might consider:
- db support for Zope, http://zope.org
I may do this depending on my design decisions.
- Unix mbox output (based on the db). That would make it easy to upgrade, or to change to a different archiver.
If I achieve my non-functional requirements, this should be fairly straightforward to implement.
- support for a 'view complete thread' (this would really be nice!)
I'm not sure I understand what you mean by this. The type of interface I am aiming for can be seen in Ka-Ping Yee's Zest prototype at http://www.zesty.ca/zest
- python based, because that would make it fit better with mailman and will make it easier to install
Definitely.
- easy to use api for adding new messages (e.g. use it as an archiver for wiki discussions, such as http://zwiki.org/GeneralDiscussion )
From what Barry Warsaw has told me, Mailman supports external archivers that provide a command-line client. Not exactly an API, but if I choose to use this then it would be fairly straightforward to adapt the archiver for other uses.
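An external archiver driven that way just reads one complete message from standard input per invocation. A minimal sketch of such a command-line client follows; the file-naming scheme and the hook details are assumptions for illustration, not Mailman's actual behaviour.

```python
import email
import re
import sys

def archive_filename(raw_message):
    """Sketch: derive a flat-file name for one post from its
    Message-ID, sanitized for the filesystem.  The naming scheme
    is an assumption."""
    msg = email.message_from_string(raw_message)
    msgid = (msg["Message-ID"] or "no-id").strip("<>")
    return re.sub(r"[^A-Za-z0-9.@-]", "_", msgid) + ".eml"

if __name__ == "__main__":
    # Mailman pipes the post to the command's standard input.
    sys.stdout.write(archive_filename(sys.stdin.read()) + "\n")
```

Because the interface is just "message on stdin", the same script could archive wiki discussions or anything else that can pipe it mail-shaped text.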
Thanks a lot for your feedback and the SMART Archiver link.
Iain
Hi,
I should add
- be MIME aware.
8'. be I18N.
Cheers,
-- Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp http://weather.is.kochi-u.ac.jp/
"Iain" == Iain Bapty "[Mailman-Developers] Requirements for a new archiver" Mon, 27 Oct 2003 12:00:57 +0000
Iain> Functional Requirements The archive component should
Iain> 1. store email discussions.
Iain> 2. integrate with Mailman.
Iain> 3. provide a web-based interface to those email-discussions.
Iain> 4. provide an interface that threads discussions by their
Iain> content. (ZEST)
Iain> 5. provide an interface that threads discussions by e-mail
Iain> replies.
Iain> 6. allow for full-text searching of the archives.
Iain> 7. allow for filtering by date, author, and/or topic.
Iain> 8. be MIME aware.
Iain> 9. allow archives to be set as public or private.
Iain> 10. allow posts to be added, deleted, and modified through
Iain> web interface.
Iain> 11. allow archives to be locked to prevent modification.
Iain> 12. allow postings to be emailed.
Iain> 13. allow postings to be referenced externally.
Iain> There are two reasons I am posting this.
Iain> Is there anything obvious that I have missed?
I hope 13 means that specific (list, range) messages can be retrieved from the archive by mail like Smartlist.
5 might want to include or allow choice to thread by subject or references as well like Gnus.
Most importantly, 11 (locking, with site admin overriding virtual-domain admin overriding list owner) must IMHO be a prerequisite to allowing anybody to rewrite history (item 10).
Iain> Which of the functional requirements, 6 to 13, do you feel
Iain> are the most important? (As part of my report I have to
Iain> analyse the requirements captured)
13 (like Smartlist), 9, 6, 7, ..., (11 before 10)
HTH
jam
participants (14)
- amk@amk.ca
- Barry Warsaw
- Brad Knowles
- Chuq Von Rospach
- Dale Newfield
- David Champion
- Gary Frederick
- Iain Bapty
- John A. Martin
- Kevin McCann
- Peter C. Norton
- PieterB
- Rafael Cordones Marcos
- Tokio Kikuchi