Speaking about kitties (or archivers)

Meeow miaou
We spoke on IRC about the archiver the other day and I said that I should present my thoughts about it here. So here they are (beware, this might be long).
First, I think we should consider the structure/architecture of things. We have a number of components which need to be archive-aware; without being exhaustive, I'm thinking about:
- the archiver itself (which presents the archive, i.e. mails and threads)
- the NNTP bits, which should be able to return emails and/or threads
- the stats module, which wants to give the user information about the health of the list itself (emails/month, last threads, biggest threads...)
- archives retrieval (we probably want to give the user a way to download the archives since the creation of the list/for the last year/month)
All of these components need to be aware of the archives. We agreed that the core does not want to know about them.
So we have several solutions:
- each module becomes an "archiver" wrt the core, meaning each module has its own way of storing the archives (and possibly its own system to do so)
- we create an archive-core module which manages the archives and provides an API to access, modify, and extend them.
Of course, we prefer the second solution :) So we would have the following architecture:
mm-core (handles the lists themselves) --send emails to archivers--> archive-core (stores the emails and exposes them through an API) --> archivers/stats/NNTP
The questions are then:
- how do we store the emails?
- how do we expose the API?
- how do we make it easy to extend? (e.g. the stats module wants to read the db, but probably also to store information in it)
Having played with mongodb (HK relies on it atm), I quite like the possibilities it gives us. We can easily store the emails in it and query them, and since it is a NoSQL database system, extending it is also easy. On the other hand, having the archiver-core relying on the same system as the core itself would be nicer from a sysadmin pov. I have not tried to upload archives to an RDBMS and test its speed, but for mongodb the results of the tests are presented at [1].
The challenge will be speed and designing an API which allows each component to do its work. I think it would be nice if we could reach some kind of agreement before the GSoC starts (even if we change our mind later on), to be sure that if we get students their work doesn't overlap too much.
The second point I want to present is with respect to the archiver itself. At the moment we have HyperKitty (HK); the current version:
- exposes single emails
- exposes single threads
- presents the archives for one month or day
- allows searching the archives by sender, subject, content, or subject and content
- presents a summary of the recent activities on the list (including the evolution of the number of posts sent over the last month)
I think this is the basic functionality that we would like to see in an archiver. But HK aims at much more: the ultimate goal of HK is to provide a "forum-like" interface to the mailing lists. With it, HK would provide a number of social-web-like options allowing users to "like" or "dislike" a post or a thread, to "+1" someone, and to tag mails or assign them categories. These are all nice features but, imho, they go beyond what one would want from a basic archiver.
So what I would like to propose is to split HK into a sub-project (MiniKitty?) which would provide this basic functionality.
We would keep HyperKitty as a more extensive archiver and try to bring HK to its ultimate goal. This will need some more work and time, as we will have to make HK speak with the core for authentication, find a way to send emails to the core/the lists, and of course add all the other features (tags, categories...)
Comments welcome :)
Thanks, Pierre
[1] http://blog.pingoured.fr/index.php?post/2012/03/16/Mailman-archives-and-mong...
Hi everyone,
Thanks for posting this Pierre-Yves!
On Apr 23, 2012, at 08:17 PM, Pierre-Yves Chibon wrote:
> mm-core (handles the lists themselves) --send emails to archivers-->

Note that the core doesn't *have* to send an email to the archiver. From the core's perspective, the IArchiver interface has three functions:
- add a message to the archive
- get a 'permalink' to the message in the archive
- get the url to the "top" of the list's archive
The important things are 1) calculating the 'permalink' should not require a round-trip with the archiver; 2) the details of adding a message to the archiver are irrelevant to the core.
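As a rough illustration, those three functions could be expressed like this with zope.interface (a sketch only: the method names are assumptions based on this description, not necessarily the exact IArchiver signatures):

    from zope.interface import Interface

    class IArchiver(Interface):
        """Sketch of the three-function archiver interface described above."""

        def archive_message(mlist, message):
            """Add the message to this archiver's archive."""

        def permalink(mlist, message):
            """Return a stable URL for the message, computed without a
            round-trip to the archiver, or None if unsupported."""

        def list_url(mlist):
            """Return the URL of the 'top' of the list's archive."""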
For external archivers, such as M-A or Gmane, the implementation of IArchiver may indeed send an email. For a local archiver like MHonArc, the implementation just shells out to a command. For HK or anything else, it could be anything. Every archiver needs a way to get messages sent to it, and the core can adapt to any of those.
Sharing is good, but it's also important to remember that any specific system may or may not have a local archiver. I could certainly imagine a site that only archives on M-A or Gmane and doesn't waste the space to archive locally.
I think we've pretty much come to agreement that the core itself doesn't need a full copy of all the messages after it's sent them, but of course, the "prototype" archiver could be used to keep a local copy of everything in a maildir. That could be shared at the lower level (maildir) or through some kind of API in minikitty.
I think the archiver should *definitely* have a REST API for programmatic access to its messages and data.
I think it would be fine for a basic archiver to be essentially feature-equivalent to Pipermail, with two caveats:
- Truly stable URLs, so that when you regenerate the archive from the raw maildir, none of your links break.
- Search.
Other than that, it's all gravy (as we say :). Nice-to-have features like CSS for customizing the look and feel, dynamic rendering of raw messages, etc. would be cool, but IMHO of secondary importance.
Cheers, -Barry

On Tue, Apr 24, 2012 at 7:20 AM, Barry Warsaw <barry@list.org> wrote:
Maybe it would be better to call that the archive's "index", "directory", or "table of contents". The archive may not be hierarchically organized.
Yes, yes, Yes, YES, yesyesyesyesyes! I mean, FTW. ;-)
That's not an appropriate question. The archive backend will decide that, and will provide an IArchive function that can be registered with the core and with front ends.
It would be nice if an IArchive-compatible archive provided a way for new frontends to discover it, but I guess that's kinda bootstrappy -- if we have that, then why don't we just serve the results over that channel?
No storing, please. The stats module can keep its own db if it wants to, and should be using on-line algorithms in any case so the expense of hitting the archive should be minimal.
I don't like the idea of having a "minikitty". As is probably apparent (and I apologize for that; my opinion there is really irrelevant), I am not a fan of turning ML archives into a social network. However, I think Pingou and the other HyperKitty workers should just do whatever it is they want to do, and do it right. If you really want a solid base set of functionality and only then extensions, maybe a plugin architecture would be the way to go. Or you can specify and implement that base first, then add the extensions. (But the mockup already sports UI for the extensions!)
But if that's not really what you want to do, Clearsilver provides a perfectly good base set for us, and I'll be happy to maintain the GPL3-ed distro-in-the-Mailman-distro if that's how it needs to be. *You* do what makes *you* happy.
> On the other hand, having the archiver-core relying on the same system as the core itself would be nicer from a sysadmin pov.
IMHO, premature optimization. Among other things, there isn't going to be a "the" archiver-core. Mailman should provide "a" archiver-core, and I think it should be based on maildir (which is apparently Barry's intuition, too). Theory and implementation of maildir are simple and robust, and that allows us to concentrate on the archiver interface.
> The challenge will be speed

IMHO, Mailman should not take responsibility for the speed of any archiver backend distributed with Mailman. It just needs to provide robust storage, and the two points Barry mentions above.

On Mon, Apr 23, 2012 at 06:20:18PM -0400, Barry Warsaw wrote:
I've been thinking about this and I'm in mild disagreement. I think that a mailing list system should give people an archive-store which is accessible behind a generalized API. That may be a non-local archiver if it's still possible to implement the API. That archive-store should be pluggable (the storage could be SQL, mongodb, or remote) but having the store be accessible is important.
The store may be accessible via a REST API, but I'm not certain that's the correct level to deal with in this context. The current mailman3 doesn't have an API for plugging in archivers via REST; it has an API for plugging in archivers via Python. That may be the correct level to be looking at this from.
Now the important part -- why an archive store is more integral than the current architecture makes it out to be...
One way to look at this is conceptually. Mailman2 is what I've come to think of as a complete mailing list system. By contrast, mailman3-core is only a mailing list manager. Mailman3 contains the information necessary to send messages to an address and have those messages disseminated to a wider audience. By itself, this is just fancy management of email aliases. Mailing lists seem to be something more than this. In addition to being management of where email is sent, they're also repositories of knowledge on a particular subject. This is the role filled by archives.
One could also look at it from a sysadmin standpoint. If a sysadmin wants to deploy mailman3 with archives, and wants to have a forum-like interface, an NNTP interface, a standard archives interface, and a REST interface to the archives, are they going to want to set up four different storage technologies for those, import the generic archives into all four, and then maintain and update the storage technologies to keep them safe and secure? Will they want to buy warrantied storage for all of them? I think they'll be happier if the design of our system can consolidate those.
A different way to look at this is from a programmer's standpoint. Many of the interfaces to archives that we're talking about are going to share common needs. They need access to the email messages. They need to know how the email messages thread together. They're going to want to search the messages. Under the current scheme, programmers will be creating very similar code to access the email messages in their particular store, even if they all choose to use the same underlying storage technology.
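To make those common needs concrete, the shared layer might boil down to something like this (the interface and method names here are illustrative assumptions, not an existing API):

    from zope.interface import Interface

    class IArchiveStore(Interface):
        """Hypothetical storage-facing API shared by all archive UIs."""

        def get_message(list_name, message_id):
            """Return one archived message from the given list."""

        def get_thread(list_name, thread_id):
            """Return the messages of a thread, in thread order."""

        def search(list_name, query):
            """Return the messages matching the query string."""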
At the beginning I said that I was only in mild disagreement... where does the qualifier come in? I think that what we have with mailman3 right now is something like this:
[mailman3 core] -- maintenance of the list metadata, sending and receiving; provides a REST API
[Web UIs] -- web UI to the Core functions
[Archivers] -- mailing list storage and user interface to those stored messages
I think we should look into something a little more symmetrical:
[mailman3 core] -- maintenance of list metadata, sending and receiving; provides a REST API
[Web UIs] -- web UI to Core functions
[Archive-stores] -- store the messages sent to the mailing lists; provide a (REST?) API to apps built on top of them
[Archiver UIs] -- web UI, NNTP interface, REST API (if not implemented at the storage layer), etc., to the archive-store
By splitting the archive storage from the archive UI, similar to how mailman3-core splits from the web UI, we can allow a sysadmin to choose one archive-storage for all of the archive front-ends that they run on their systems.
Question: Why have multiple stores? The big reason is that archives are being much more rapidly developed right now. So I anticipate that people are going to be working on different storage technology with different tradeoffs. One storage might be faster. Another might be more generally available. We'll have to reexamine this in the future. It's possible that we'll find one storage system that is perfect for all cases. It's also possible that we'll find all storage solutions have tradeoffs in which case we'll likely want to support third-party stores forever.
Question: This is all dangling off of the archiver interface for mailman3 anyway, so how can we affect the outcome? Well, in some ways people can create anything they want in there, so we can't enforce a solution. However, if we think it's desirable, we can certainly document this (maybe with an interface, if we go the Python route for that layer of API, or with a specification of what the REST API should look like). We can also enhance our current archivers to provide the API that we come up with. I have a feeling that the prototype archiver with maildir will be a little slow, but if it provides the API and comments about the separation between core, storage, and archive UI, it gives people a starting point for creating their own.
Question: Where do we start? I think that we'll either succeed or fail very quickly by trying to define what the API between archive-store and archiver-ui should look like. We'll either be able to agree on a common set of features there (from which we'll be able to go forth and create our own archive-storage plugins) or we'll decide that we all need/want to do different things that no common API can address. If there's no common API definition then we won't be able to do any of the rest of this so there won't be any sense continuing down that path.
-Toshio

On Tue, 2012-04-24 at 11:12 -0700, Toshio Kuratomi wrote:
Thank you Toshio for explaining this better than I was able to.
The current version of HK relies on mongodb for the storage, but I want to test HK with a traditional SQL backend, so I have started to work on this.
The interface I defined is there: https://github.com/pypingou/kittystore/blob/master/kittystore/__init__.py
And its implementation using SQLAlchemy is there: https://github.com/pypingou/kittystore/blob/master/kittystore/kittysastore.p...
The mongodb implementation isn't done yet but should be quite trivial to do (most functions in the API came from it).
The idea is that we can now have different backends, and each module needing access to the emails can use the API directly, without having to care which storage system is behind it.
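For illustration, the point is that callers program against the interface and the backend behind it can be swapped freely; a minimal, runnable sketch (the class and method names are placeholders, not the actual kittystore API):

    class InMemoryStore:
        """Stand-in backend; a SQL or mongodb store would expose the
        same methods, so callers never notice the difference."""

        def __init__(self):
            self._emails = {}  # (list_name, message_id) -> raw email

        def add_email(self, list_name, message_id, email):
            self._emails[(list_name, message_id)] = email

        def get_email(self, list_name, message_id):
            return self._emails.get((list_name, message_id))

    store = InMemoryStore()
    store.add_email('mm3-dev@example.com', '<id1@example.com>', 'raw email text')
    assert store.get_email('mm3-dev@example.com', '<id1@example.com>') == 'raw email text'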
I hope this helps,
Pierre


On Thu, Apr 26, 2012 at 06:36:02PM +0200, Pierre-Yves Chibon wrote:
Wacky looked at this today and asked if we should have the x-message-id-hash as another key value to look up an email. That seems proper to me. Would we want a separate function or to overload get_email() so that it can either take message_id or message_id_hash?
-Toshio

On May 31, 2012, at 10:28 AM, Toshio Kuratomi wrote:
> Wacky looked at this today and asked if we should have the x-message-id-hash as another key value to look up an email. That seems proper to me.
+1
> Would we want a separate function or to overload get_email() so that it can either take message_id or message_id_hash?
They are separate methods in the IMessageStore API.
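For reference, the X-Message-ID-Hash is the base32-encoded SHA-1 digest of the Message-ID; a small runnable sketch (the whitespace/angle-bracket stripping is my reading of the usual convention):

    import base64
    import hashlib

    def message_id_hash(message_id):
        # Assumption: strip surrounding whitespace and angle brackets
        # before hashing, per the usual convention.
        msgid = message_id.strip().lstrip('<').rstrip('>')
        digest = hashlib.sha1(msgid.encode('utf-8')).digest()
        return base64.b32encode(digest).decode('ascii')

    # A 20-byte SHA-1 digest base32-encodes to a 32-character hash.
    print(message_id_hash('<abc@example.com>'))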
-Barry

On Apr 26, 2012, at 06:36 PM, Pierre-Yves Chibon wrote:
Follow-on thoughts to my previous message.
Let's say you hate the default prototype archiver because mbox is too slow. Further, let's say you have an amazing implementation of the backend message store based on mongodb. How does that fit into my previous picture?
Actually quite easily I think. As long as you can expose insertion into the archiver with the IArchiver interface, and extraction from the archiver via the IMessageStore interface, these two bits can replace the default implementations, just by changing how the ZCA maps the interfaces to implementations. This mongodb-based message-storage-core can even live outside the core process, and *still* be available to in-process Python code, or exposed by the core in its REST API. It would just take a little extra IPC hidden behind the implementations of those two APIs.
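A sketch of what that remapping looks like with the ZCA (MongoMessageStore is hypothetical, and the interface here is a stand-in so the example is self-contained):

    from zope.component import getUtility, provideUtility
    from zope.interface import Interface, implementer

    class IMessageStore(Interface):
        """Stand-in for the real interface, for the sake of the sketch."""

    @implementer(IMessageStore)
    class MongoMessageStore:
        """Hypothetical mongodb-backed message store."""

    # Register the replacement implementation against the interface...
    provideUtility(MongoMessageStore(), IMessageStore)

    # ...and anything asking the ZCA for the interface gets the new backend.
    store = getUtility(IMessageStore)
    assert isinstance(store, MongoMessageStore)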
Cheers, -Barry

On Apr 24, 2012, at 11:12 AM, Toshio Kuratomi wrote:
I'm warming up to this.
The IArchiver interface is generic enough to support both internal and external archivers. If there are deficiencies in either, we can fix the API, as long as both use cases are supportable, in a manner similar to IArchiver.permalink() returning None if the archiver doesn't support stable urls.
(A known omission from the current IArchiver API is that there's no way to access attachments. Does anybody have good ideas about that?)
From a systems perspective, yes. Archivers must be enabled system-wide via the config file, but I think we should allow individual lists to opt-in or -out of system-enabled archivers. I'm on the fence as to what to do about the prototype archiver, which is beginning to seem much more like the default archiver-core, i.e. sans ui.
This is compelling.
I always envisioned the core's storage being splittable into three main partitions. One would be the list-centric data, another would be the user-centric data, and the third would be the message-centric data. If you look carefully for example, you'll see that there are no direct foreign key references between members and the mailing lists they're associated with. This link is by fqdn listname, *not* mailinglist table ids. This is deliberate.
(It's entirely possible the implementation doesn't actually allow these three partitions to be stored in completely separate places. I'd consider that a bug.)
OTOH, I don't think it makes sense for the core to rely on more than one ORM. For now, that's Storm.
(I'm slightly lying here because the technology that shows the most promise for supporting schema migrations is Alembic which is based on a stripped down version of SQLAlchemy. But migrations are probably a completely off-line operation.)
Some IArchiver implementations will be purely external archivers. I like that we can have a Mail Archive implementation, or potentially a Gmane implementation. Those are very different from a MHonArc implementation, which is again different from the prototype (default? built-in? always-enabled?) archiver. Having a common API for all of these simplifies the parts of the core that send messages to the archives, but what happens once the data is inserted into the different archivers is another question.
Remember too that archiver speed is less important, since that doesn't live in the critical path for message delivery. There is a handler that basically copies the message to the archiver queue, and there's a separate runner that dequeues those messages and sends them off to the individual archivers, via the IArchiver interface. So I think the performance of message insertion isn't something we should worry about for now.
Places to start:
- Look at the IArchiver interface and try to figure out whether it's complete from a message-insertion POV. Maybe in that case, we don't care about attachments since the archiver will do whatever it wants with them.
- Look at the IMessageStore API. Is this complete? IOW, could you build a purely Python-level archiver like HyperKitty on top of this API? Here's where proper attachment handling would probably be necessary.
- How would you want to expose the IMessageStore interface in the REST API? My sense is that you could probably take a fairly straightforward translation of IMessageStore into REST, and *that* would be what you'd build the various archiver UIs on top of. REST needs to answer questions like batching, which are necessary for efficient transfer of data over HTTP but not for direct Python calls (see the sketch after this list).
- Should threading information be part of the IMessageStore, or a separate interface? If the prototype archiver becomes the default implementation for the IMessageStore, it probably needs to grow a lot more functionality to support threading information.
The way I'm seeing it is that IArchiver is the interface for getting messages *into* the IMessageStore. The IMessageStore is the interface for making Python level queries needed to get the raw messages out of the system, and a REST API is how you publish this data for the various ui consumers.
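To make the straightforward translation concrete, the resource layout might look something like this (purely hypothetical paths, with batching shown as query parameters):

    GET /3.0/messages/<message-id-hash>                        # one raw message
    GET /3.0/lists/<fqdn_listname>/messages?start=0&count=50   # batched listing
    GET /3.0/lists/<fqdn_listname>/threads/<thread-id>         # one thread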
Cheers, -Barry

On Fri, 2012-06-01 at 21:33 -0400, Barry Warsaw wrote:
I have defined a number of functions in the kittystore interface which could go into the IMessageStore: https://github.com/pypingou/kittystore/blob/master/kittystore/__init__.py
The main difference compared to what is in the IMessageStore at the moment is that I require the fully-qualified list name as the first argument of each function (i.e. I have one table for each list and need to know where you would like me to search).
+1 for the straightforward translation of the interface to REST. Archivers could then use either the implementation of the interface or REST.
Keeping some information about threading in the database, the KittyStore interface allows retrieving, for example, all the messages of a thread or the length of a thread.
I can work on merging KittyStore and IMessageStore and providing the REST interface for it. Should I also provide an implementation, using PostgreSQL as the backend?
Pierre

On Jun 05, 2012, at 08:09 AM, Pierre-Yves Chibon wrote:
Point of order: it's much easier to deal with branches and merge proposals (even for works-in-progress) than it is to deal with patches in a mailing list thread. :)
In any case, a few comments.
Why do we need both generic versions of add/delete/get and list-centric versions of those methods? When a message comes into the system, how would we know which to call, or do we call them both?
If we decide to keep just the list-centric versions, then it's probably better to take an IMailingList object as the first parameter, and use that to get the fqdn_listname if necessary.
Another way of handling search might be to accept keyword arguments, e.g.
    def search(mlist, **kws)
then the keys of kws could be headers, with the values being the search terms you're looking for. You could define something like _body as the key for searching the body (the entire plain text? one of the attachments?).
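A minimal, runnable sketch of that idea (an assumption, not the real API), treating the store as any iterable of email.message.Message objects and assuming single-part plain-text bodies for brevity:

    def search(store, **kws):
        # '_body' is the reserved key for body search; every other
        # keyword names a header to match against.
        body_term = kws.pop('_body', None)
        for msg in store:
            if body_term is not None and body_term not in (msg.get_payload() or ''):
                continue
            if all(term.lower() in (msg.get(header) or '').lower()
                   for header, term in kws.items()):
                yield msg

    # Usage: results = list(search(messages, subject='archiver', _body='IMessageStore'))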
If we don't need the list-centric versions, using __len__() would be better than get_list_size().
Cheers, -Barry

participants (5):
- Barry Warsaw
- Pierre-Yves Chibon
- Stephen J. Turnbull
- Terri Oda
- Toshio Kuratomi