Alternative MemberAdaptors
This message is primarily for those of you writing or using alternative MemberAdaptor implementations (e.g. SQL).
Yesterday I checked in a working implementation of a member adaptor based on BerkeleyDB. It seems to greatly improve the memory footprint for really huge lists, at the cost of greater administrative overhead (because you have to know how to set up and manage BerkeleyDBs), and potentially slower performance (I haven't benchmarked it; this is just based on my experience with BerkeleyDB in general).
I've found a few things during this experience that point to things we ought to improve. I don't have a lot of time right now, but I wanted to put this out there to start the discussion. I'll quickly mention a few things.
I wanted to hook into the BDB transaction (txn) machinery, and I found a convenient hook. I overloaded MailList.Lock() to include a txn begin, MailList.Save() to do a txn commit, and MailList.Unlock() to do a txn abort. This seems to work well as long as aborting after committing is harmless (it is in BDB). I'd like to get feedback from the SQL folks (or other MemberAdaptor developers) on whether we need more explicit transaction support or whether the basically necessary hooks are already there.
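A minimal sketch of the hook arrangement described above, using stand-in classes rather than the real Mailman or BerkeleyDB code. `FakeTxn`, `TxnMailList`, and `add_member` are hypothetical names; only the method names Lock(), Save(), and Unlock() come from the message. The assumption being illustrated is that an abort following a commit is a no-op.

```python
class FakeTxn:
    """Stand-in for a backend transaction (hypothetical, not the BDB API)."""

    def __init__(self, store):
        self._store = store
        self._pending = {}
        self.finished = False

    def put(self, key, value):
        self._pending[key] = value

    def commit(self):
        if not self.finished:
            self._store.update(self._pending)
            self.finished = True

    def abort(self):
        # Harmless after a commit: nothing is left to discard.
        if not self.finished:
            self._pending.clear()
            self.finished = True


class TxnMailList:
    """Hypothetical list object showing where the txn calls would hook in."""

    def __init__(self):
        self.store = {}    # committed member data
        self._txn = None

    def Lock(self):
        self._txn = FakeTxn(self.store)    # txn begin

    def add_member(self, address):
        self._txn.put(address, True)

    def Save(self):
        self._txn.commit()                 # txn commit

    def Unlock(self):
        if self._txn is not None:
            self._txn.abort()              # txn abort, no-op if committed
            self._txn = None


mlist = TxnMailList()

mlist.Lock()
mlist.add_member("anne@example.com")
mlist.Save()
mlist.Unlock()    # abort after commit: the data stays

mlist.Lock()
mlist.add_member("bart@example.com")
mlist.Unlock()    # no Save(): the change is discarded
```

With this arrangement, forgetting to call Save() before Unlock() silently drops the change, which may or may not be the behaviour a given backend wants.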
To make this work, however, I found I had to change the order of when the extend.py hook gets run. Specifically, I needed it to run /before/ the list is locked in MailList.__init__(), otherwise locking constructors don't hook into the machinery. I want to commit this change but I don't want to break other MemberAdaptors or extend.py hooks.
We really need to optimize the MemberAdaptor API and the implementations that use them. Especially methods that return lists, e.g. getMembers() and friends. Right now, everything has to return a list, but I could do much better by returning iterators, because I can load my iterator up with a BDB cursor. This has the advantage of not requiring the entire member database to be loaded into memory just to iterate over it. Unfortunately, too much of the rest of the code assumes these methods return lists, and while I started to go down the iterator path, I backed out of it because of the complexity.
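To make the list-versus-iterator point concrete, here is a rough sketch that streams members from an open database cursor. It uses Python's bundled sqlite3 module purely as a stand-in for a BDB cursor; the `members` table and the `iter_members` function are assumptions for illustration, not part of the MemberAdaptor API.

```python
import sqlite3


def iter_members(conn):
    """Yield member addresses one at a time from an open cursor."""
    cursor = conn.execute("SELECT address FROM members ORDER BY address")
    for (address,) in cursor:
        yield address


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (address TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO members VALUES (?)",
    [("cris@example.com",), ("anne@example.com",), ("bart@example.com",)],
)

members = iter_members(conn)
first = next(members)    # only one row has been pulled so far
rest = list(members)     # the remaining rows, in order
```

The returned object supports iteration but not len() or indexing, so any caller that relies on those would need changing. That is the kind of assumption the rest of the code currently makes.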
There are other optimizations that would require a bit more thought. E.g. the admin's Membership List page seems to require that the entire member database be iterated over to chunkify and bucketize. Fixing this probably requires both changes to the u/i and changes to the interface. It also makes life more difficult for OldStyleMemberships, although BDBMemberAdaptor can probably be fairly easily elaborated.
I'd like to hear from other member adaptor implementations on their thoughts here.
I'd love for any BerkeleyDB experts to review the BDBMemberAdaptor code, especially in some of the choices I've made for creating and opening the environment. I had a lot of practical problems with this part of the code, especially in getting multiple processes to cooperate reasonably. Any BerkeleyDB experts out there? (I'm fairly happy with the schemas, at least for the current MemberAdaptor API).
I'm leaning heavily toward having this stuff in Mailman 2.2 and /not/ porting it to 2.1.x. Too many changes for a micro release, although it makes project management more complicated, especially in merging fixes back into the 2.1.x maintenance branch. Sigh.
Okay, I'm out of time for today. Any feedback will be appreciated, even if I can't respond immediately.
Also, the BerkeleyDB based member adaptor seems to work, but should be considered experimental. See the BDBMemberAdaptor.py comments for how to hook this up to a mailing list. There is currently no migration tool from classic member adaptors to BDBMemberAdaptors, although I intend to write such a beast and run a few of my personal lists on the code to flesh things out.
Enjoy, -Barry
On Thu, Feb 20, 2003 at 10:55:52AM -0500, barry@python.org wrote:
[ MemberAdaptors ]
I've found a few things during this experience that point to things we ought to improve. I don't have a lot of time right now, but I wanted to put this out there to start the discussion. I'll quickly mention a few things.
I haven't looked at MemberAdaptors in any level of detail yet, but I do intend to write one or two member adaptors for our internal company lists. One is a straight (My)SQL one that takes data from a simple set of tables, one of which is specifically for Mailman (and thus easy to change.)
The other would be a much more complex adaptor that hooks right into our company database ('NSA', PostgreSQL with an OO interface library written in Perl[*]) using XML-RPC. I already use XML-RPC from Python to test and twiddle with so much more ease than from Perl or PHP, so I'm sure that's not going to be an issue. The main reason for using the XML-RPC interface is, however, to be able to access all the email aliases the list subscribers have. The company-internal lists are strictly controlled, and every day or so someone will post a message from their new funky alias, which will be held. I thought Mailman 2.1 was going to have a mechanism to avoid that (a listing of 'these email addresses are also me' in the options page) but that may have been a dream. In any case, I can add that :)
I'm also not sure whether I really want a full MemberAdaptor for the XML-RPC case, or a mix of another backend and XML-RPC. Anyway, neither implementation would be transactional (and the MySQL server is 3.x.)
- I wanted to hook into the BDB transaction (txn) machinery, and I found a convenient hook. I overloaded MailList.Lock() to include a txn begin, MailList.Save() to do a txn commit, and MailList.Unlock() to do a txn abort. This seems to work well as long as aborting after committing is harmless (it is in BDB). I'd like to get feedback from the SQL folks (or other MemberAdaptor developers) on whether we need more explicit transaction support or whether the basically necessary hooks are already there.
Well, I can't really tell without (re-)grokking the code more, but in any case an abort after a commit should not pose a problem; it's just a matter of remembering state. Many SQL implementations won't even care if you do 'BEGIN WORK; <work>; COMMIT; ROLLBACK;' -- they'll give a notice, but not abort anything.
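A small check of the commit-then-rollback case, as a sketch only. It uses sqlite3 because it ships with Python; other SQL backends (MySQL, PostgreSQL, Oracle) are not exercised here and may report a notice or warning instead of staying silent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (address TEXT)")

# The INSERT implicitly opens a transaction in sqlite3's default mode.
conn.execute("INSERT INTO members VALUES ('anne@example.com')")
conn.commit()      # corresponds to Save()
conn.rollback()    # corresponds to Unlock(); no transaction is open here

count = conn.execute("SELECT COUNT(*) FROM members").fetchone()[0]
```

The committed row survives the later rollback, which matches the behaviour the Lock()/Save()/Unlock() hooks depend on.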
When not doing anything transactional, it gets easier, of course. Maintain all state in the Adaptor, and only commit something to a backend on the Save() :)
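For the non-transactional case, a rough sketch of keeping all state in the adaptor and touching the backend only on save. `BufferedAdaptor` and its method names are hypothetical, and a plain dict stands in for the real backend (MySQL, XML-RPC, or anything else).

```python
class BufferedAdaptor:
    """Hypothetical adaptor that buffers changes until save() is called."""

    def __init__(self, backend):
        self._backend = backend    # dict standing in for real storage
        self._added = {}
        self._removed = set()

    def add_member(self, address, realname=""):
        self._removed.discard(address)
        self._added[address] = realname

    def remove_member(self, address):
        self._added.pop(address, None)
        self._removed.add(address)

    def is_member(self, address):
        if address in self._removed:
            return False
        return address in self._added or address in self._backend

    def save(self):
        # The only place the backend is written to.
        for address in self._removed:
            self._backend.pop(address, None)
        self._backend.update(self._added)
        self.discard()

    def discard(self):
        # Drop buffered changes, e.g. on Unlock() without Save().
        self._added.clear()
        self._removed.clear()


backend = {"anne@example.com": "Anne"}
adaptor = BufferedAdaptor(backend)

adaptor.add_member("bart@example.com", "Bart")
adaptor.remove_member("anne@example.com")
before_save = dict(backend)    # backend is untouched so far

adaptor.save()
```

Calling discard() instead of save() gives the same effect as an abort without needing any transaction support in the backend.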
- We really need to optimize the MemberAdaptor API and the implementations that use them. Especially methods that return lists, e.g. getMembers() and friends. Right now, everything has to return a list, but I could do much better by returning iterators, because I can load my iterator up with a BDB cursor. This has the advantage of not requiring the entire member database to be loaded into memory just to iterate over it. Unfortunately, too much of the rest of the code assumes these methods return lists, and while I started to go down the iterator path, I backed out of it because of the complexity.
This is highly backend dependent... In the XML-RPC case, you really don't want an XML-RPC call to go out for every list member (especially not if the xmlrpc library doesn't support/use keepalive.) On the other hand, getting the entire list into the adaptor and then returning an iterator to that list to Mailman might be suboptimal if Mailman ever has to decide between playing convenient (a list) and playing nice (an iterator).
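One possible middle ground, sketched under assumptions: the adaptor hands Mailman an iterator, but fills it from the backend a page at a time, so a remote backend sees one call per page rather than one per member. `iter_in_batches` and `fake_remote_fetch` are hypothetical; the latter stands in for an XML-RPC call and is not a real API.

```python
def iter_in_batches(fetch_page, page_size=100):
    """Yield items one by one, fetching page_size items per backend call."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            return
        yield from page
        if len(page) < page_size:
            return    # a short page means there is nothing more to fetch
        offset += len(page)


ALL_MEMBERS = ["m%03d@example.com" % i for i in range(250)]
calls = []


def fake_remote_fetch(offset, limit):
    # Stand-in for a remote call; records each call for inspection.
    calls.append((offset, limit))
    return ALL_MEMBERS[offset:offset + limit]


result = list(iter_in_batches(fake_remote_fetch, page_size=100))
```

Here 250 members cost three backend calls instead of 250, and at most one page is held in memory at a time.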
There are other optimizations that would require a bit more thought. E.g. the admin's Membership List page seems to require that the entire member database be iterated over to chunkify and bucketize. Fixing this probably requires both changes to the u/i and changes to the interface. It also makes life more difficult for OldStyleMemberships, although BDBMemberAdaptor can probably be fairly easily elaborated.
And in SQL, the optimal way to do this is probably to count the number of entries, and then split it into chunks, get whichever chunk is desired, and the first and last entry of the other chunks. The count can be done entirely in the SQL server, and is generally pretty damned fast. This will definitely pay off for very large lists, but it's not really trivial to expose all that logic to the adaptor in a future-proof way.
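The count-then-chunk idea can be sketched roughly as follows. This uses sqlite3 and LIMIT/OFFSET syntax as assumptions; the `members` table and `chunk_summary` function are made up for illustration, and other servers may need different paging syntax.

```python
import sqlite3

ORDERED = "SELECT address FROM members ORDER BY address"


def chunk_summary(conn, chunk_size, wanted):
    """Return (total, summary): full rows for the wanted chunk,
    (first, last) address pairs for every other chunk."""
    total = conn.execute("SELECT COUNT(*) FROM members").fetchone()[0]
    nchunks = (total + chunk_size - 1) // chunk_size
    summary = []
    for i in range(nchunks):
        offset = i * chunk_size
        size = min(chunk_size, total - offset)
        if i == wanted:
            rows = conn.execute(
                ORDERED + " LIMIT ? OFFSET ?", (size, offset)
            ).fetchall()
            summary.append([row[0] for row in rows])
        else:
            first = conn.execute(
                ORDERED + " LIMIT 1 OFFSET ?", (offset,)
            ).fetchone()[0]
            last = conn.execute(
                ORDERED + " LIMIT 1 OFFSET ?", (offset + size - 1,)
            ).fetchone()[0]
            summary.append((first, last))
    return total, summary


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (address TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO members VALUES (?)",
    [(c + "@example.com",) for c in "gfedcba"],
)

total, summary = chunk_summary(conn, chunk_size=3, wanted=1)
```

Only the requested chunk is loaded in full; the other chunks contribute two addresses each, which is enough to label the buckets on a membership page.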
Maybe-we-should-do-a-Mailman-Sprint-at-PyCon-Barry-ly y'rs :)
[*] 'NSA' is also the reason I'm not as active as I once was... I hate Perl, much more so now than before I actually used it full-time. But, I get to go to PyCon, so I haven't lost my soul completely yet :)
Thomas Wouters <thomas@xs4all.net>
Hi! I'm a .signature virus! copy me into your .signature file to help me spread!
Thomas Wouters wrote:
On Thu, Feb 20, 2003 at 10:55:52AM -0500, barry@python.org wrote:
[ MemberAdaptors ]
FYI, I'm working on a MemberAdaptor for mxODBC and DCOracle2 (which won't be open sourced though).
I've found a few things during this experience that point to things we ought to improve. I don't have a lot of time right now, but I wanted to put this out there to start the discussion. I'll quickly mention a few things.
- I wanted to hook into the BDB transaction (txn) machinery, and I found a convenient hook. I overloaded MailList.Lock() to include a txn begin, MailList.Save() to do a txn commit, and MailList.Unlock() to do a txn abort. This seems to work well as long as aborting after committing is harmless (it is in BDB). I'd like to get feedback from the SQL folks (or other MemberAdaptor developers) on whether we need more explicit transaction support or whether the basically necessary hooks are already there.
This worked for me as well. I did have to use a central database controller, though, in order to maintain transaction state independently of the MemberAdaptor.
- We really need to optimize the MemberAdaptor API and the implementations that use them. Especially methods that return lists, e.g. getMembers() and friends. Right now, everything has to return a list, but I could do much better by returning iterators, because I can load my iterator up with a BDB cursor. This has the advantage of not requiring the entire member database to be loaded into memory just to iterate over it. Unfortunately, too much of the rest of the code assumes these methods return lists, and while I started to go down the iterator path, I backed out of it because of the complexity.
I found that I had to recode the admin.py CGI stuff. Most of the code seems to be built with the idea of having very fast access to all aspects of the member data.
With databases and other external storages this is not the case. It is often better to get all the member data for a chunk of members at once than to call out to the storage for each and every bit of information.
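As an illustration of fetching a whole chunk at once, here is a sketch that loads several attributes for a group of members in a single query. It uses sqlite3 as a stand-in backend; the table layout, column names, and `fetch_member_data` are assumptions, not the mxODBC or DCOracle2 adaptor described above.

```python
import sqlite3


def fetch_member_data(conn, addresses):
    """Return {address: {...}} for all given addresses using one query."""
    addresses = list(addresses)
    if not addresses:
        return {}
    placeholders = ",".join("?" * len(addresses))
    rows = conn.execute(
        "SELECT address, realname, digest FROM members "
        "WHERE address IN (%s)" % placeholders,
        addresses,
    ).fetchall()
    return {
        address: {"realname": realname, "digest": bool(digest)}
        for address, realname, digest in rows
    }


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE members (address TEXT PRIMARY KEY, realname TEXT, digest INTEGER)"
)
conn.executemany(
    "INSERT INTO members VALUES (?, ?, ?)",
    [
        ("anne@example.com", "Anne", 0),
        ("bart@example.com", "Bart", 1),
        ("cris@example.com", "Cris", 0),
    ],
)

chunk = fetch_member_data(conn, ["anne@example.com", "bart@example.com"])
```

A page that shows, say, thirty members then costs one round trip instead of thirty times the number of attributes displayed.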
There are other optimizations that would require a bit more thought. E.g. the admin's Membership List page seems to require that the entire member database be iterated over to chunkify and bucketize. Fixing this probably requires both changes to the u/i and changes to the interface. It also makes life more difficult for OldStyleMemberships, although BDBMemberAdaptor can probably be fairly easily elaborated.
-- Marc-Andre Lemburg eGenix.com
Professional Python Software directly from the Source (#1, Mar 05 2003)
Python/Zope Products & Consulting ... http://www.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
Python UK 2003, Oxford: 27 days left EuroPython 2003, Charleroi, Belgium: 111 days left