[spambayes-dev] RE: [Spambayes] Are there plans for a daemonized or compiled version of Spambayes?

Mon Sep 22 12:31:13 EDT 2003

On Mon, Sep 22, 2003 at 11:08:09AM -0500, Skip Montanaro wrote:
> 
>     Michael> I haven't looked in depth at the Spambayes code. But being the
>     Michael> sysadmin I'm able to look at the processes running on the
>     Michael> system. It appears that, when an email is scanned, multiple
>     Michael> python threads get forked.  Presumably this is because
>     Michael> "hammiefilter.py" runs other *.py scripts, or exec's multiple
>     Michael> pythons. (True? Not true?)
> 
> Not true I don't think.

I think what is being seen is related to the way qmail delivers mail.
A single qmail-local process is spawned to handle each local delivery.
That qmail-local process then spawns whatever programs the .qmail file
directs it to run.  So there will be one python process running for each
message being delivered at any given time.

> 
>     Michael> Assuming that's what's happening, I guess I was wondering if it
>     Michael> would be beneficial, in the sense of being less demanding on
>     Michael> system resources, to consolidate all the routines into a single
>     Michael> python thread? Is this feasible and worthwhile?
> 
> I tried it quite awhile ago, but didn't code the front-end client in C, just
> Python.  One problem is that you substitute network overhead for startup
> overhead.  Assuming you maintain the long-running process as a Python
> program, you can try a couple things:

Another problem with a qmail setup using .qmail files is that the
long-running process will need to handle multiple concurrent messages,
since qmail can be making several local deliveries at the same time.

> 
>     1 write a front-end client in Python and use a very simple protocol to
>     communicate with the server (maybe a byte count followed by the
>     message).  The server would either spit back the message augmented with
>     the usual scoring headers or just the score information, relying on the
>     client to embellish the message.
> 
>     2 If (and only if) the above isn't fast enough, write the simplest
>     front-end client you can in C to avoid Python startup overhead.
> 
> The first one will give you some idea what you're up against.  Python's
> startup is probably the bottleneck, so I'm skeptical that the first option
> will gain you anything besides an architecture which is simple to experiment
> with.  The Python-based server scores messages very quickly once the startup
> overhead is out of the way.
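
To make the "byte count followed by the message" idea concrete, here is a
minimal sketch in present-day Python 3.  It is not Spambayes code: the
4-byte big-endian length prefix, the X-Example-Score header, and the
score_message() placeholder are all assumptions made for illustration.
The server is threaded so it can cope with several deliveries at once.

```python
import socket
import socketserver
import struct


def recv_exact(sock, n):
    """Read exactly n bytes, or raise if the peer closes early."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("peer closed connection mid-message")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def send_framed(sock, payload):
    # Assumed framing: 4-byte big-endian byte count, then the payload.
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_framed(sock):
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


def score_message(raw):
    # Hypothetical stand-in for the real classifier; always returns 0.5.
    return 0.5


class ScoreHandler(socketserver.BaseRequestHandler):
    """Read one framed message, reply with the message plus a score header."""

    def handle(self):
        raw = recv_framed(self.request)
        header = b"X-Example-Score: %.2f\r\n" % score_message(raw)
        send_framed(self.request, header + raw)


class ScoreServer(socketserver.ThreadingTCPServer):
    # One thread per connection, so concurrent deliveries do not block
    # each other.
    allow_reuse_address = True
    daemon_threads = True


def classify(address, raw):
    """Front-end client: send one message, return the augmented message."""
    with socket.create_connection(address) as sock:
        send_framed(sock, raw)
        return recv_framed(sock)
```

The long-running side would be started with something like
ScoreServer(("127.0.0.1", port), ScoreHandler).serve_forever(), with the
port being whatever you pick, and the .qmail file would run a small script
that calls classify() on the message read from stdin.  Note that this
client is still a python, so it only shows the architecture; it does not
remove the startup cost Skip mentions.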

An alternative would be to create a queue for messages to be classified and
have qmail deliver into the queue.  A long-running process can poll the
queue and classify all of the queued messages at once with one python.
This will of course add some delay by serializing delivery, but it would
certainly decrease the number of concurrent pythons and be less of a PITA
than a service of some sort.
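
A rough sketch of the polling side, in present-day Python 3, might look
like the following.  Everything here is an assumption for illustration:
the queue is taken to be a plain directory with one file per message, the
delivering side is assumed to rename complete files into it (Maildir
style) so that a half-written message is never seen, and score_message()
and the X-Example-Score header are placeholders rather than Spambayes
code.

```python
import os
import time


def score_message(raw):
    # Hypothetical stand-in for the real classifier; always returns 0.5.
    return 0.5


def drain_queue(queue_dir, done_dir):
    """Classify every message currently queued; return how many were done."""
    processed = 0
    for name in sorted(os.listdir(queue_dir)):
        src = os.path.join(queue_dir, name)
        if not os.path.isfile(src):
            continue
        with open(src, "rb") as f:
            raw = f.read()
        header = b"X-Example-Score: %.2f\n" % score_message(raw)
        # Write under a temporary name, then rename, so that a reader of
        # done_dir never sees a partially written result under its final
        # name.
        tmp = os.path.join(done_dir, name + ".tmp")
        with open(tmp, "wb") as f:
            f.write(header + raw)
        os.rename(tmp, os.path.join(done_dir, name))
        os.remove(src)
        processed += 1
    return processed


def run_forever(queue_dir, done_dir, interval=5.0):
    """Poll the queue, sleeping only when there was nothing to do."""
    while True:
        if drain_queue(queue_dir, done_dir) == 0:
            time.sleep(interval)
```

The interval is the trade-off mentioned above: a longer sleep means fewer
wakeups but more delivery delay.  What happens to the files in done_dir
afterwards (final delivery to the user's mailbox) is left out here.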

> 
>     Michael> Another thing I'm thinking about doing to mitigate the impact
>     Michael> on resources, is running the hammiefilter in ramdisk.
> 
> It probably won't buy you much, but it's a simple enough thing to try.  Make
> sure you copy your database (pickle or bsddb file) to ramdisk as well.

Agreed.  This stuff is likely already in the file cache when the server is
busy, so a ramdisk would not add much.

-- 

Dave

===============================================================
| <- You must be smarter than this stick to ride
     the Internet		-Mike Handler
===============================================================