[Mailman-Users] distributing Mailman between 2 systems

Mon Jun 5 01:41:24 CEST 2006

On 4 Jun 2006, at 06:41, Jim Popovitch wrote:

> I would like to move the pipermail archives to a different host  
> then the
> main Mailman system.  Specifically for better archive searching
> performance with htdig.  Is this possible?
>
> -Jim P.

How you approach this depends on what you perceive your problem to be  
and what you mean by "better archive searching performance with htdig".

Like Google and other internet search engines, htdig splits the task  
into two parts: index construction and index search.

Index construction does the heavy lifting of scanning the source  
material and squirreling away in its indices a lot of detail of which  
indexed source files contain what. This can be quite a slow process  
especially when a large body of material has to be initially scanned  
and indexed. It is probably best treated as a batch process run at
times when the system doing it is lightly loaded. Depending
on the material concerned and how you configure htdig this indexing  
may produce very large indices which can come close to being in the  
same order of magnitude of storage size as the raw source material.  
Many lists with large indices can generate demand for much CPU and  
potentially much storage during indexing (and after in the case of  
storage).
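
One way to keep indexing to times of light load, beyond picking a
quiet hour in cron, is to have the batch job check the load average
before it starts. What follows is only a minimal Python sketch under
stated assumptions: the function names and the threshold are
hypothetical, the command passed in is a placeholder for whatever
script drives htdig indexing on your system, and os.getloadavg() is
only available on Unix-like systems.

```python
import os
import subprocess


def is_light_load(load_1min, threshold):
    """Return True when the 1-minute load average is below the threshold."""
    return load_1min < threshold


def run_index_if_idle(command, threshold=1.0,
                      get_load=os.getloadavg, runner=subprocess.run):
    """Run the indexing command only when the machine is lightly loaded.

    command   -- argument list for the indexing job (placeholder, an assumption)
    threshold -- hypothetical cut-off; tune it for your own hardware
    Returns True if the command was run, False if it was skipped.
    """
    load_1min = get_load()[0]
    if not is_light_load(load_1min, threshold):
        # Too busy right now; leave it for the next scheduled attempt.
        return False
    # check=True raises CalledProcessError if the indexing job fails.
    runner(command, check=True)
    return True
```

A cron entry could call a script built around run_index_if_idle()
during the night, so a run is skipped if the machine happens
to be busy at that moment.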

On the other hand index searching to produce a list of source files  
that match the search criteria induces a much lower load on the  
system concerned; after all it is just looking up words in pre-built  
search indices.
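
To make the difference in cost concrete, here is a toy inverted
index in Python. It is not how htdig is implemented; the function
names and the sample archive are invented for illustration. It only
shows that building the index touches every word of every document,
while a search is a handful of dictionary lookups.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9]+")


def build_index(documents):
    """Slow phase: scan every document and record which ones hold each word."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in WORD.findall(text.lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Fast phase: look up each query word and intersect the results."""
    words = WORD.findall(query.lower())
    if not words:
        return set()
    matches = [index.get(word, set()) for word in words]
    return set.intersection(*matches)


# Hypothetical archive contents, keyed by file name.
archive = {
    "2006-June/000001.html": "moving pipermail archives to another host",
    "2006-June/000002.html": "htdig search of pipermail archives",
}
index = build_index(archive)
print(sorted(search(index, "pipermail htdig")))  # ['2006-June/000002.html']
```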

The problem with this approach is that search indices are never  
completely up-to-the-minute; but consider how often Google's
crawler visits your web site. While updating search indices when new
documents are added to the archive material should induce less load
than the original construction of the indices, configuring
cron jobs so that htdig rebuilds its indices too frequently is not
advisable. The updating of indices can still involve a lot of IO as  
htdig walks a lot of files to determine which of the existing  
material has been changed as well as what has been added as new.
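
The IO cost of an update can be pictured with a small Python sketch
of change detection by modification time. This is an assumption
about the general technique, not a description of htdig's own
update logic, and files_changed_since() is a hypothetical helper.
The point is that every file in the tree is examined even when only
a few have changed.

```python
import os


def files_changed_since(root, last_run):
    """Walk the whole tree; return paths modified after last_run.

    last_run is a timestamp in seconds since the epoch.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            # One stat() call per file, including files that are unchanged.
            if os.stat(path).st_mtime > last_run:
                changed.append(path)
    return sorted(changed)
```

On a large archive the walk itself dominates, which is why running
updates too often buys little freshness for a lot of disk activity.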

So first you should define what problem you are trying to solve
with htdig before deciding what to do next.

You could plan on having your HTML mail archives integrated with  
Mailman e.g. using pipermail or a pipermail/MHonArc synthesis for the  
archive pages and having htdig integrated with that; I know you are  
aware of the patches available to support this approach and that  
there are some benefits as regards archive privacy being maintained
and such. I will deal with this integrated approach first. You could  
deploy multiple processors to address the issues by using NFS to  
share the mailman archive storage space between them.

Parenthetically, I successfully ran Mailman on x86 Linux boxes
entirely out of NFS mounted storage on enterprise level servers for a
number of years, primarily to provide for rapid-ish switchover to a
backup server in the case of primary Mailman server hardware failure,  
which happened on several occasions. At the time I found that I had  
to limit NFS read/write transfer sizes on the Linux boxes to avoid  
problems in the Linux kernel locking associated with the NFS  
implementation then available. Nowadays I am running Mailman on  
Solaris 10, which has no such problems, but I guess the Linux NFS
implementation has also improved in the meantime.

The simplest split you could consider is moving the htdig  
installation and workload to a separate machine. The Mailman/htdig
integration patches support this configuration in conjunction with
NFS sharing of the Mailman archive files; see the documentation
here:

http://www.openinfo.co.uk/mm/patches/444884/install.html#rconfig

This configuration leaves one machine running Mailman and being  
responsible for providing access to archive material while a second  
machine does htdig's index maintenance. Mailman also "subcontracts"  
each index search requested by a user to the htdig machine but the  
URLs returned in the search results mean that the Mailman machine
delivers the material from the archives, not the htdig machine.

The question you asked was how to move the pipermail archives to  
another system. Using NFS again, it might be possible to run some of  
Mailman's qrunners on one machine and others (for example, the  
archive runner) on a second to partition things. I have never had
the time or energy to set up systems to explore the issues of such a
configuration, but somebody else may have pushed the envelope this
way. As an aside, I would avoid like the plague NFS cross-mounting of
volumes between machines in any configuration.

If you decide none of the above is appropriate to what you want to  
achieve and the way you want to achieve it then you may be asking the  
wrong question in my view. Maybe you should be deploying a mailing
list archiving system independent of Mailman; you could do worse than
look at the model set by http://www.mail-archive.com as a starting
point.

-----------------------------------------------------------------------
Richard Barrett                               http://www.openinfo.co.uk