[Mailman-Users] distributing Mailman between 2 systems

Mon Jun 5 03:22:00 CEST 2006

Wow! Thank you, Richard.  Apologies for the top post, but I didn't quite 
know where to jump in on your comments, and I didn't want to truncate 
any of them either.   Thank you for the detailed info.

My issue is that the Mailman host is quite busy processing in/outbound 
email and providing Mailman web access to listinfo, admin, options, etc. 
(via Apache).  When rundig (via nightly_htdig) runs it consumes almost 
all the host CPU resources (5min loadavg hovers at 11% for 5+ hours). 
This has an adverse effect on response times for Apache and the MTA due 
to file system issues (multiple drives and partitions, but in the end 
still just IDE).

My issue isn't so much that I need to move pipermail or the archives to 
a different host but rather just the indexing of them.  I like your 
mailman+htdig integration for pw protected lists, so splitting Mailman 
up from the archives would probably break that or at least make it 
difficult.

Additionally I would like to run nightly_htdig in a continuous loop on 
the second host, constantly cycling through lists and re-digging each as 
needed.
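
The "re-dig each as needed" part of such a loop could be sketched roughly 
as below.  This is only an illustration in Python, assuming each list has 
its own directory under the private archive root and that a timestamp of 
each list's last dig is kept somewhere.  The names lists_needing_redig and 
last_dig are hypothetical, and the actual re-dig command is left out 
because it depends on how the htdig integration is installed.

```python
import os


def newest_mtime(path):
    """Return the newest modification time of any file under path (0.0 if none)."""
    newest = 0.0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            newest = max(newest, os.path.getmtime(os.path.join(dirpath, name)))
    return newest


def lists_needing_redig(archive_root, last_dig):
    """Return sorted names of list directories changed since their last dig.

    last_dig maps list name -> timestamp of the previous dig (hypothetical
    bookkeeping); a list with no entry is always treated as stale.
    """
    stale = []
    for name in sorted(os.listdir(archive_root)):
        list_dir = os.path.join(archive_root, name)
        # Assumption: only the HTML archive directories are indexed, so the
        # raw "<list>.mbox" directories and any stray files are skipped.
        if not os.path.isdir(list_dir) or name.endswith(".mbox"):
            continue
        if newest_mtime(list_dir) > last_dig.get(name, -1.0):
            stale.append(name)
    return stale
```

A continuous loop would call lists_needing_redig repeatedly, re-dig each 
list it returns, record the new timestamp, and sleep briefly between 
passes so the indexing host is not spinning when nothing has changed.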

I know how to tweak htdig.conf so that indexing on one host can return 
htsearch results for another.  I know that NFS could help out, but I'm 
forever concerned about NFS security and network issues because I lease 
external hosts from multiple providers and there will always be WAN 
issues between them.

What I am thinking of doing (based on your and Mark's comments) is to 
rsync archive/private/* to a second host and run nightly_htdig on that 
system.  I can then use Apache config to redirect the htsearch queries 
to the second host, and have the results returned point back at the 
primary Mailman host.  I *think* this will work, but need to test.
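
The rsync step might look something like the following.  It is a rough 
sketch in Python, not a tested setup: the host name and paths used in the 
usage note are made up for illustration, and running rsync over ssh is an 
assumption chosen because of the WAN concerns mentioned above.

```python
import subprocess


def build_rsync_command(src_dir, remote_host, dest_dir):
    """Build an rsync argv that mirrors src_dir's contents into dest_dir on remote_host."""
    # A trailing slash on the source means "copy the contents", not the directory itself.
    src = src_dir.rstrip("/") + "/"
    return [
        "rsync",
        "-az",        # archive mode, compress data in transit
        "--delete",   # drop files on the mirror that no longer exist on the source
        "-e", "ssh",  # carry the transfer over ssh (assumption, see lead-in)
        src,
        f"{remote_host}:{dest_dir}",
    ]


def mirror_archives(src_dir, remote_host, dest_dir):
    """Run the rsync; raises subprocess.CalledProcessError on a non-zero exit."""
    subprocess.run(build_rsync_command(src_dir, remote_host, dest_dir), check=True)
```

For example, mirror_archives("/path/to/archives/private", 
"indexer.example.org", "/path/to/mirror/private") would push the private 
archives to a hypothetical second host, where nightly_htdig could then be 
run against the mirrored copy.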

-Jim P.

Richard Barrett wrote:
> 
> On 4 Jun 2006, at 06:41, Jim Popovitch wrote:
> 
>> I would like to move the pipermail archives to a different host than the
>> main Mailman system.  Specifically for better archive searching
>> performance with htdig.  Is this possible?
>>
>> -Jim P.
> 
> How you approach this depends on what you perceive your problem to be 
> and what you mean by "better archive searching performance with htdig".
> 
> Like Google and other internet search engines, htdig splits the task 
> into two parts: index construction and index search.
> 
> Index construction does the heavy lifting of scanning the source 
> material and squirreling away in its indices a lot of detail of which 
> indexed source files contain what. This can be quite a slow process 
> especially when a large body of material has to be initially scanned and 
> indexed. It is probably best treated as a batch process run at times of 
> light load from other work on the system doing it. Depending on the 
> material concerned and how you configure htdig this indexing may produce 
> very large indices which can come close to being in the same order of 
> magnitude of storage size as the raw source material. Many lists with 
> large indices can generate demand for much CPU and potentially much 
> storage during indexing (and after in the case of storage).
> 
> On the other hand index searching to produce a list of source files that 
> match the search criteria induces a much lower load on the system 
> concerned; after all it is just looking up words in pre-built search 
> indices.
> 
> The problem with this approach is that search indices are never 
> completely up-to-the-minute; but consider how often does Google's 
> crawler visit your web site. While updating search indices when new 
> documents are added to the archive material should be less load-inducing 
> than the original construction of the indices, configuring cron jobs so 
> that htdig rebuilds its indices too frequently is not advisable. The 
> updating of indices can still involve a lot of IO as htdig walks a lot 
> of files to determine which of the existing material has been changed as 
> well as what has been added as new.
> 
> So first you should define what problem you are trying to solve with 
> regard to using htdig before deciding what to do next.
> 
> You could plan on having your HTML mail archives integrated with Mailman 
> e.g. using pipermail or a pipermail/MHonArc synthesis for the archive 
> pages and having htdig integrated with that; I know you are aware of the 
> patches available to support this approach and that there are some 
> benefits as regards archive privacy being maintained and such. I will 
> deal with this integrated approach first. You could deploy multiple 
> processors to address the issues by using NFS to share the mailman 
> archive storage space between them.
> 
> Parenthetically, I successfully ran Mailman on x86 Linux boxes entirely 
> out of NFS mounted storage on enterprise level servers for a number of 
> years, primarily to provide for rapid-ish switchover to a backup server 
> in the case of primary Mailman server hardware failure, which happened 
> on several occasions. At the time I found that I had to limit NFS 
> read/write transfer sizes on the Linux boxes to avoid problems in the 
> Linux kernel locking associated with the NFS implementation then 
> available. Nowadays I am running Mailman on Solaris 10 which has no such 
> problems but I guess Linux's NFS implementation has also improved in 
> the meantime.
> 
> The simplest split you could consider is moving the htdig installation 
> and workload to a separate machine. The Mailman/htdig integration 
> patches support this configuration in conjunction with NFS sharing of 
> the Mailman archives files if you look at the documentation here:
> 
> http://www.openinfo.co.uk/mm/patches/444884/install.html#rconfig
> 
> This configuration leaves one machine running Mailman and being 
> responsible for providing access to archive material while a second 
> machine does htdig's index maintenance. Mailman also "subcontracts" each 
> index search requested by a user to the htdig machine but the URLs 
> returned in the search results mean that the Mailman machine delivers 
> the material from the archives, not the htdig machine.
> 
> The question you asked was how to move the pipermail archives to another 
> system. Using NFS again, it might be possible to run some of Mailman's 
> qrunners on one machine and others (for example, the archive runner) on 
> a second to partition things. I have never had the time or energy to 
> set up systems to explore the issues of such a configuration, but 
> somebody else may have pushed the envelope this way. As an aside, I 
> would avoid like the plague NFS cross-mounting of volumes between 
> machines in any configuration.
> 
> If you decide none of the above is appropriate to what you want to 
> achieve and the way you want to achieve it then you may be asking the 
> wrong question in my view. Maybe you should be deploying a mailing list 
> archiving system independent of Mailman and you could do worse than look 
> at the model set by http://www.mail-archive.com, as a starting point.
> 
> -----------------------------------------------------------------------
> Richard Barrett                               http://www.openinfo.co.uk
> 
> 
>