> This will form a virtual (or real if you have different machines)
> systolic array with producers feeding consumers that feed
> the summary process, all running concurrently.

Nah, I can't do that. The summary process is expensive, but not nearly as
expensive as the consuming (10 minutes vs. a few hours), and it can't be
started before the consumers are done anyway.

> You only need to keep the output of the consumers in files if
> you need access to it later for some reason. In your case it sounds
> as if you are only interested in the output of the summary.

Or if the summarizing process requires the data to be stored in files. Or if
the consumers naturally store their data in files. The consumers "produce"
several gigabytes of data: not enough to make the job intractable, but enough
that I don't want to load it into RAM or transmit it back.

In case you are wondering what the job is: I'm indexing a lot of documents
with Xapian. The producer reads the [compressed] documents from the hard
disk, and the consumers process them and index them, each into its own Xapian
database. When they are finished, I merge the databases (the summarizing) and
delete the partial DBs. I don't need the DBs to be in memory at any time, and
Xapian works with files anyway. Even if I were to use different machines (not
worth it for a process that will not run very frequently, except during
development), it would still be cheaper to scp the files.
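
Roughly, the layout looks like the sketch below. This is only an illustration under a few assumptions: it is written in Python, the documents are taken to be zlib-compressed, and plain text files stand in for the Xapian databases, so no Xapian calls appear at all. The names `producer`, `consumer`, `merge` and `run` are made up for the example.

```python
import multiprocessing as mp
import os
import zlib

SENTINEL = None  # tells a consumer that no more work is coming


def producer(docs, work, n_consumers):
    """Decompress each document and queue it for the consumers."""
    for blob in docs:
        work.put(zlib.decompress(blob).decode("utf-8"))
    # One sentinel per consumer so that every one of them stops.
    for _ in range(n_consumers):
        work.put(SENTINEL)


def consumer(work, partial_path):
    """Stand-in for indexing: write one record per document to a file.

    In the real job this would be the expensive step that builds a
    partial database on disk (assumption: not shown here).
    """
    with open(partial_path, "w", encoding="utf-8") as out:
        while True:
            text = work.get()
            if text is SENTINEL:
                break
            out.write(f"{len(text.split())}\t{text}\n")


def merge(partial_paths, merged_path):
    """Stand-in for the summary: concatenate partials, then delete them."""
    with open(merged_path, "w", encoding="utf-8") as merged:
        for path in partial_paths:
            with open(path, encoding="utf-8") as part:
                for line in part:
                    merged.write(line)
            os.remove(path)


def run(docs, workdir, n_consumers=2):
    """Start the consumers, feed them, wait for all of them, then merge."""
    work = mp.Queue(maxsize=16)  # bounded, so the producer can't run far ahead
    partials = [os.path.join(workdir, f"partial-{i}.idx")
                for i in range(n_consumers)]
    workers = [mp.Process(target=consumer, args=(work, path))
               for path in partials]
    for w in workers:
        w.start()
    producer(docs, work, n_consumers)
    for w in workers:
        w.join()  # the merge cannot start before every consumer is done
    merged_path = os.path.join(workdir, "merged.idx")
    merge(partials, merged_path)
    return merged_path
```

Only the documents go through the queue; the bulky output stays on disk next to each consumer, and the merge step only ever sees file paths. `run()` should be called from under an `if __name__ == "__main__":` guard so that it also works on platforms where multiprocessing starts fresh interpreters.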

Now, if I only had a third core available to consume a bit faster ...


