[Baypiggies] I need some help architecting the big picture

Ben Bangert ben at groovie.org
Mon Apr 28 23:57:19 CEST 2008


On Apr 28, 2008, at 1:46 PM, Shannon -jj Behrens wrote:

> My current course of action is to:
>
> * Create a global Makefile that knows how to do system-wide tasks.
>
> * Create a customer-specific Makefile for each customer.
>
> * The customer-specific Makefiles all "include" a shared Makefile.  I
> modeled this after FreeBSD's ports system.
>
> Hence, the customer-specific Makefiles have some customer-specific
> logic in them, but they can share code via the shared Makefile that
> they all include.
>
> * Currently, all of the Python scripts take all their settings on the
> command line.  I'm thinking that the settings belong in an included
> Makefile that just contains settings.  By keeping the Python dumb, I'm
> attempting to follow the "tools, not policy" idea.
>
> I'm having a little bit of a problem with testing.  I don't have a way
> of testing any Python code that talks to a database because the Python
> scripts are all dumb about how to connect to the database.  I'm
> thinking I might need to setup a "pretend" customer with a test
> database to test all of that logic.
>
> Does the idea of driving everything from Makefiles make sense?

My initial thought, given both what you're doing now and what you want
to be able to do "One day", is that you essentially have a
data-warehouse-type operation with data processing flows. Data comes
in, goes through a workflow of some sort where processing occurs, and
at the end you get your TSV files, which you load into a db,
presumably for the customer to retrieve at some later point.

Some things you will probably need shortly when you do go to full  
automation (as it sounds rather manual right now):
- Error reports when a processing step has failed
- Possible retries of a processing step
- Ability to scale out processing so that you can add more boxes easily
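
For the first two of those, the step runner could be as small as
something like this (run_step and send_error_report are hypothetical
callables here, just to show the shape of it):

import time

MAX_RETRIES = 3

def run_with_retries(step_name, run_step, send_error_report):
    """Run one processing step, retrying on failure and reporting errors."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return run_step()
        except Exception as exc:
            if attempt == MAX_RETRIES:
                send_error_report(step_name, exc)  # report after the last try
                raise
            time.sleep(60 * attempt)               # back off, then retry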

While something like Amazon SQS works pretty well for queuing
purposes, I could envision using something like Brad Fitzpatrick's
TheSchwartz (http://search.cpan.org/~bradfitz/TheSchwartz-1.04/lib/TheSchwartz.pm),
a reliable job queue system, instead (and I believe it has a few more
features that'd be important for you). Unfortunately I have yet to see
a Python version of TheSchwartz, though I believe a fully network-based
version is in the works.

If you had such a queue system, I'd think of it like this:
- Customer submits data for processing (whether as a manually
uploaded file, through a web interface, etc.)
- Data is sent to MogileFS (or some redundant and distributed FS like  
it)
- A job is set up in the queue for processing to begin
- Worker processes (likely written in Python, and using the shell
tools you mentioned) take the job, perform their step, put the
processed data back in MogileFS, and replace their job in the queue
with the next step to perform (a rough sketch of this loop follows
below)
    (This step repeats as necessary until processing pipeline is done)
- The web interface can query the db to see whether the job has
finished, pull the data back, etc.
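
To make the worker step concrete, here's a rough sketch of the loop I
have in mind. None of the queue/store calls below are a real library's
API; they're placeholders for whatever queue and MogileFS bindings you
end up using:

PIPELINE = ["clean", "transform", "load"]        # ordered processing steps

def worker_loop(queue, store):
    """Take a job, run one step, and put the next step back on the queue."""
    while True:
        job = queue.take()                       # block until a job arrives
        data = store.fetch(job["data_key"])      # pull input from the file store
        result = run_step(job["step"], data)     # this step's actual processing
        result_key = store.put(result)           # put processed data back
        next_idx = PIPELINE.index(job["step"]) + 1
        if next_idx < len(PIPELINE):
            # replace this job with the next step in the pipeline
            queue.put({"customer": job["customer"],
                       "step": PIPELINE[next_idx],
                       "data_key": result_key})
        else:
            queue.mark_done(job)                 # whole pipeline finished

def run_step(step, data):
    # Dispatch to the real per-step processing (Python code, shell tools,
    # etc.); stubbed out here.
    raise NotImplementedError(step)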

For testing, it should be pretty easy to test the worker processes  
individually to see that each job function can be completed properly.
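
For example, with a made-up clean_data() step and the stdlib unittest
module, a per-step test is nothing more than:

import unittest

def clean_data(rows):
    # hypothetical step function: drop blank rows
    return [row for row in rows if row.strip()]

class CleanStepTest(unittest.TestCase):
    def test_blank_rows_are_dropped(self):
        self.assertEqual(clean_data(["a\t1", "", "b\t2"]),
                         ["a\t1", "b\t2"])

if __name__ == "__main__":
    unittest.main()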

The db would also include the processing steps and workflow in a
table, so you could add a customer and designate their workflow as a
series of job tasks in the order they should be performed. This also
follows what Paul McNett was saying about keeping the settings in one
place, and it provides a single point from which to report on a job's
progress through its workflow, whether each step completed properly,
and the failure message if it didn't.
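
As a sketch of what those tables might look like (using sqlite3 only
for brevity; all the table and column names here are made up):

import sqlite3

conn = sqlite3.connect("pipeline.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS workflow_steps (
        customer    TEXT NOT NULL,
        step_order  INTEGER NOT NULL,
        step_name   TEXT NOT NULL,      -- e.g. 'clean', 'transform', 'load'
        PRIMARY KEY (customer, step_order)
    );
    CREATE TABLE IF NOT EXISTS job_status (
        job_id      INTEGER PRIMARY KEY,
        customer    TEXT NOT NULL,
        step_name   TEXT NOT NULL,
        status      TEXT NOT NULL,      -- 'pending', 'running', 'done', 'failed'
        error_msg   TEXT                -- failure message when status = 'failed'
    );
""")
conn.commit()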

Since you have a data processing pipeline, which is more workflow
oriented, and where it's quite likely you'll want to scale processing
out, the Makefile approach seems like a mismatch for the task at hand.

> Is there an easier way to share data like database connection
> information between the Makefile and Python other than passing it in
> explicitly via command line arguments?

I'd keep it with the rest of the settings in a global customer-data
db, separate from each customer's db.
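
For instance, a script could pull its connection info out of that
settings db with something like this (the table and column names are
invented for the sketch):

import sqlite3

def get_customer_db_settings(customer):
    """Look up one customer's db connection settings in the global db."""
    conn = sqlite3.connect("global_settings.db")
    try:
        row = conn.execute(
            "SELECT db_host, db_name, db_user, db_password "
            "FROM customer_settings WHERE customer = ?",
            (customer,)).fetchone()
    finally:
        conn.close()
    if row is None:
        raise KeyError("no settings for customer %r" % customer)
    return dict(zip(("host", "name", "user", "password"), row))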

> Is there anything that makes more sense than a bunch of
> customer-specific Makefiles that include a global Makefile?

Hopefully what I proposed. :)

> How do I get new batches of data into the system?  Do I just put the
> files in the right place and let the Makefiles take it from there?

With a queue system, you just feed new data in via the web UI, or  
manually, and fire a job request into the queue, then sit back and  
wait for the processed data to be available.
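
Kicking a new batch off is then tiny; store.put() and queue.put() here
are the same hypothetical helpers from the worker sketch above:

def submit_batch(customer, raw_file_path, store, queue):
    """Stash the raw data and queue the first step of the pipeline."""
    with open(raw_file_path, "rb") as f:
        data_key = store.put(f.read())   # raw data into the file store
    queue.put({"customer": customer,     # first step of that customer's workflow
               "step": "clean",
               "data_key": data_key})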

Cheers,
Ben