[Baypiggies] I need some help architecting the big picture
ben at groovie.org
Mon Apr 28 23:57:19 CEST 2008
On Apr 28, 2008, at 1:46 PM, Shannon -jj Behrens wrote:
> My current course of action is to:
> * Create a global Makefile that knows how to do system-wide tasks.
> * Create a customer-specific Makefile for each customer.
> * The customer-specific Makefiles all "include" a shared Makefile. I
> modeled this after FreeBSD's ports system.
> Hence, the customer-specific Makefiles have some customer-specific
> logic in them, but they can share code via the shared Makefile that
> they all include.
> * Currently, all of the Python scripts take all their settings on the
> command line. I'm thinking that the settings belong in an included
> Makefile that just contains settings. By keeping the Python dumb, I'm
> attempting to follow the "tools, not policy" idea.
> I'm having a little bit of a problem with testing. I don't have a way
> of testing any Python code that talks to a database because the Python
> scripts are all dumb about how to connect to the database. I'm
> thinking I might need to setup a "pretend" customer with a test
> database to test all of that logic.
> Does the idea of driving everything from Makefiles make sense?
My initial thought given both what you're doing now, and what you want
to be able to do "One day", is that you essentially have a data
warehouse type operation with data processing flows. There is data
that comes in, goes through a workflow process of some sort where
processing occurs, at which point you get your TSV files that you load
into a db, presumably for said customer to retrieve at some later date.
Some things you will probably need shortly when you do go to full
automation (as it sounds rather manual right now):
- Error reports when a processing step has failed
- Possible retries of a processing step
- Ability to scale out processing so that you can add more boxes easily
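To make the first two concrete, a retry wrapper around a processing step might look like this minimal sketch (the function and demo names are made up for illustration; a real system would file the error report somewhere durable rather than just logging it):

```python
import logging

def run_with_retries(step_fn, data, attempts=3):
    """Run a processing step, retrying on failure and logging an
    error report for each failed attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return step_fn(data)
        except Exception as exc:
            logging.error("step %s failed (attempt %d/%d): %s",
                          step_fn.__name__, attempt, attempts, exc)
    raise RuntimeError("step %s failed after %d attempts"
                       % (step_fn.__name__, attempts))

# Demo: a step that fails once, then succeeds on the retry.
def _flaky_demo(data, _state={"tries": 0}):
    _state["tries"] += 1
    if _state["tries"] < 2:
        raise ValueError("transient failure")
    return data

result = run_with_retries(_flaky_demo, "payload")
```
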
While something like Amazon SQS works pretty well for queuing
purposes, I could envision using something like Brad Fitzpatrick's
TheSchwartz reliable job queue system instead (and I believe it has a
few more features that'd be important for you). Unfortunately I have
yet to see a Python version of TheSchwartz, and there's a fully
network-based version in the works, I believe.
If you had such a queue system, I'd think of it like this:
- Customer submits data for processing (whether as a manual file, a
web upload, etc.)
- Data is sent to MogileFS (or some similarly redundant and
distributed FS)
- A job is setup in the queue for processing to begin
- Worker processes (likely written in Python, and using the shell
tools you talked of) take the job, perform their step, put the
processed data back in MogileFS, replace their job in the queue with
the next step to perform
(This step repeats as necessary until processing pipeline is done)
- Web Interface can query db to see if job was finished, pull data
when it's ready
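The worker half of that loop can be sketched like this (everything here is an in-memory stand-in: the queue plays the role of TheSchwartz/SQS, the dict plays the role of MogileFS, and the pipeline step names are made up):

```python
import queue

# Hypothetical ordered pipeline of processing steps.
PIPELINE = ["clean", "transform", "load_tsv"]

def run_step(step, data):
    # Stand-in for invoking the real shell tool for this step.
    return data + [step]

def worker(jobs, storage):
    """Take a job, perform its step, put the processed data back in
    storage, and replace the job with one for the next step, until
    the pipeline is done."""
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return
        data = storage[job["key"]]
        step = PIPELINE[job["step"]]
        storage[job["key"]] = run_step(step, data)
        if job["step"] + 1 < len(PIPELINE):
            jobs.put({"key": job["key"], "step": job["step"] + 1})

# Usage: customer data enters storage, and a job is queued at step 0.
storage = {"cust42": []}
jobs = queue.Queue()
jobs.put({"key": "cust42", "step": 0})
worker(jobs, storage)
```

Each iteration does one step and re-queues, so any idle worker on any box can pick up the next step; that's what lets you scale out by just adding machines.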
For testing, it should be pretty easy to test the worker processes
individually to see that each job function can be completed properly.
The db would also include processing steps and workflow in a table. So
you could add a customer and designate a workflow as a series of job
tasks in the order they should be performed. This also follows what
Paul McNett was saying about keeping the settings in one place. It
also provides a single point to report on the progress of each job
through its workflow, whether each step completed properly, and
failure messages if it didn't.
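A sketch of that workflow table (sqlite3 here purely as a stand-in for whatever db you'd actually use; the table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE workflow (
    customer TEXT, position INTEGER, task TEXT)""")

# Designate a customer's workflow as an ordered series of job tasks.
conn.executemany(
    "INSERT INTO workflow VALUES (?, ?, ?)",
    [("acme", 1, "clean"),
     ("acme", 2, "transform"),
     ("acme", 3, "load_tsv")])

def tasks_for(customer):
    """Return the job tasks in the order they should be performed."""
    rows = conn.execute(
        "SELECT task FROM workflow WHERE customer = ? ORDER BY position",
        (customer,))
    return [task for (task,) in rows]
```

Adding a customer is then just inserting rows, and a status column per step would give you the progress reporting in the same place.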
Since you have a data processing pipeline, which is more workflow
oriented, and where it's quite likely you may want to scale processing
out, the Makefile approach seems like a mismatch to the task at hand.
> Is there an easier way to share data like database connection
> information between the Makefile and Python other than passing it in
> explicitly via command line arguments?
I'd keep it with the rest of the settings in a global customer-data
db, separate from each customer's db.
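That is, the scripts look connection info up instead of taking it on the command line; a minimal sketch (again sqlite3 as a stand-in, with invented table and DSN values):

```python
import sqlite3

# Global customer-settings db, separate from each customer's own db.
settings = sqlite3.connect(":memory:")
settings.execute(
    "CREATE TABLE db_settings (customer TEXT PRIMARY KEY, dsn TEXT)")
settings.execute(
    "INSERT INTO db_settings VALUES ('acme', 'host=db1 dbname=acme')")

def dsn_for(customer):
    """Look up a customer's db connection string, so neither the
    Makefile nor the command line needs to carry it."""
    row = settings.execute(
        "SELECT dsn FROM db_settings WHERE customer = ?",
        (customer,)).fetchone()
    return row[0] if row else None
```

For testing, you'd just point this lookup at a "pretend" customer whose DSN names a throwaway test database.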
> Is there anything that makes more sense than a bunch of
> customer-specific Makefiles that include a global Makefile?
Hopefully what I proposed. :)
> How do I get new batches of data into the system? Do I just put the
> files in the right place and let the Makefiles take it from there?
With a queue system, you just feed new data in via the web UI, or
manually, and fire a job request into the queue, then sit back and
wait for the processed data to be available.