[Baypiggies] I need some help architecting the big picture

Shannon -jj Behrens jjinux at gmail.com
Mon Apr 28 22:46:40 CEST 2008


Hi,

I need some help architecting the big picture on my current project.
I'm usually a Web guy, and that's a world I understand very well.
However, my current project is more batch-oriented.  Here are the details:

* I have a bunch of customers.

* These customers give me batches of data.  Someday there might be
cron jobs that collect this data from them, or a Web service that
listens for updates from them and creates batches.  For now, though,
they're manually handing me big chunks of data.

* I've built the system in a very UNIXy way right now.  That means
heavy use of cut, sort, awk, small standalone Python scripts, sh, and
pipes.  I've followed the advice of "do one thing well" and "tools,
not policy".

* The data that my customers give me is not uniform.  Different
customers give me the data in different ways.  Hence, I need some
customer-specific logic to transform each customer's data into a
common format before running the rest of the data pipeline (there's a
sketch of one such filter after this list).

* After a bunch of data crunching, I end up with various TSV
(tab-separated values) files containing different things, which I
then load into a database.

* There's a separate database for each customer.

* The database is used to implement a Web service.  This part makes sense to me.

* I'm making heavy use of nose for testing.
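
To make the transformation step concrete, here's a rough sketch of
the kind of per-customer filter I have in mind.  The customer, the
field layout, and the common format are all invented for
illustration:

    #!/usr/bin/env python
    """Normalize one pretend customer's raw CSV into the common TSV.

    Reads the customer's format on stdin and writes the common format
    on stdout, so it composes with the rest of the pipeline via pipes.
    """
    import csv
    import sys

    def normalize(row):
        # Pretend this customer sends (date, name, amount-in-dollars);
        # the common format is (name, date, amount-in-cents).
        date, name, amount = row
        return [name, date, str(int(round(float(amount) * 100)))]

    def main():
        writer = csv.writer(sys.stdout, delimiter='\t', lineterminator='\n')
        for row in csv.reader(sys.stdin):
            writer.writerow(normalize(row))

    if __name__ == '__main__':
        main()

Each customer would get his own little filter like that, and
everything downstream of the filters would be shared.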

Anyway, so I have all this customer-specific logic, and all these data
pipelines.  How do I pull it together into something an operator would
want to use?  Is the idea of an operator appropriate?  I'm pretty sure
this is an "operations" problem.

My current course of action is to:

* Create a global Makefile that knows how to do system-wide tasks.

* Create a customer-specific Makefile for each customer.

* The customer-specific Makefiles all "include" a shared Makefile.  I
modeled this after FreeBSD's ports system.

Hence, the customer-specific Makefiles have some customer-specific
logic in them, but they can share code via the shared Makefile that
they all include (there's a sketch of this layout after the list).

* Currently, all of the Python scripts take all their settings on the
command line.  I'm thinking that the settings belong in an included
Makefile that just contains settings.  By keeping the Python dumb, I'm
attempting to follow the "tools, not policy" idea.
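
To make that layout concrete, here's roughly what I'm picturing.  All
of the paths, variable names, and scripts are invented for
illustration; this is a sketch, not what's actually running:

    # customers/acme/Makefile -- one of these per customer
    CUSTOMER := acme
    DB_HOST  := db1.example.com
    DB_NAME  := acme
    RAW      := incoming/acme-dump.csv

    include ../../common.mk

    # Customer-specific step: normalize the raw dump into common TSV.
    normalized.tsv: $(RAW)
            ./normalize_$(CUSTOMER).py < $< > $@

    # ../../common.mk -- shared rules; all the settings come from the
    # including Makefile, and the Python stays dumb: everything
    # arrives via command-line arguments.
    .PHONY: load
    load: normalized.tsv
            ./load_tsv.py --db-host=$(DB_HOST) --db-name=$(DB_NAME) $<

The global Makefile at the top would then mostly just loop over
customers/*/ and run "make load" (or whatever) in each one.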

I'm having a little bit of a problem with testing.  I don't have a
way of testing any Python code that talks to a database because the
Python scripts are all dumb about how to connect to one.  I'm
thinking I might need to set up a "pretend" customer with a test
database to exercise all of that logic (sketch below).
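
Concretely, I'm picturing something like this: factor the loading
logic into a function that takes any DB-API connection, and then hand
it an in-memory SQLite database in the tests.  The load_rows function
and the orders schema are invented for illustration:

    # test_load_tsv.py -- nose picks this up via "nosetests"
    import sqlite3

    def load_rows(conn, table, rows):
        # Stand-in for the real loading logic, refactored to take a
        # connection rather than knowing how to create one itself.
        conn.executemany('insert into %s values (?, ?, ?)' % table, rows)

    def test_load_rows():
        # An in-memory SQLite database plays the "pretend" customer's
        # database; nothing real gets touched.
        conn = sqlite3.connect(':memory:')
        conn.execute('create table orders (name text, date text, cents integer)')
        load_rows(conn, 'orders', [('acme', '2008-04-28', 1050)])
        assert (conn.execute('select * from orders').fetchall()
                == [('acme', '2008-04-28', 1050)])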

Does the idea of driving everything from Makefiles make sense?

Is there an easier way to share data like database connection
information between the Makefile and Python, other than passing it
explicitly via command-line arguments?

Is there anything that makes more sense than a bunch of
customer-specific Makefiles that include a global Makefile?

How do I get new batches of data into the system?  Do I just put the
files in the right place and let the Makefiles take it from there?

Am I completely smoking, or am I on the right track?

Thanks,
-jj

-- 
I, for one, welcome our new Facebook overlords!
http://jjinux.blogspot.com/

