[Baypiggies] I need some help architecting the big picture

Mon Apr 28 23:50:10 CEST 2008

Shannon -jj Behrens wrote:
> I need some help architecting the big picture on my current project.
> I'm usually a Web guy, which I understand very well.  However, my
> current project is more batch oriented.  Here are the details:
> 
> * I have a bunch of customers.
> 
> * These customers give me batches of data.  One day, there might be
> cron jobs for collecting this data from them.  One day I might have a
> Web service that listens to updates from them and creates batches.
> However, right now, they're manually giving me big chunks of data.
> 
> * I've built the system in a very UNIXy way right now.  That means
> heavy use of cut, sort, awk, small standalone Python scripts, sh, and
> pipes.  I've followed the advice of "do one thing well" and "tools,
> not policy".

I think this is good. You can swap out or improve any of the pieces
without detriment, as long as the interface stays the same.

> * The data that my customers give me is not uniform.  Different
> customers give me the data in different ways.  Hence, I need some
> customer-specific logic to transform the data into a common format
> before I do the rest of the data pipeline.

Does the data structure from a given customer stay consistent? If a 
batch is inconsistent with that customer's "standard", can you bounce it 
back to them or must your toolchain adapt?

> * After a bunch of data crunching, I end up with a bunch of different
> TSV files (tab-separated format) containing different things, which I
> end up loading into a database.

And at this point, is the data in a common format for all customers? 
IOW, is the database schema consistent for all customers?

> * There's a separate database for each customer.

Fine.

> * The database is used to implement a Web service.  This part makes sense to me.
> 
> * I'm making heavy use of testing using nose.
> 
> Anyway, so I have all this customer-specific logic, and all these data
> pipelines.  How do I pull it together into something an operator would
> want to use?  Is the idea of an operator appropriate?  I'm pretty sure
> this is an "operations" problem.

Is this operator of the intelligent variety, or some temp worker with 
Excel experience? Where in the process does this operator sit? Does 
he/she receive the batches from the customers and then feed them to your 
toolchain and verify that the batches made it to the database, or 
something else entirely?

> My current course of action is to:
> 
> * Create a global Makefile that knows how to do system-wide tasks.
> 
> * Create a customer-specific Makefile for each customer.
> 
> * The customer-specific Makefiles all "include" a shared Makefile.  I
> modeled this after FreeBSD's ports system.

The Makefile strategy sounds very sane, easy to manage once set up. Easy 
to boilerplate for new customers, etc. Well, maybe not "easy", but 
straightforward and understandable.

> Hence, the customer-specific Makefiles have some customer-specific
> logic in them, but they can share code via the shared Makefile that
> they all include.
> 
> * Currently, all of the Python scripts take all their settings on the
> command line.  I'm thinking that the settings belong in an included
> Makefile that just contains settings.  By keeping the Python dumb, I'm
> attempting to follow the "tools, not policy" idea.

The settings should ultimately come from one place. This one place
could be a text file, a database entry, a part of the customer's 
Makefile, or the operator could get prompted for some or all of the 
information. The scripts taking the arguments on the command line is 
fine. Each link in the chain just passes that information forward.
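
A minimal sketch of what I mean, assuming the one place is an INI-style
text file. The section name, the option names (db_host, db_name), and the
helper names are all made up for illustration; nothing here comes from
your actual scripts.

```python
import argparse
import configparser


def load_settings(text):
    """Read settings from the single place they live (here, INI text)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser["database"])


def to_argv(settings):
    """Turn the settings into command-line arguments for a dumb script."""
    argv = []
    for key in sorted(settings):
        argv += ["--" + key.replace("_", "-"), settings[key]]
    return argv


def parse_args(argv):
    """What a dumb script does: take everything from its command line."""
    p = argparse.ArgumentParser()
    p.add_argument("--db-host", required=True)
    p.add_argument("--db-name", required=True)
    return p.parse_args(argv)


# Hypothetical settings file contents.
EXAMPLE = """
[database]
db_host = localhost
db_name = acme
"""

settings = load_settings(EXAMPLE)
args = parse_args(to_argv(settings))
```

The script itself never learns where the values came from, so it stays
"tools, not policy".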

> I'm having a little bit of a problem with testing.  I don't have a way
> of testing any Python code that talks to a database because the Python
> scripts are all dumb about how to connect to the database.  I'm
> thinking I might need to setup a "pretend" customer with a test
> database to test all of that logic.

I think keeping the scripts "dumb" is good, but why do your tests need 
to be dumb, too? If you are testing interaction between your script and 
the database, then test that by sending the database connection 
parameters. A test or reference customer is a good idea.
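
Something along these lines, as a rough sketch only. I am assuming a
loader that takes an open connection, and I am using an in-memory SQLite
database to stand in for whatever database you really use. The table and
function names are invented.

```python
import csv
import io
import sqlite3


def load_tsv(conn, tsv_file):
    """Load (name, value) rows from a TSV file object; return the row count."""
    conn.execute("CREATE TABLE IF NOT EXISTS items (name TEXT, value INTEGER)")
    rows = [(name, int(value))
            for name, value in csv.reader(tsv_file, delimiter="\t")]
    conn.executemany("INSERT INTO items VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


def test_load_tsv():
    # The test is the smart part: it decides which database to connect to.
    conn = sqlite3.connect(":memory:")
    count = load_tsv(conn, io.StringIO("a\t1\nb\t2\n"))
    assert count == 2
    assert conn.execute("SELECT SUM(value) FROM items").fetchone()[0] == 3
    conn.close()
```

nose should pick up test_load_tsv by its name. The script stays dumb and
only the test knows about the test database.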

> Does the idea of driving everything from Makefiles make sense?

I think it makes a lot of sense. Your pipeline may seem complex on one
level because it has so many little parts, but that is a good thing: it
keeps each function in the pipeline separate and well-oiled.

> Is there an easier way to share data like database connection
> information between the Makefile and Python other than passing it in
> explicitly via command line arguments?

Is this question still from a testing perspective? Make a test database,
set the test database parameters in the global Makefile, and send the
connection arguments to your Python scripts just as you do now. Then
each customer-specific Makefile overrides the connection parameters
with its own.

Set up a test database, perhaps copied from some reference customer's 
data, to use in your testing.
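
The same defaults-plus-overrides layering, sketched in Python rather
than make so it is easy to try out. The customer name and all the
parameter values are hypothetical.

```python
from collections import ChainMap

# Global defaults point at the test database.
GLOBAL_DEFAULTS = {
    "db_host": "localhost",
    "db_name": "test_db",
    "db_user": "tester",
}

# Each customer overrides only what differs.
CUSTOMER_OVERRIDES = {
    "acme": {"db_name": "acme_prod", "db_user": "acme"},
}


def settings_for(customer=None):
    """Return the defaults with the customer's overrides layered on top."""
    overrides = CUSTOMER_OVERRIDES.get(customer, {})
    return dict(ChainMap(overrides, GLOBAL_DEFAULTS))
```

With no customer given you get the test database, which is what the
tests want.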

Why does it seem like it is a problem to be passing this information on 
the command line?

> Is there anything that makes more sense than a bunch of
> customer-specific Makefiles that include a global Makefile?

Are you benefiting in some other way by not making this a pure-Python 
project? Not knowing more, I think I'd try to use a Python subclassing 
strategy instead of makefiles, and Python modules instead of Unix 
applications, but it is basically the same in the end.
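
A small sketch of the subclassing idea. The class names and the input
formats are invented, since I don't know what your customers actually
send. The base class holds the shared pipeline, and each customer
subclass overrides only the step that differs.

```python
class CustomerPipeline:
    """Shared pipeline; subclasses override only the customer-specific step."""

    def parse_line(self, line):
        # Default: the input is already tab-separated (name, value).
        name, value = line.rstrip("\n").split("\t")
        return name, value

    def to_common_tsv(self, lines):
        # Shared step: skip blank lines, emit the common TSV format.
        return ["\t".join(self.parse_line(line))
                for line in lines if line.strip()]


class AcmePipeline(CustomerPipeline):
    """Hypothetical customer that sends 'value;name' with semicolons."""

    def parse_line(self, line):
        value, name = line.strip().split(";")
        return name, value
```

This plays the same role as a customer Makefile that includes the shared
one: the subclass is the customer-specific part and the base class is
the include.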

> How do I get new batches of data into the system?  Do I just put the
> files in the right place and let the Makefiles take it from there?

The files are "data", and once a batch is processed you don't need that 
chunk of data anymore, unless you want to archive it. So just have an 
in_data file that the Makefile can feed your toolchain with.

Perhaps your operator needs a very thin GUI frontend that feeds the
input batch into the in_data file and starts the make. So, the operator
gets a data chunk from XZY Corp. They just select that customer from a
list, paste in the contents of the data, press "submit", and wait for a
green "okay" or a red "uh-oh".
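
The part behind the "submit" button could be as small as this. It is a
sketch that assumes one directory per customer with its own Makefile in
it; the function name and the layout are my guesses, not something you
described.

```python
import subprocess
from pathlib import Path


def submit_batch(customer_dir, data, run=subprocess.run):
    """Write the batch to in_data, start make, and return True on success."""
    customer_dir = Path(customer_dir)
    (customer_dir / "in_data").write_text(data)
    # "make -C DIR" runs make inside the customer's directory.
    result = run(["make", "-C", str(customer_dir)])
    return result.returncode == 0
```

True means the green "okay" and False the red "uh-oh". The run argument
is there only so a test can swap in a fake instead of really calling
make.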

> Am I completely smoking, or am I on the right track?

Is any of this implemented yet, or is it still in the pure design phase?
It sounds like you have it implemented and you are going crazy dealing
with the various inputs from the various customers, and perhaps you are
wondering how to scale this to even more customers, and how to train an
operator or operators to intelligently wield this toolchain you've built.

It sounds like a fun project! :)

Paul