[Baypiggies] I need some help architecting the big picture

Max Slimmer max at theslimmers.net
Tue Apr 29 00:16:04 CEST 2008


A couple of thoughts on how I might approach this problem; in fact, this
is the kind of thing I am doing.

I use ConfigObj to manage my control files. For each customer there
would be a config file, which can of course pull common parameters from
some global config file, either one that is always read or one included
via a directive in the customer config file. ConfigObj is an extension
of ConfigParser: it allows hierarchical sections, lets you enter values
as lists, and returns everything as Python objects (dicts for the most
part).
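
A minimal sketch of that layering (the file names and keys here are
made up for illustration):

    from configobj import ConfigObj

    # Load the shared defaults first, then overlay the customer file;
    # merge() lets the customer-specific values win.
    config = ConfigObj('global.ini')
    config.merge(ConfigObj('customers/acme.ini'))

    # Hierarchical sections come back as nested dicts, and
    # comma-separated values come back as Python lists.
    db = config['database']
    print(db['host'], db['ports'])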

I would then have a main driver program that initializes logging and
reads the config file specified on the command line. You could have a
section in the config specifying the workflow in terms of Python modules
and classes to be dynamically instantiated; in turn, they could get any
variable info from the same config object.
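
Something like the following could do the instantiation, assuming a
[workflow] section with one [[subsection]] per stage (that layout is
just my assumption):

    import importlib
    import logging

    def build_pipeline(config):
        """Instantiate each stage listed in the [workflow] section."""
        stages = []
        for name in config['workflow']['stages']:
            spec = config['workflow'][name]     # a [[subsection]] per stage
            module = importlib.import_module(spec['module'])
            cls = getattr(module, spec['class'])
            stages.append(cls(config))          # each stage pulls its own settings
        return stages

    logging.basicConfig(level=logging.INFO)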

You can then implement fairly generic modules to retrieve and parse the
customer data, with options to save or move the data to some archive
location. Some modules will be common, particularly the ones toward the
end of the pipeline, since they will typically see data in a known
format and only need to know how to wrap it, assign ownership, and so
on.
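
To keep those stages interchangeable you might give them a common
interface; here is one possible sketch (the class and method names are
my own invention):

    class Stage(object):
        """Base class that every pipeline stage implements."""
        def __init__(self, config):
            self.config = config

        def run(self, records):
            """Take an iterable of records, return transformed records."""
            raise NotImplementedError

    class TsvWriter(Stage):
        """Terminal stage: dump records to a tab-separated file."""
        def run(self, records):
            path = self.config['output']['tsv_path']
            with open(path, 'w') as out:
                for record in records:
                    out.write('\t'.join(str(field) for field in record) + '\n')
            return []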

In my system, after extracting and doing some transformation on the
data, I store the transformed data in a pending directory and pass a
reference to a queue that the next phase waits on. You could also think
about passing operations and data using messaging, though that would
require the various components to each be running and listening for
requests.
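
A minimal sketch of that hand-off using the standard library
(thread-based here just to keep it self-contained; file names are
invented):

    import queue
    import threading

    pending = queue.Queue()

    def transform_phase():
        # ... write the transformed batch under pending/ ...
        pending.put('pending/batch-0001.tsv')  # hand a reference to the next phase

    def load_phase():
        while True:
            path = pending.get()               # blocks until a batch is ready
            print('loading', path)             # ... load the TSV into the database ...
            pending.task_done()

    threading.Thread(target=load_phase, daemon=True).start()
    transform_phase()
    pending.join()                             # wait until every batch is consumed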

I think you will find that using ConfigObj instead of makefiles makes
life simpler, especially if some non-technical person is to be given the
task of generating these. A GUI could help here, or you could use the
validation facilities included.
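
For instance, ConfigObj can check a config file against a spec via its
companion validate module; the spec contents below are made up:

    from configobj import ConfigObj
    from validate import Validator

    # customer.spec describes the expected keys and types, e.g.:
    #   [database]
    #   host = string
    #   port = integer(min=1, max=65535, default=5432)
    config = ConfigObj('customers/acme.ini', configspec='customer.spec')
    result = config.validate(Validator())
    if result is not True:
        raise SystemExit('bad config: %s' % result)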

max

Shannon -jj Behrens wrote:
> Hi,
>
> I need some help architecting the big picture on my current project.
> I'm usually a Web guy, which I understand very well.  However, my
> current project is more batch oriented.  Here are the details:
>
> * I have a bunch of customers.
>
> * These customers give me batches of data.  One day, there might be
> cron jobs for collecting this data from them.  One day I might have a
> Web service that listens to updates from them and creates batches.
> However, right now, they're manually giving me big chunks of data.
>
> * I've built the system in a very UNIXy way right now.  That means
> heavy use of cut, sort, awk, small standalone Python scripts, sh, and
> pipes.  I've followed the advice of "do one thing well" and "tools,
> not policy".
>
> * The data that my customers give me is not uniform.  Different
> customers give me the data in different ways.  Hence, I need some
> customer-specific logic to transform the data into a common format
> before I do the rest of the data pipeline.
>
> * After a bunch of data crunching, I end up with a bunch of different
> TSV (tab-separated values) files containing different things, which I
> end up loading into a database.
>
> * There's a separate database for each customer.
>
> * The database is used to implement a Web service.  This part makes sense to me.
>
> * I'm making heavy use of testing using nose.
>
> Anyway, so I have all this customer-specific logic, and all these data
> pipelines.  How do I pull it together into something an operator would
> want to use?  Is the idea of an operator appropriate?  I'm pretty sure
> this is an "operations" problem.
>
> My current course of action is to:
>
> * Create a global Makefile that knows how to do system-wide tasks.
>
> * Create a customer-specific Makefile for each customer.
>
> * The customer-specific Makefiles all "include" a shared Makefile.  I
> modeled this after FreeBSD's ports system.
>
> Hence, the customer-specific Makefiles have some customer-specific
> logic in them, but they can share code via the shared Makefile that
> they all include.
>
> * Currently, all of the Python scripts take all their settings on the
> command line.  I'm thinking that the settings belong in an included
> Makefile that just contains settings.  By keeping the Python dumb, I'm
> attempting to follow the "tools, not policy" idea.
>
> I'm having a little bit of a problem with testing.  I don't have a way
> of testing any Python code that talks to a database because the Python
> scripts are all dumb about how to connect to the database.  I'm
> thinking I might need to set up a "pretend" customer with a test
> database to test all of that logic.
>
> Does the idea of driving everything from Makefiles make sense?
>
> Is there an easier way to share data like database connection
> information between the Makefile and Python other than passing it in
> explicitly via command line arguments?
>
> Is there anything that makes more sense than a bunch of
> customer-specific Makefiles that include a global Makefile?
>
> How do I get new batches of data into the system?  Do I just put the
> files in the right place and let the Makefiles take it from there?
>
> Am I completely smoking, or am I on the right track?
>
> Thanks,
> -jj
>
>   

