[Baypiggies] I need some help architecting the big picture
jeff at drinktomi.com
Tue Apr 29 00:17:11 CEST 2008
> Anyway, so I have all this customer-specific logic, and all these data
> pipelines. How do I pull it together into something an operator would
> want to use? Is the idea of an operator appropriate? I'm pretty sure
> this is an "operations" problem.
The pipeline is a product that development delivers to operations.
Operations maintains and monitors it. You do not want a
system where an operator has to coordinate it on a permanent basis.
The pipeline should just chug along. Data gets fed in, information
is spit out.
Pushing pipeline development off to "operations" is a sure way of
making your process melt down eventually. You end up with a
system where huge chunks of logic are handled by one group and
huge chunks are handled by another, and nobody actually
understands how the system works.
That said, you'll need an interface so that operations can see
what is happening with the pipeline. They need this to
troubleshoot the pipeline. A simple one may just summarize data from logs.
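As a rough sketch (the log path and message format here are invented),
that summary could be as small as counting lines by severity:

    import collections
    import re

    LOG_PATH = '/var/log/pipeline/pipeline.log'   # invented location

    def summarize(path=LOG_PATH):
        # Count log lines by severity so operations can see at a
        # glance whether anything is failing.
        counts = collections.Counter()
        with open(path) as log:
            for line in log:
                match = re.search(r'\b(DEBUG|INFO|WARNING|ERROR|CRITICAL)\b', line)
                if match:
                    counts[match.group(1)] += 1
        for level, count in sorted(counts.items()):
            print('%-8s %d' % (level, count))

    if __name__ == '__main__':
        summarize()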
> Currently, all of the Python scripts take all their settings on the
> command line. I'm thinking that the settings belong in an included
> Makefile that just contains settings. By keeping the Python dumb, I'm
> attempting to follow the "tools, not policy" idea.
> Is there an easier way to share data like database connection
> information between the Makefile and Python other than passing it in
> explicitly via command line arguments?
Command options are just like method arguments. Too many mandatory
ones being passed all over the place are an indication that they need to
be replaced by a single entity. Pass around a reference to a config
file instead, and use that config file everywhere. Configuration changes
should only be made in one place. Distributing configuration throughout
a pipeline system is a recipe for long-term failure.
A Python file that is sourced is a wonderful config file format. Java
style properties files work too. Simple key-value shell scripts can
be eval'd as Python too. I imagine you already have a config system
for your web front end. Consider re-using that.
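For example, a plain Python module can be the whole config (the names
below are made up for illustration):

    # config.py -- the one place settings live
    DB_HOST = 'db.example.com'
    DB_NAME = 'pipeline'
    DB_USER = 'pipeline_rw'
    INPUT_DIR = '/data/incoming'

    # some_stage.py -- every script imports the same module
    import config

    def connect():
        # Settings come from one place; nothing has to be passed on
        # the command line.
        print('connecting to %s/%s as %s'
              % (config.DB_HOST, config.DB_NAME, config.DB_USER))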
Depending upon how many machines you have interacting, you
may need a distributed config system. Publishing a file via HTTP
is an easy solution.
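A sketch of that, assuming the settings are published as key=value
lines at a made-up URL:

    import urllib.request

    CONFIG_URL = 'http://config.example.com/pipeline.conf'  # invented

    def fetch_config(url=CONFIG_URL):
        # Pull key=value lines from a central server so every host
        # sees the same settings.
        settings = {}
        with urllib.request.urlopen(url) as response:
            for raw in response.read().decode('utf-8').splitlines():
                line = raw.strip()
                if not line or line.startswith('#'):
                    continue
                key, _, value = line.partition('=')
                settings[key.strip()] = value.strip()
        return settings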
> Does the idea of driving everything from Makefiles make sense?
It sounds to me like a horrible hack that will break down when
you start wanting to do recovery and pipeline monitoring.
Consider writing a simple queue management command. It looks
for work in one bin, calls an external command to process the work,
and then dumps the result into the next. The bins can be as simple as:
File A1 goes into bin A/pending
A1 is picked up by job A
A/pending/A1 gets moved to A/consuming/A1
A/consuming/A1 is processed to B/producing/B1
A/consuming/A1 is moved to A/consumed/A1
B/producing/B1 is moved to B/pending/B1
Writing such a simple queue manager should be straightforward.
Then your tool chain becomes nothing more than a series of calls
to the managers. Or you could have each queue command
daemonize itself and then poll the queues every so often.
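A minimal sketch of such a manager, assuming one directory per bin and
an external command named process_a (both invented here; the sketch
also keeps the file name the same from one stage to the next):

    import os
    import shutil
    import subprocess

    def run_queue(in_bin, out_bin, command):
        # Claim each pending file, run the external command on it,
        # and hand the result to the next bin.
        pending = os.path.join(in_bin, 'pending')
        consuming = os.path.join(in_bin, 'consuming')
        consumed = os.path.join(in_bin, 'consumed')
        producing = os.path.join(out_bin, 'producing')
        out_pending = os.path.join(out_bin, 'pending')
        for d in (pending, consuming, consumed, producing, out_pending):
            os.makedirs(d, exist_ok=True)
        for name in sorted(os.listdir(pending)):
            work = os.path.join(consuming, name)
            shutil.move(os.path.join(pending, name), work)
            result = os.path.join(producing, name)
            subprocess.check_call([command, work, result])
            shutil.move(work, os.path.join(consumed, name))
            shutil.move(result, os.path.join(out_pending, name))

    if __name__ == '__main__':
        run_queue('A', 'B', './process_a')

Run it from cron, or wrap the loop in a sleep to get the polling
daemon version.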
> I'm having a little bit of a problem with testing. I don't have a way
> of testing any Python code that talks to a database because the Python
> scripts are all dumb about how to connect to the database. I'm
> thinking I might need to setup a "pretend" customer with a test
> database to test all of that logic.
Standard unit testing stuff should work. Use mock objects to
stub out the database connection.
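For instance, with the standard library's unittest.mock (the function
load_rows and its query are invented for the example):

    import unittest
    from unittest import mock

    def load_rows(conn):
        # Example script function under test: it only knows how to
        # use whatever connection it is handed.
        cursor = conn.cursor()
        cursor.execute('SELECT id, total FROM orders')
        return cursor.fetchall()

    class LoadRowsTest(unittest.TestCase):
        def test_load_rows_uses_connection(self):
            conn = mock.Mock()
            conn.cursor.return_value.fetchall.return_value = [(1, 10.0)]
            self.assertEqual(load_rows(conn), [(1, 10.0)])
            conn.cursor.return_value.execute.assert_called_once()

    if __name__ == '__main__':
        unittest.main()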
I actually do all of my scripts via a little harness that handles
all the generic command line setup. Scripts subclass the
tframe.Framework object (which I'm releasing as soon as
I'm done with the damn book), and the script body goes in a
run(options, args) method. Testing involves instantiating the
script's Framework class and then poking at it.
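The names below are illustrative only, not the actual tframe API, but
the shape of the pattern is roughly this:

    import optparse
    import sys

    class Framework(object):
        # Hypothetical harness: it owns the generic command line
        # setup so the script body stays a plain, testable method.
        def add_options(self, parser):
            pass

        def run(self, options, args):
            raise NotImplementedError

        def main(self, argv=None):
            parser = optparse.OptionParser()
            self.add_options(parser)
            options, args = parser.parse_args(argv)
            return self.run(options, args)

    class LoadStage(Framework):
        def run(self, options, args):
            print('processing', args)
            return 0

    if __name__ == '__main__':
        sys.exit(LoadStage().main())

A test just instantiates LoadStage and calls run() with whatever
options and args it wants to poke at.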
- Jeff Younker - jeff at drinktomi.com -