[Baypiggies] I need some help architecting the big picture
Shannon -jj Behrens
jjinux at gmail.com
Tue Apr 29 06:57:27 CEST 2008
> Does the data structure from a given customer stay consistent? If a batch
> is inconsistent with that customer's "standard", can you bounce it back to
> them or must your toolchain adapt?
It'll stay consistent. If I do have to adapt, it'll be manually, which is okay.
> > * After a bunch of data crunching, I end up with a bunch of different
> > TSV files (tab-separated format) containing different things, which I
> > end up loading into a database.
> And at this point, is the data in a common format for all customers? IOW,
> is the database schema consistent for all customers?
I've identified one table where there is a customer-specific set of
fields. However, the name of the table is always the same and it
always has an id column. Everything else in the schema is generic.
> > Anyway, so I have all this customer-specific logic, and all these data
> > pipelines. How do I pull it together into something an operator would
> > want to use? Is the idea of an operator appropriate? I'm pretty sure
> > this is an "operations" problem.
> Is this operator of the intelligent variety, or some temp worker with Excel?
I'm shooting for sysadmin level.
> Where in the process does this operator sit? Does he/she receive
> the batches from the customers and then feed them to your toolchain and
> verify that the batches made it to the database, or something else entirely?
That's the problem. I don't know. I've never been in a situation
where my user wasn't on the other side of a Web interface. For the
foreseeable future, the operator will be me or some sysadmin. My
guess is that he'll get the data via scp either manually or by cron
job. Now, I have to figure out how to feed the data to the system.
Do I simply put it in some place and say "go!"? I was guessing
someone else had been in a similar situation and had some best
practices to recommend.
> The Makefile strategy sounds very sane, easy to manage once set up. Easy to
> boilerplate for new customers, etc. Well, maybe not "easy", but
> straightforward and understandable.
Ah, good. So I'm not crazy ;)
> The settings should ultimately come from one place. This one place could
> be a text file, a database entry, a part of the customer's Makefile, or the
> operator could get prompted for some or all of the information. The scripts
> taking the arguments on the command line is fine. Each link in the chain
> just passes that information forward.
Agreed. I'm leaning toward having this stuff in the Makefile.
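Roughly, each script would stay dumb and take everything on the command
line, something like this sketch (the script name, option names, and the
loading step are all invented for illustration):

```python
import argparse

def parse_args(argv=None):
    """Parse the connection settings the Makefile passes explicitly.

    The script never guesses settings itself; the Makefile is the one
    place they live, and every link in the chain passes them forward.
    """
    parser = argparse.ArgumentParser(
        description="Load one TSV file into a database table.")
    parser.add_argument("--db-host", required=True)
    parser.add_argument("--db-name", required=True)
    parser.add_argument("--db-user", required=True)
    parser.add_argument("--table", required=True)
    parser.add_argument("tsv_file")
    return parser.parse_args(argv)

# A customer Makefile would invoke it roughly as:
#   python load_tsv.py --db-host $(DB_HOST) --db-name $(DB_NAME) \
#       --db-user $(DB_USER) --table orders incoming/orders.tsv
```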
> > I'm having a little bit of a problem with testing. I don't have a way
> > of testing any Python code that talks to a database because the Python
> > scripts are all dumb about how to connect to the database. I'm
> > thinking I might need to set up a "pretend" customer with a test
> > database to test all of that logic.
> I think keeping the scripts "dumb" is good, but why do your tests need to
> be dumb, too? If you are testing interaction between your script and the
> database, then test that by sending the database connection parameters. A
> test or reference customer is a good idea.
It's just sort of strange because normally when you run nose, you
don't pass any parameters. I guess if I set up a test customer, then I
just need to figure out how to get a few settings out of the Makefile
for the tests. That's manageable.
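One way to do it (just a sketch; the variable names are invented) would
be to have the test customer's Makefile export its settings as
environment variables before running nose, e.g. a target like
"test: ; DB_HOST=$(DB_HOST) DB_NAME=$(DB_NAME) nosetests", and then have
the test code pick them up from the environment with sane defaults:

```python
import os

def db_settings(environ=os.environ):
    """Read test-database settings handed down by the Makefile.

    DB_HOST and DB_NAME are illustrative names; the defaults point
    at a local test-customer database when nose is run bare.
    """
    return {
        "host": environ.get("DB_HOST", "localhost"),
        "name": environ.get("DB_NAME", "test_customer"),
    }
```

That way running nose by hand still works, and the Makefile remains the
one place the real settings live.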
> > Is there an easier way to share data like database connection
> > information between the Makefile and Python other than passing it in
> > explicitly via command line arguments?
> Is this question still from a testing perspective?
No, it's more general, although testing is currently my largest pain point.
> Make a test database,
> and set the test database parameters in the global makefile, and send the
> connection arguments to your python scripts just as you have it now. Then
> each customer-specific makefile will have its own overridden connection
> parameters.
Ah, very good.
> Set up a test database, perhaps copied from some reference customer's data,
> to use in your testing.
> Why does it seem like it is a problem to be passing this information on the
> command line?
I was just checking that this is the right thing to do.
> > Is there anything that makes more sense than a bunch of
> > customer-specific Makefiles that include a global Makefile?
> Are you benefiting in some other way by not making this a pure-Python
> solution?
Absolutely. awk, cut, sort, etc. are fast and simple. Anytime I need
something more complex than a one-liner, that's when I switch to
Python.
> strategy instead of makefiles, and Python modules instead of Unix
> applications, but it is basically the same in the end.
Yes, the subclassing approach is how I'd normally approach it. I've
been reading "The Art of UNIX Programming" lately, and I knew that
this was the perfect situation to be UNIXy.
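For comparison, the subclassing version would look roughly like this
(class and method names are invented): a generic pipeline with hook
methods, and a per-customer subclass overriding them the same way a
customer Makefile overrides the global one.

```python
class Pipeline(object):
    """Generic pipeline: filter rows, then transform the keepers."""

    def run(self, rows):
        return [self.transform(r) for r in rows if self.keep(r)]

    def keep(self, row):
        # Default hook: keep every row.
        return True

    def transform(self, row):
        # Default hook: pass rows through untouched.
        return row

class XyzCorpPipeline(Pipeline):
    """Customer-specific tweaks live in overrides."""

    def keep(self, row):
        return bool(row.strip())   # drop blank rows

    def transform(self, row):
        return row.upper()         # this customer wants uppercase
```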
> > How do I get new batches of data into the system? Do I just put the
> > files in the right place and let the Makefiles take it from there?
> The files are "data", and once a batch is processed you don't need that
> chunk of data anymore, unless you want to archive it. So just have an
> in_data file that the Makefile can feed your toolchain with.
Very good. We're on the same page then. I have an incoming directory
and an archives directory. My plan was to simply drop the incoming
files in the right place--a place that the Makefiles knew about.
However, it just seemed a bit strange since I've never done that
before.
> Perhaps your operator needs a very thin GUI frontend for this to feed the
> input batch into the in_data file and start the make. So, the operator gets
> a data chunk from XYZ Corp. They just select that customer from a list and
> paste in the contents of the data, and press "submit". And wait for a green
> "okay" or a red "uh-oh".
My guess is that they'll always be working with large data files. I
could either force them to put the file in the right place, or I could
write yet another shell script that took the file and put it in the
right place. After the batch is processed, I also need to move the
file from the incoming directory to the archive directory. The
transaction nut in me worries about things like that, but perhaps I'm
worrying too much.
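A sketch of how the incoming-to-archive move could be made reasonably
safe (the paths and the process() hook are assumptions): process the
batch first, and only move it out of incoming/ once processing
succeeded, using os.rename, which is atomic when both directories are
on the same filesystem.

```python
import os

def handle_batch(path, process, archive_dir):
    """Process one incoming batch file, then archive it.

    If process() raises, the file stays put in the incoming
    directory, so a failed run can simply be retried.
    """
    process(path)
    dest = os.path.join(archive_dir, os.path.basename(path))
    os.rename(path, dest)   # atomic within a single filesystem
    return dest
```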
> > Am I completely smoking, or am I on the right track?
> Is any of this implemented yet, or is it still in pure design phase?
Most of the small tools are written, and they're working fine. Some
of the Makefile infrastructure is in place. All of the
customer-specific stuff isn't yet. When it comes to "tools, not
policy", the tools are working fine, but I need some policy ;)
> sounds like you have it implemented and you are going crazy dealing with the
> various inputs from the various customers,
I'm not going crazy yet ;)
> and perhaps you are wondering how
> to scale this to even more customers, and how to train an operator/operators
> to intelligently wield this toolchain you've built.
I have proof of concept and a bunch of pretty nice code. Now I gotta
tie it all together in a way that doesn't involve cutting and pasting
commands from a README all the time ;)
> It sounds like a fun project! :)
I, for one, welcome our new Facebook overlords!