[Baypiggies] I need some help architecting the big picture

Shannon -jj Behrens jjinux at gmail.com
Tue Apr 29 22:38:49 CEST 2008

On Mon, Apr 28, 2008 at 10:34 PM, Shannon -jj Behrens <jjinux at gmail.com> wrote:
> Thanks, everyone for all your advice!  I'm probably even more confused
>  now than I was before, but that's only because I have more options to
>  consider ;)
>  Thanks Again!

I've been trying to digest all the different comments.  A few people
were trying to frame this as a data warehousing or data integration
problem.  I don't know much about those subjects.  However, my naive
guess is that they don't apply for the following reasons:

It's hard to imagine that this is a data warehousing problem since I'm
not trying to warehouse the data.  I have no requirement to store it
forever.  I'm simply trying to "infer" some stuff about the data.
When I'm doing this inferencing, I'm not even working with the data in
a database.  I'm working with it in TSV form since that works out
better (it's faster to stream from file to file than from table to
table).
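
As a rough illustration of the file-to-file streaming described above,
here is a minimal Python sketch.  The function name stream_tsv and the
per-row transform callback are hypothetical and chosen only for this
example; the real inference logic is not shown in this thread.

```python
import csv
import io


def stream_tsv(src, dst, transform):
    """Read TSV rows from src one at a time, apply transform, write to dst.

    transform is a hypothetical per-row hook: it takes a list of fields
    and returns a new list of fields, or None to drop the row.
    Returns the number of rows written.
    """
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    written = 0
    for row in reader:
        out = transform(row)
        if out is not None:
            writer.writerow(out)
            written += 1
    return written


# Example with in-memory files; real use would pass open file objects
# (or sys.stdin and sys.stdout) instead.
example_in = io.StringIO("a\t1\nb\t2\n")
example_out = io.StringIO()
stream_tsv(example_in, example_out, lambda row: row + [str(len(row))])
```

Only one row is held in memory at a time, so the same function works
on input of any size and can be chained with other small filters.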

It's hard to imagine that I have a data integration problem since I'm
not trying to integrate multiple sources of data.  Each customer gives
me chunks of data, and I analyze the data for that customer in order
to infer things.  The only reason I use a database at all is to store
the things I've learned about the data for later retrieval via a Web
service.  Once I analyze the data, I can throw it away or simply
archive it in its raw form.
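
The "store what was learned, discard the raw data" step might look
something like the following sketch.  It assumes SQLite purely for
illustration; the table name, column names, and both function names
are hypothetical, and the actual database and schema are not stated in
this thread.

```python
import sqlite3


def save_inferences(conn, customer_id, inferences):
    """Store a dict of inferred facts for one customer.

    Re-running a batch overwrites earlier values for the same customer
    and name, so a failed run can simply be restarted from scratch.
    """
    with conn:  # commit on success, roll back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS inference ("
            "customer_id TEXT NOT NULL, "
            "name TEXT NOT NULL, "
            "value TEXT NOT NULL, "
            "PRIMARY KEY (customer_id, name))"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO inference (customer_id, name, value) "
            "VALUES (?, ?, ?)",
            [(customer_id, name, str(value))
             for name, value in inferences.items()],
        )


def load_inferences(conn, customer_id):
    """Return the stored facts for one customer as a dict of strings."""
    cursor = conn.execute(
        "SELECT name, value FROM inference WHERE customer_id = ?",
        (customer_id,),
    )
    return dict(cursor.fetchall())
```

A Web service handler would only ever call something like
load_inferences; the raw TSV never needs to enter the database.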

My understanding of data warehousing is that it's a way of storing
everything you can about data from a bunch of sources and then writing
deep SQL queries to learn deeper things about the data.  That doesn't
really match my situation.

By the way, my batches don't even take all that long to run.
Currently, it's on the order of 30 seconds.  If I have to start from
scratch whenever any part fails, it's no big deal.

This conversation has been really helpful.  It's definitely driven one
thing home for me.  One reason "tools, not policy" is so valuable is
that people love to disagree about policy.  However, once I've
written a good tool that's well documented and well tested, no one
disagrees about its usefulness.  Hence, it makes sense for me to focus
on writing good tools.  If I need to switch to a more complicated
policy later involving queues, data warehousing, etc., my small tools
are still going to be helpful.


I, for one, welcome our new Facebook overlords!
