custom data warehouse in python vs. out-of-the-box ETL tool
Martin P. Hellwig
martin.hellwig at dcuktec.org
Wed Sep 23 00:40:15 CEST 2009
> Does anyone have experience building a data warehouse in python? Any
> thoughts on custom vs using an out-of-the-box product like Talend or
> I have an integrated system Dashboard project that I was going to
> build using cross-vendor joins on existing DBs, but I keep hearing
> that a data warehouse is the way to go. e.g. I want to create orders
> and order_items with relations to members (MS Access DB), products
> (flat file) and employees (MySQL).
> Thanks in advance for any tips.
My experience is that if you enjoy hacking around databases and are
proficient in Python, then for small-scale solutions it is preferable to
do it yourself. If you need a large-scale solution with advanced
requirements, building it yourself is often the only way.
I have built a rather complex data warehouse system in the past (well,
actually more like a centralised synchronisation hub, with input and
reporting databases as satellite clients), shovelling data from around
500 databases (typically 5 GB in size each) spread over the world.
The only commercial solutions I reviewed were Business Objects Data
Integrator and Oracle Warehouse Builder.
These tools were quite flexible, and if you already have a license deal
which includes them I would definitely recommend taking more than just
a look.
If not, and you are comfortable with using Python to shovel data from A
to B and transform it at the same time (moving relational data
automatically into an EAV model and back again, for example), then
building your own solution will probably save you money and time (as
opposed to learning how to use that ETL tool).
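To give a feel for that kind of transform, here is a minimal sketch of an EAV round-trip; the function names and the dict-based row format are my own illustrative assumptions, not part of any particular tool:

```python
# Minimal sketch: flattening a relational row into entity-attribute-value
# (EAV) triples and rebuilding it. The names row_to_eav/eav_to_row and the
# dict-based rows are illustrative assumptions.

def row_to_eav(entity_id, row):
    """Turn one relational row (a dict) into a list of EAV triples."""
    return [(entity_id, attr, value) for attr, value in row.items()]

def eav_to_row(triples, entity_id):
    """Rebuild the row for one entity from a list of EAV triples."""
    return {attr: value for eid, attr, value in triples if eid == entity_id}

order = {"product": "widget", "quantity": 3}
triples = row_to_eav(42, order)          # [(42, 'product', 'widget'), (42, 'quantity', 3)]
assert eav_to_row(triples, 42) == order  # round-trips back to the original row
```

The real work, of course, is in typing the values and handling NULLs, but the shape of the transform is this simple.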
This will require you to have at least interest in the following subjects:
- Authorization: may everybody use all data, or should access be
limited to a subset depending on the data?
(My solution was one centralised hub containing all data, accessible
only to special 'client' servers strictly maintained by me, which
synced only the data relevant to them.)
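A sketch of that kind of per-client subsetting, assuming a made-up scope table and a region column (neither is from my actual system, they just show the idea):

```python
# Sketch of per-client subsetting on a central hub: each 'client' server
# only ever receives the rows matching its scope. The scope table and the
# region column are hypothetical examples.

CLIENT_SCOPE = {
    "emea-reporting": {"region": "EMEA"},
    "us-reporting": {"region": "US"},
}

def rows_for_client(rows, client):
    """Return only the rows this client server is allowed to sync."""
    scope = CLIENT_SCOPE.get(client)
    if scope is None:
        return []  # unknown clients get nothing, not everything
    return [r for r in rows if all(r.get(k) == v for k, v in scope.items())]

rows = [{"id": 1, "region": "EMEA"}, {"id": 2, "region": "US"}]
assert rows_for_client(rows, "emea-reporting") == [{"id": 1, "region": "EMEA"}]
```

The important design choice is that the filtering lives on the hub, so a compromised client never even receives data outside its scope.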
- Authenticity: if you have different values for the same thing, which
one should be considered authoritative, and may the authoritative value
be pushed back to the non-authoritative sources?
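One simple way to settle such conflicts is a fixed source-priority policy; the sketch below uses made-up source names and a single global ordering, whereas in practice the policy is usually decided per attribute:

```python
# Sketch: when sources disagree on the same field, pick the value from
# the most authoritative source. The source names and the priority
# ordering are invented for illustration.

PRIORITY = {"master-hub": 0, "production-mirror": 1, "satellite": 2}

def authoritative(values):
    """values: list of (source, value) pairs; return the winning value."""
    source, value = min(values, key=lambda sv: PRIORITY[sv[0]])
    return value

assert authoritative([("satellite", "old"), ("master-hub", "new")]) == "new"
```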
- Synchronisation: you really don't want to push/pull the entire
database content on every cycle, so how can you delta it, and is there
a way to do this only when the data changes (push vs. pull)?
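The classic delta approach is a high-water mark on a modification stamp; the sketch below assumes a last_modified column on the source rows, which is an assumption about the schema rather than something every source database will have:

```python
# Sketch of delta extraction with a high-water mark: only rows whose
# last_modified stamp is newer than the previous sync are pulled.
# The last_modified column is an assumption about the source schema.

def pull_delta(rows, high_water_mark):
    """Return the changed rows and the new high-water mark."""
    changed = [r for r in rows if r["last_modified"] > high_water_mark]
    new_mark = max((r["last_modified"] for r in changed), default=high_water_mark)
    return changed, new_mark

rows = [{"id": 1, "last_modified": 10}, {"id": 2, "last_modified": 25}]
changed, mark = pull_delta(rows, 10)
assert changed == [{"id": 2, "last_modified": 25}] and mark == 25
```

Note this is a pull-style delta; a push-style setup would instead have the source notify the hub when something changes, avoiding the polling entirely.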
- ATOMIC: how long may the data be out of date, and is it allowed to
- Using and maintaining multiple databases, preferably spread over
multiple systems. I had a server for each production DB, a server that
mirrored that production DB with some added columns per table for
external synchronisation purposes, and a master synchronisation server
(so in essence all data was copied three times; not very efficient, but
good if you like to play it on the safe side).
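The mirror layout can be sketched with the sqlite3 module from the standard library; the table and the extra column names (sync_state, sync_ts) are illustrative, not the ones I actually used:

```python
import sqlite3

# Sketch of the mirror layout: the production table's own columns plus
# extra per-table columns used only for external synchronisation.
# Table and column names here are illustrative.

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE orders_mirror ("
    " id INTEGER PRIMARY KEY,"
    " total REAL,"
    " sync_state TEXT NOT NULL DEFAULT 'dirty',"  # not yet pushed to the hub
    " sync_ts TEXT)"                              # when it was last pushed
)
con.execute("INSERT INTO orders_mirror (id, total) VALUES (1, 9.50)")

# The sync job only has to look at the dirty rows, not the whole table.
dirty = con.execute(
    "SELECT id FROM orders_mirror WHERE sync_state = 'dirty'"
).fetchall()
assert dirty == [(1,)]
```

Keeping the sync bookkeeping out of the production tables is what makes the extra copy worthwhile: the production schema stays untouched.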
'If consumed, best digested with added seasoning to own preference.'