[Tutor] Optimal solution in dealing with huge databases in python
Alan Gauld
alan.gauld at btinternet.com
Thu Jan 25 10:56:34 CET 2007
"Shadab Sayani" <shadabsayani at yahoo.com> wrote
> The data I need to deal with is in 100s of GB.
> I am using postgresql backend and SQLALCHEMY ORM.
All ORMs will introduce a significant performance hit.
If you really need high speed, and populating a 100G+ database
probably is included in that, then you should look at raw SQL.
In fact for really big data loads most databases provide a
C api that goes under the SQL, because even SQL is
relatively slow.
As an example, we use a large Oracle database at work.
Loading about 3 Terabytes of data through an ORM took
over 3 months! Loading it through SQL took about 3 days.
Loading it through the C API took less than a day.
Your mileage may vary since a lot depends on locks,
indexes etc etc. And of course the server spec!
But for loading large data volumes get as close to the metal
as possible. Once the data is loaded you can use the
ORM to simplify the application code for extracting and
modifying the data.
> I need to read the bio datafiles and parse them and
> then store them in database.
Parsing them and preparing the SQL statements can
be done in Python. But the actual loading I suggest
should be done in SQL if possible (The C aPI should
be a last resort - its frought with danger!)
> Please suggest some viable solution to handle such
> enormous data from python.
A few hundred gigabytes is not too enormous these days
but you are never going to achieve times of less than hours.
You do need to be realistic about that. And if you are using
a standard PC spec server instead of a large multi CPU
box with SCSI/RAID disc arrays etc then you could be
looking at days.
The other important factor is your data schema. The more
tables, joins, indexes etc the database has to maintain the
more work it takes and the sloewer it gets. The 3TB example
I gave had over 2000 tables, so it was always going to be
slow. If you have a single unindexed table then it will be
much simpler. (But the queries later will be much harder!)
--
Alan Gauld
Author of the Learn to Program web site
http://www.freenetpages.co.uk/hp/alan.gauld
More information about the Tutor
mailing list