[Tutor] Optimal solution in dealing with huge databases in python

Alan Gauld alan.gauld at btinternet.com
Thu Jan 25 10:56:34 CET 2007

"Shadab Sayani" <shadabsayani at yahoo.com> wrote 

> The data I need to deal with is  in 100s of GB.
> I am using postgresql backend and SQLALCHEMY ORM.

All ORMs will introduce a significant performance hit.
If you really need high speed, and populating a 100G+ database 
probably is included in that, then you should look at raw SQL.
In fact for really big data loads most databases provide a 
C api that goes under the SQL, because even SQL is 
relatively slow.

As an example, we use a large Oracle database at work.
Loading about 3 Terabytes of data through an ORM took 
over 3 months! Loading it through SQL took about 3 days.
Loading it through the C API took less than a day.

Your mileage may vary since a lot depends on locks, 
indexes etc etc. And of course the server spec!

But for loading large data volumes get as close to the metal 
as possible. Once the data is loaded you can use the 
ORM to simplify the application code for extracting and 
modifying the data.

> I need  to read the bio datafiles and parse them and 
> then store them in  database.

Parsing them and preparing the SQL statements can 
be done in Python. But the actual loading I suggest 
should be done in SQL if possible (The C aPI should 
be a last resort - its frought with danger!)

> Please suggest some viable solution to handle such 
> enormous data from  python.

A few hundred gigabytes is not too enormous these days
but you are never going to achieve times of less than hours.
You do need to be realistic about that. And if you are using
a standard PC spec server instead of a large multi CPU 
box with SCSI/RAID disc arrays etc then you could be 
looking at days.

The other important factor is your data schema. The more 
tables, joins, indexes etc the database has to maintain the 
more work it takes and the sloewer it gets. The 3TB example 
I gave had over 2000 tables, so it was always going to be 
slow. If you have a single unindexed table then it will be 
much simpler. (But the queries later will be much harder!)

Alan Gauld
Author of the Learn to Program web site

More information about the Tutor mailing list