[Tutor] Optimal solution in dealing with huge databases in Python

Alan Gauld alan.gauld at btinternet.com
Thu Jan 25 18:23:48 CET 2007

"Shadab Sayani" <shadabsayani at yahoo.com> wrote 

> Thank you very much for the immediate response. I didn't
> get the point of loading the data using SQL. What does
> that mean?

It means writing a SQL file that can then be run from 
the database's SQL prompt. I don't know what that means 
in Postgres terms since I mainly use Oracle. But in Oracle 
it would involve typing

$ sqlplus user/password @loadfile.sql

And sqlplus would execute the SQL commands in loadfile.sql.

Loadfile.sql would in turn contain lots of SQL commands like
(with your own table and column names, of course):

INSERT INTO people (first, last, age, phone)
VALUES ( 'fred', 'bloggs', 45, '01773-987321');

One thing to note is that due to caching issues you may find 
it works better if you keep the individual loader files fairly 
small - say 1000 or so records each. On other databases (DB2 
for example) very big data files seem to be faster; it just 
depends on how the internal SQL engine works.
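If it helps, here is a rough sketch of splitting a big batch of 
INSERT statements into loader files of about 1000 records each. 
The "load_%03d.sql" filename pattern is just an example; the 
actual file write is commented out so you can see the shape first:

```python
def chunk(statements, size=1000):
    """Yield successive slices of at most `size` statements."""
    for i in range(0, len(statements), size):
        yield statements[i:i + size]

# Example: 2500 dummy INSERT statements -> three loader files
statements = ["INSERT INTO people VALUES (%d);" % n for n in range(2500)]
for n, batch in enumerate(chunk(statements)):
    filename = "load_%03d.sql" % n
    # with open(filename, "w") as f:
    #     f.write("\n".join(batch))
    print(filename, len(batch))
```

That prints load_000.sql and load_001.sql with 1000 records each 
and load_002.sql with the remaining 500.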

> Does it mean bulk file load? If so, there will be an
> I/O performance hit, right?

There will always be a big I/O load if you are loading data 
into a database; at some stage it must get onto the disk!

> Writing the parsed tokens in the csv format in a file.

I would use Python to translate the CSV data into SQL 
commands like the INSERT statement above.
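Something along these lines, assuming a simple comma-separated 
layout like the example record above (the table and column names 
are made up, and the quoting here is naive; for real data you'd 
want proper escaping, or better still parameterised queries via 
the DB-API):

```python
import csv
import io

def csv_to_inserts(csv_text, table, columns):
    """Translate CSV rows into SQL INSERT statements.

    Purely numeric fields are left unquoted; everything else is
    single-quoted with embedded quotes doubled.
    """
    statements = []
    for row in csv.reader(io.StringIO(csv_text)):
        values = ", ".join(
            v if v.isdigit() else "'%s'" % v.replace("'", "''")
            for v in row
        )
        statements.append(
            "INSERT INTO %s (%s) VALUES (%s);"
            % (table, ", ".join(columns), values)
        )
    return statements

data = "fred,bloggs,45,01773-987321\n"
for stmt in csv_to_inserts(data, "people", ["first", "last", "age", "phone"]):
    print(stmt)
```

Note that '01773-987321' stays quoted (the hyphen makes it 
non-numeric) while 45 goes in as a number, which matches the 
loadfile example above.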

> Do you have any idea about the C API for PostgreSQL,
> and some documentation on using it?

Nope. I don't use Postgres.
There are usually two levels of API: one that allows 
you to write SQL commands from C, which is not much 
advantage over the Python DB-API, and one that bypasses 
SQL to allow you to access the raw data on the disk; 
that's the fast one. But it does require a good 
knowledge of how the underlying data structures are 
stored within the files. Oracle, DB2, Informix etc. will 
have that documented somewhere, but open source databases 
may require you to read the source! As I say, it's fast, 
but very, very risky.

Alan G.
