[Tutor] Optimal solution in dealing with huge databases in python
Eric Walstad
eric at ericwalstad.com
Thu Jan 25 23:36:49 CET 2007
Alan Gauld wrote:
> "Shadab Sayani" <shadabsayani at yahoo.com> wrote
>
>> The data I need to deal with is in 100s of GB.
>> I am using postgresql backend and SQLALCHEMY ORM.
>
> All ORMs will introduce a significant performance hit.
> If you really need high speed, and populating a 100G+ database
> probably is included in that, then you should look at raw SQL.
> In fact for really big data loads most databases provide a
> C api that goes under the SQL, because even SQL is
> relatively slow.
I've had this experience, too. I routinely load a couple million
records from a tab delimited file into our PostgreSQL database with
Python. My import script:
1. iterates over the tab file a line/record at a time
2. does some data validation, some data manipulation
3. writes an SQL file with the massaged data
4. calls psql <dbname> -f <python-generated sql file>
Generating the SQL file in python goes amazingly fast, considering the
amount of validation and manipulation (normalizing addresses, generating
hashes, etc) - about 1.5minutes/million records.
The SQL it generates does the usual DB stuff including dropping indexes
and constraints, COPY FROM stdin, regenerate indexes, replace
constraints. In my experience, psql is MUCH faster at these kinds of
imports than even direct python interaction with psycopg. Adding an ORM
on top of the database connector (psycopg in my case) slows things down
even more.
As a rough metric, the import I described takes a little over 2 hours on
my 2GHz/1GB laptop.
Good luck,
Eric.
More information about the Tutor
mailing list