PEP 249 (database api) -- executemany() with iterable?

John Nagle nagle at animats.com
Wed Oct 13 13:03:24 EDT 2010


On 10/12/2010 6:01 PM, Lawrence D'Oliveiro wrote:
> In message <4cb4ba4e$0$1641$742ec2ed at news.sonic.net>, John Nagle wrote:
>
>> In general, if you find yourself making millions of
>> SQL database requests in a loop, you're doing it wrong.
>
> I’ve done this. Not millions, but certainly on the order of tens of
> thousands.

    It's a scaling issue.  Millions of INSERT or UPDATE requests can
take hours.  That's when you need the speedup of bulk loading.
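
     (This is also where executemany() earns its keep.  Even without
LOAD DATA INFILE, batching the rows through executemany() instead of
calling execute() once per row is a large win.  A rough sketch,
assuming MySQLdb and a made-up "parts" table; any PEP 249 driver with
the same paramstyle works the same way:

import MySQLdb

conn = MySQLdb.connect(db="example")      # connection details omitted
cur = conn.cursor()

rows = [("widget", 3), ("gadget", 7)]     # stand-in for the real data
cur.executemany(
    "INSERT INTO parts (name, qty) VALUES (%s, %s)",
    rows)
conn.commit()

MySQLdb, for one, turns a simple INSERT like that into a single
multi-row statement, which is far faster than a Python loop of
single-row INSERTs, though still nowhere near LOAD DATA INFILE speed.)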

>>       Big database loads are usually done by creating a text file
>> with the desired data, then using a LOAD DATA INFILE command.
>
> May not always be flexible enough.

    True; LOAD DATA INFILE can do nothing but load data.  If your
inserts need SELECTs or JOINs against existing tables, you may have
to do it the long way.
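
     When a plain load is enough, the pattern is just: dump the rows
to a temporary text file, then hand that file to the server.  A rough
sketch, assuming MySQLdb with LOCAL infiles enabled on both ends; the
table, the file name, and the source_rows() generator are all made up:

import MySQLdb

def source_rows():
    # stand-in for wherever the data really comes from
    yield ("widget", 3)
    yield ("gadget", 7)

# Write a tab-separated file.  Real data needs tabs, newlines and
# backslashes escaped before it goes in here.
with open("/tmp/parts.tsv", "w") as f:
    for name, qty in source_rows():
        f.write("%s\t%d\n" % (name, qty))

conn = MySQLdb.connect(db="example", local_infile=1)
cur = conn.cursor()
cur.execute(
    "LOAD DATA LOCAL INFILE '/tmp/parts.tsv' INTO TABLE parts "
    "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'")
conn.commit()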

>> This (in MySQL) is tens to hundreds of times faster than doing individual
>> INSERT or UPDATE commands.
>
> Why should that be? The database cannot read a text file any faster than I
> can.

     Because the indices are updated in one big batch at the end of
the load, rather than after each individual change to the table.

     Also note that there are some issues with doing a huge volume of
updates in one MySQL InnoDB transaction.  The system has to keep the
data needed to undo the updates, and there's a limit on the amount of
pending transaction history that can be stored.
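
     If you do have to go the INSERT route, the usual workaround is to
commit in chunks, so no single InnoDB transaction piles up millions of
rows of undo data.  A sketch with made-up names (the chunk size is a
guess; tune it for your data):

import MySQLdb

def source_rows():
    # stand-in for wherever the data really comes from
    yield ("widget", 3)
    yield ("gadget", 7)

conn = MySQLdb.connect(db="example")
cur = conn.cursor()

CHUNK = 10000
batch = []
for row in source_rows():
    batch.append(row)
    if len(batch) >= CHUNK:
        cur.executemany(
            "INSERT INTO parts (name, qty) VALUES (%s, %s)", batch)
        conn.commit()             # releases the undo data for this chunk
        batch = []
if batch:
    cur.executemany(
        "INSERT INTO parts (name, qty) VALUES (%s, %s)", batch)
    conn.commit()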

     It's common to load data into a new, empty table, then, once
the reload has succeeded, do a RENAME like CURRENT->OLD, NEW->CURRENT.
Renaming several tables in a single RENAME TABLE statement is atomic
and interlocked with other operations, so you can replace an entire
table on a live server.
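
     Roughly like this, with placeholder names; the multi-table RENAME
is one atomic operation, so readers never see a half-swapped table:

import MySQLdb

conn = MySQLdb.connect(db="example")
cur = conn.cursor()

cur.execute("DROP TABLE IF EXISTS parts_old")
# ... bulk-load the fresh copy into a new table, parts_new, here ...
cur.execute("RENAME TABLE parts TO parts_old, parts_new TO parts")
cur.execute("DROP TABLE parts_old")   # discard the superseded copy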

     I have some bulk databases which are updated from external
sources.  The PhishTank database is updated with UPDATE statements
every three hours. But the Open Directory database is updated by
downloading a big gzipped file of XML, creating a new database
table, then renaming.  That load takes hours, once a week.

     (All this applies to MySQL, and to some extent, Postgres.
If you're using SQLite, it's different.  But a million records
is big for SQLite, which is, after all, a "lite" database.
At 10,000 records, you don't have to worry about any of this
stuff.  At 1,000,000 records, you do.)

					John Nagle


