PEP 249 (database api) -- executemany() with iterable?
John Nagle
nagle at animats.com
Wed Oct 13 13:03:24 EDT 2010
On 10/12/2010 6:01 PM, Lawrence D'Oliveiro wrote:
> In message<4cb4ba4e$0$1641$742ec2ed at news.sonic.net>, John Nagle wrote:
>
>> In general, if you find yourself making millions of
>> SQL database requests in a loop, you're doing it wrong.
>
> I’ve done this. Not millions, but certainly on the order of tens of
> thousands.
It's a scaling issue. Millions of INSERT or UPDATE requests can
take hours. That's when you need the speedup of bulk loading.
>> Big database loads are usually done by creating a text file
>> with the desired data, then using a LOAD DATA INFILE command.
>
> May not always be flexible enough.
True; you can't do anything with LOAD DATA INFILE but load data.
If you need SELECTs or JOINs within your INSERTs, you may have to do it
the long way.
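(The thread's subject, executemany() from PEP 249, is the middle
ground: still ordinary INSERTs, but batched through one call so the
driver can cut down on round trips. A minimal sketch using the stdlib
sqlite3 module as a stand-in for a MySQL driver; the table and column
names are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (url TEXT, status TEXT)")

# executemany() accepts any iterable of parameter tuples (PEP 249),
# so rows can be generated lazily instead of built up in one big list.
rows = (("http://example.com/%d" % i, "ok") for i in range(10000))
conn.executemany("INSERT INTO urls VALUES (?, ?)", rows)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM urls").fetchone()[0])  # -> 10000
```

Whether a given driver actually batches the statements or just loops
internally is implementation-defined, but the API at least permits the
optimization.)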
>> This (in MySQL) is tens to hundreds of times faster than doing individual
>> INSERT or UPDATE commands.
>
> Why should that be? The database cannot read a text file any faster than I
> can.
Because the indices are updated in one big update, rather than
after each change to the database.
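(The index effect can be reproduced with stdlib sqlite3; this is a
sketch of the principle, not a benchmark of MySQL itself. Load the rows
first and build the index in one pass, versus maintaining the index on
every INSERT:

```python
import sqlite3
import time

def load(index_first):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (k INTEGER, v TEXT)")
    if index_first:
        conn.execute("CREATE INDEX t_k ON t (k)")  # maintained per row
    t0 = time.perf_counter()
    conn.executemany("INSERT INTO t VALUES (?, ?)",
                     ((i, str(i)) for i in range(200000)))
    if not index_first:
        conn.execute("CREATE INDEX t_k ON t (k)")  # built in one bulk pass
    conn.commit()
    return time.perf_counter() - t0

per_row = load(True)    # index updated after each insert
bulk = load(False)      # index built once, after the load
```

On a big MySQL table the gap is much larger, since the index pages are
on disk rather than in memory.)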
Also note that there are some issues with doing a huge volume of
updates in one MySQL InnoDB transaction. The system has to keep the
data needed to undo the updates, and there's a limit on the amount of
pending transaction history that can be stored.
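(When LOAD DATA isn't an option, one common workaround is to commit in
fixed-size batches, so no single transaction accumulates an unbounded
undo log. A sketch against the PEP 249 API; the helper name and batch
size are arbitrary:

```python
import sqlite3
from itertools import islice

def insert_in_batches(conn, sql, rows, batch_size=10000):
    """Feed rows to executemany() in chunks, committing after each
    chunk, so pending transaction history stays bounded."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        conn.executemany(sql, batch)
        conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
insert_in_batches(conn, "INSERT INTO t VALUES (?)",
                  ((i,) for i in range(25000)), batch_size=10000)
```

The tradeoff is that a failure mid-load leaves a partially loaded
table, which is another argument for loading into a scratch table and
renaming.)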
It's common to load data into a new, empty table, then, once
the reload has succeeded, do a RENAME like CURRENT->OLD, NEW->CURRENT.
Rename of multiple tables is atomic and interlocked with other
operations, so you can replace an entire table on a live server.
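(In MySQL the swap is one statement, "RENAME TABLE urls TO urls_old,
urls_new TO urls;", which renames both tables atomically. The same
pattern can be sketched with stdlib sqlite3, where the two renames
share a single transaction; the table names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (v TEXT)")
conn.execute("INSERT INTO urls VALUES ('stale')")

# Build and verify the replacement table off to the side...
conn.execute("CREATE TABLE urls_new (v TEXT)")
conn.execute("INSERT INTO urls_new VALUES ('fresh')")
conn.commit()

# ...then swap both names in one transaction, so readers never
# observe a moment with no 'urls' table at all.
with conn:
    conn.execute("ALTER TABLE urls RENAME TO urls_old")
    conn.execute("ALTER TABLE urls_new RENAME TO urls")

print(conn.execute("SELECT v FROM urls").fetchone()[0])  # -> fresh
```

The old table sticks around as urls_old until you drop it, which also
gives you a cheap rollback if the new load turns out to be bad.)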
I have some bulk databases which are updated from external
sources. The PhishTank database is updated with UPDATE statements
every three hours. But the Open Directory database is updated by
downloading a big gzipped file of XML, creating a new database
table, then renaming. That load takes hours, once a week.
(All this applies to MySQL, and to some extent, Postgres.
If you're using SQLite, it's different. But a million records
is big for SQLite, which is, after all, a "lite" database.
At 10,000 records, you don't have to worry about any of this
stuff. At 1,000,000 records, you do.)
John Nagle