PEP 249 (database api) -- executemany() with iterable?

Martin Gregorie martin at address-in-sig.invalid
Sat Oct 16 20:02:21 EDT 2010


On Sun, 17 Oct 2010 11:27:17 +1300, Lawrence D'Oliveiro wrote:

> In message <i96squ$aqj$1 at localhost.localdomain>, Martin Gregorie wrote:
> 
>> On Thu, 14 Oct 2010 16:36:34 +1300, Lawrence D'Oliveiro wrote:
>> 
>>> In message <4cb5e659$0$1650$742ec2ed at news.sonic.net>, John Nagle
>>> wrote:
>>> 
>>>>      Also note that there are some issues with doing a huge volume of
>>>> updates in one MySQL InnoDB transaction.  The system has to keep the
>>>> data needed to undo the updates, and there's a limit on the amount of
>>>> pending transaction history that can be stored.
>>> 
>>> How does “load data” avoid this? Is that not a transaction too?
>>>
>> Not usually. It's faster because there's no journalling overhead.
> 
> So what happens if there’s an error part way? Do you end up with some
> records inserted and others not?
> 
Since the load is usually used to populate an empty table, you simply 
empty the table and rerun the load if it doesn't complete for some reason. 
If you're doing a partial load, you'd normally dump the table and/or take 
a backup beforehand.
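
For example, with MySQLdb the recovery is just a TRUNCATE and a rerun. 
This is only a sketch - the connection details, table and file names are 
invented:

import MySQLdb

# Invented connection details, table and file names - illustration only.
conn = MySQLdb.connect(host="localhost", user="loader", passwd="secret",
                       db="warehouse", local_infile=1)
cur = conn.cursor()

try:
    cur.execute("LOAD DATA LOCAL INFILE 'transactions.tsv' INTO TABLE staging")
    conn.commit()
except MySQLdb.Error:
    # The load died part way through: empty the table and rerun it,
    # or restore the backup taken before a partial load.
    cur.execute("TRUNCATE TABLE staging")
    conn.commit()
    raise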

>> The loader takes out an exclusive table lock, dumps the data into the
>> table, rebuilds indexes and releases the lock.
> 
> That’s not acceptable in a situation where other processes are
> concurrently making queries on the table.
> 
Obviously, but you're unlikely to be running anything while you're doing 
a bulk load. If the database is used to support interactive processing 
you'd normally only use bulk loading during recovery after a disk failure 
or as part of a database migration or restructuring exercise.

In fact the only case I know of where bulk loads are a normal part of the 
data life cycle is data warehousing. 
The last month's transactions are removed from the online system and bulk 
loaded into the data warehouse for analysis. In this case the bulk loads 
are either done overnight or during a weekend. However, any RDBMS that 
has been optimised for data warehousing will almost certainly have the 
ability to segment its fact table. This has two benefits:

- you can bulk load a new segment offline and put it online when the
  load is complete. This can be done during normal operation.

- the DBMS can automatically split a query and run parallel copies
  against each segment before combining the result sets for return
  to the program that made the query (see the sketch after this list).
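
The DBMS does that split internally, but if it helps to see the idea, 
here's a rough hand-rolled version of the same thing from Python against 
a set of invented per-month segment tables (MySQLdb assumed):

import threading
import MySQLdb

SEGMENTS = ["fact_2010_07", "fact_2010_08", "fact_2010_09"]  # invented names
QUERY = "SELECT region, SUM(amount) FROM %s GROUP BY region"

results = []
lock = threading.Lock()

def query_segment(table):
    # One connection per thread; table names can't be bound as parameters,
    # so the (trusted, hard-coded) name is spliced into the SQL text.
    conn = MySQLdb.connect(db="warehouse")
    cur = conn.cursor()
    cur.execute(QUERY % table)
    with lock:
        results.extend(cur.fetchall())
    conn.close()

threads = [threading.Thread(target=query_segment, args=(t,)) for t in SEGMENTS]
for t in threads:
    t.start()
for t in threads:
    t.join()

# 'results' now holds the per-segment partial sums; a final pass would
# merge rows that share the same region.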
   
>>> Seems to me this isn’t going to help, since both old and new tables
>>> are on the same disk, after all. And it’s the disk access that’s the
>>> bottleneck.
>>>
>> There's a lot of overhead in journalling - much more than in applying
>> changes to a table. The before and after images *must* be flushed to
>> disk on commit. In UNIX terms fsync() must be called on the journal
>> file(s) and this is an expensive operation on all OSes because
>> committing a series of small transactions can cause the same disk block
>> to be written several times. However, the table pages can safely be
>> left in the DBMS cache and flushed as part of normal cache operation
>> since, after a crash, the table changes can always be recovered from a
>> journal roll-forward. A good DBMS will do that automatically when it's
>> restarted.
> 
> And how does this address the point of whether to load the new records
> into a new table or the same table?
>
That's a system design choice. There's a limited set of options and it's 
up to the designer to choose one:
  
- if you want to load data into a new table and rename CURRENT->OLD, 
  NEW->CURRENT, you have to design all queries to operate on both OLD
  and CURRENT tables (see the sketch after this list)

- if your DBMS allows table segmentation and lets you take segments
  offline, you can bulk load as I already described.

- you can stop the system during a quiet time while you do a bulk load
  and/or minimise overheads by using a single very large transaction.

- you can bite the bullet, accept the performance hit and simply run a
  relatively slow load process using small transactions during normal
  operation.

There are different overheads associated with each choice. A good system 
designer will understand them, know their relative weight for the target 
hardware and database, and use this information to optimise the design.
  
What I said about transactional and journalling overheads was background: 
unless you understand that, you'll never see why bulk loading should be 
expected to be a lot faster than using transactions designed to logically 
group related operations or, much worse, relying on implied commits. 
Implied commits really are the worst of all worlds since they maximise 
journalling overhead.
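
In PEP 249 terms the difference is just where the commits fall. A sketch 
with an invented table, MySQLdb assumed:

import MySQLdb

conn = MySQLdb.connect(db="test")
cur = conn.cursor()
rows = [("2010-10-01", 42.0), ("2010-10-02", 17.5)]  # invented sample data

# Worst case - implied/auto commit, one journal flush per statement:
#   conn.autocommit(True)
#   for row in rows:
#       cur.execute("INSERT INTO ledger (day, amount) VALUES (%s, %s)", row)

# Better - one logical transaction for the whole related batch, so the
# journal is forced to disk once, at commit time:
cur.executemany("INSERT INTO ledger (day, amount) VALUES (%s, %s)", rows)
conn.commit()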


-- 
martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |


