[DB-SIG] Improved support for prepared SQL statements

Michael Bayer mike_mp at zzzcomputing.com
Thu Dec 18 21:23:02 CET 2014


> On Dec 18, 2014, at 2:00 PM, SF Markus Elfring <elfring at users.sourceforge.net> wrote:
> 
>> Can someone please take up this task and prove to me with benchmarks that this effort will be worthwhile
>> (or at least show me what I’m going to have to put into my documentation as to when these new APIs are appropriate) ? 
> 
> Some database software supports the processing of prepared SQL statements
> by well-known application programming interfaces for decades, doesn't it?
> 
> Would you like to reuse any more experiences about execution speed and program
> safety from software development history?

Below is a benchmark script, psycopg2 vs. pg8000.  The former uses no prepared statements whatsoever; the latter uses prepared statements for all executions.   To make it even more “fair”, I will illustrate an executemany(), so that pg8000 can definitely take advantage of “caching” the statement at the prepared statement level.

The output times for INSERTing 50000 rows, using five batches of 10000 (so that pg8000 can re-use the same prepared statement 10000 times within each batch), are:

	Total time for psycopg2: 4.992107
	Total time for pg8000: 10.770641

Now as it turns out, this test is entirely unfair.  psycopg2 is written in pure C code and talks to libpq directly, whereas pg8000 is pure Python and implements the PostgreSQL wire protocol itself.  There is no chance of pg8000 ever approaching the speed of psycopg2.

In my experience with execution speed in Python, enhancements like being able to cache prepared statements are dwarfed by the key issues in Python: function call overhead and code complexity.  In this case, choosing a DBAPI because one uses prepared statements and the other does not would not be the right approach.  It doesn’t matter that prepared statements are traditionally useful.  The complexity they would introduce, where programmers take a guess that using prepared statements, plus the statement-caching architecture necessary to surround their use, will make their program faster, will more often than not have a net negative effect.

Working with Openstack developers, I regularly have to shoot down all kinds of changes that are made in the name of “will (theoretically) perform better”, yet in reality, when benchmarked, they perform worse, or hardly better at all, at the expense of greater complexity.   The theories are supported by blanket statements such as “iterators are faster”, “Core statements are faster than the ORM”, and “asynchronous code is faster than synchronous”, which at face value are true in some circumstances, but which in many specific situations are not at all better from a performance standpoint and only complicate the code.

Basically I want there to be a very good reason for this feature to be an explicit part of the DBAPI, because it is going to give me personally a lot of extra headaches when people start asking for it in the name of “performance”, especially since in practice psycopg2, pg8000 and perhaps mxODBC will be the only DBAPIs to add this feature; the MySQL DBAPIs are unlikely to be interested in it, and the pysqlite API almost certainly won’t, as that project has been stalling for years just on very rudimentary transactional issues that remain open.   A DBAPI can already choose to make use of cached prepared statements as an internal optimization, using an LRU cache that could be configurable via the connect() function, or just within the executemany() API as some do right now.
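
For illustration only, here’s a rough sketch of that kind of internal optimization.  None of these names are the API of any real driver; the _PreparedStatementCache class and the _prepare() / _execute_prepared() methods are hypothetical stand-ins for whatever handle a driver keeps for a server-side prepared statement:

from collections import OrderedDict


class _PreparedStatementCache(object):
    # hypothetical LRU cache of prepared-statement handles, keyed by SQL text
    def __init__(self, maxsize=100):
        self.maxsize = maxsize
        self._cache = OrderedDict()

    def get(self, sql, prepare_fn):
        try:
            # cache hit: pop and re-insert so the entry becomes most recent
            handle = self._cache.pop(sql)
        except KeyError:
            # cache miss: PREPARE on the server; if the cache is full,
            # evict the least recently used entry (a real driver would
            # also DEALLOCATE the evicted statement)
            handle = prepare_fn(sql)
            if len(self._cache) >= self.maxsize:
                self._cache.popitem(last=False)
        self._cache[sql] = handle
        return handle


class Cursor(object):
    # hypothetical cursor showing where such a cache plugs in; the pep-249
    # surface is unchanged, and the cache size could be exposed as an extra
    # keyword argument to connect()
    def __init__(self, connection, cache):
        self._connection = connection
        self._cache = cache

    def execute(self, operation, parameters=()):
        handle = self._cache.get(operation, self._connection._prepare)
        return self._connection._execute_prepared(handle, parameters)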

In my work with consumers of the DBAPI, prepared statement support is not high on the list of needs (or even on it at all).  What people are looking for are DBAPIs that handle unicode very well, support asynchronous APIs such as asyncio as well as implicit environments such as gevent, offer lots of robust datatype support, and have very robust failure-mode handling (all areas in which psycopg2 does extremely well).  These are also the areas where the actual pep-249 is IMO sorely lacking in guidance and could use more immediate attention first.


import random
import timeit
import psycopg2
import pg8000


def setup(conn):
    # create a fresh, empty table before the timed runs
    cursor = conn.cursor()
    cursor.execute("drop table if exists data")
    cursor.execute(
        "create table data (id SERIAL primary key, data VARCHAR(100))")
    cursor.close()


def run_test(conn):
    # one batch: INSERT 10000 rows via executemany(), which is where
    # pg8000 gets to re-use a single prepared statement for the whole batch
    cursor = conn.cursor()
    cursor.executemany(
        "insert into data (data) values (%s)",
        [
            ("some value: %d" % random.randint(0, 10000),)
            for counter in range(10000)
        ]
    )
    cursor.close()


def do_time(dbapi):
    # conn is made global so that timeit's setup string can import it
    # from __main__
    global conn
    conn = dbapi.connect(
        user='scott', password='tiger',
        database='test', host='localhost')

    # time five executemany() batches of 10000 rows each (50000 rows total)
    time = timeit.timeit(
        "run_test(conn)",
        "from __main__ import run_test, setup, conn; setup(conn)",
        number=5
    )
    conn.close()
    return time

psycopg2_time = do_time(psycopg2)
pg8000_time = do_time(pg8000)

print("Total time for psycopg2: %f" % psycopg2_time)
print("Total time for pg8000: %f" % pg8000_time)



