[DB-SIG] DB-API 1.1

M.-A. Lemburg mal@lemburg.com
Thu, 04 Jun 1998 01:36:37 +0200


Jim Fulton wrote:
> 
> M.-A. Lemburg wrote:
> >
> > M.-A. Lemburg wrote:
> > >
> > > I'd like to start discussing updating/extending the DB-API 1.0 to
> > > a new version. IMHO, this needs to be done to
> > >         a. clarify a few minor tidbits
> > >         b. enable a more informative exception mechanism
> > >
> > > You can have a look at an alpha version of 1.1 at:
> > >         http://starship.skyport.net/~lemburg/DatabaseAPI-1.1.html
> >
> > Looks like everybody is too busy ... oh well. I'll wait another
> > week or so then and repost the RFC.
> 
> I applaud you for taking this on....
> 
> Here are some comments.  I don't know what has changed since

Not much.

> 1.0, so some things I gripe about may be inherited from 1.0.

That's ok ;-)

> 
> I didn't comment enough on the 1.0 spec, so I'll try to make up
> for it now.
> 
> >
> >
> > Module Interface
> >
> > The database interface modules should typically be named with
> > something terminated by db.
> 
> Why?

Good question :-) See my previous post for an alternative:

1. Instead of defining the connection constructor to be named
   <modulename>, I think Connect(...) is a better choice (it makes
   porting applications from one database to another easier).

> 
> > Existing examples are: oracledb,
> > informixdb, and pg95db. These modules should export several names:
> >
> > modulename(connection_string_or_tuple)
> 
> Why use the module name?  Why not some descriptive name
> like: 'Connect'?

Right.

> 
> >      Constructor for creating a connection to the database. Returns a
> >      Connection Object. In case a connection tuple is used, it should
> >      follow this convention: (data_source_name, user, password).
> 
> Why allow a string or a tuple?  Doesn't this add non-portability?

Some databases want to be passed three (or more) different
parameters for the connect. Putting them all into a connect
string would add an extra parse step.
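
Something along these lines (module name, DSN and credentials all
made up, of course):

    import mydb         # hypothetical interface module

    # connect string variant:
    conn = mydb.Connect('DSN=mydb;UID=scott;PWD=tiger')

    # connection tuple variant, following the proposed convention
    # (data_source_name, user, password):
    conn = mydb.Connect(('mydb', 'scott', 'tiger'))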

> 
> > error
> >
> >      Exception raised for errors in the database module's internal
> >      processing. Other errors may also be raised. Database error
> >      should in general be made accessible through the exceptions
> >      defined for the dbi abstraction module described below.
> 
> Maybe this should be InternalError.  Is this a DBI defined error,
> or is it local to the module?  Does it subclass from the DBI.Error
> defined below?

Hmm, InternalError refers to the database, not the interface.
Maybe we should add an InterfaceError that subclasses from Error.
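
That would give a hierarchy along these lines (just a sketch; the
exact layout is still open):

    class Error(Exception):
        # base class for all interface and database errors
        pass

    class InterfaceError(Error):
        # error in the interface module itself
        pass

    class InternalError(Error):
        # internal error in the database, not the interface
        pass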

> 
> > Connection Objects
> >
> > Connection Objects should respond to the following methods:
> >
> > close()
> >
> >      Close the connection now (rather than whenever __del__ is
> >      called). The connection will be unusable from this point forward;
> >      an exception will be raised if any operation is attempted with
> >      the connection.
> >
> > commit()
> >
> >      Commit any pending transaction to the database. Note that if the
> >      database supports an auto-commit feature, this must be initially
> >      off. An interface method may be provided to turn it back on.
> >
> > rollback()
> >
> >      Roll the database back to the start of any pending
> >      transaction. Note that closing a connection without committing
> >      the changes first will cause an implicit rollback to be
> >      performed.
> 
> Why not have a begin() method?

What for? Transactions are started implicitly by the first operation
on a connection and are ended by commit() or rollback().
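
A sketch of the intended usage (module, table and credentials made
up):

    import mydb

    conn = mydb.Connect(('mydb', 'scott', 'tiger'))
    cursor = conn.cursor()
    try:
        cursor.execute("UPDATE account SET balance = balance - 100"
                       " WHERE id = 1")
        cursor.execute("UPDATE account SET balance = balance + 100"
                       " WHERE id = 2")
        conn.commit()        # make both updates permanent
    except:
        conn.rollback()      # undo everything since the last commit
    conn.close()             # closing w/o commit rolls back implicitly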

> 
> > cursor()
> >
> >      Return a new Cursor Object. An exception may be thrown if the
> >      database does not support a cursor concept.
> >
> > callproc([params])
> >
> >      Note: this method is not well-defined yet.  Call a stored
> >      database procedure with the given (optional) parameters. Returns
> >      the result of the stored procedure.
> 
> How are IN OUT and OUT parameters handled?
> How common are stored procedures across database products
> anyway?

I don't think that there is a lot of portability regarding stored
procedures. For one, the storing process itself is *very* DB
dependent (each database has its own syntax for defining
procedures), and it is not really clear how to handle the parameters
(esp. the IN OUT ones you mention).

> > all Cursor Object attributes and methods
> >
> >      For databases that do not have cursors and for simple
> >      applications that do not require the complexity of a cursor, a
> >      Connection Object should respond to each of the attributes and
> >      methods of the Cursor Object. Databases that have cursors can
> >      implement this by using an implicit, internal cursor.
> >
> > Cursor Objects
> >
> > These objects represent a database cursor, which is used to
> > manage the context of a fetch operation.
> >
> > Cursor Objects should respond to the following methods and attributes:
> >
> > arraysize
> >
> >      This read/write attribute specifies the number of rows to fetch
> >      at a time with fetchmany(). This value is also used when
> >      inserting multiple rows at a time (passing a tuple/list of
> >      tuples/lists as the params value to execute()). This attribute
> >      will default to a single row.
> >
> >      Note that the arraysize is optional and is merely provided for
> >      higher performance database interactions. Implementations should
> >      observe it with respect to the fetchmany() method, but are free
> >      to interact with the database a single row at a time.
> 
> Doesn't fetchmany accept a count?  Why is this attribute
> needed?

I guess this allows the interface to prefetch arraysize rows
in one go. I don't have any experience with it, though.
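
Something like this would let the interface do one network
round-trip per 100 rows (made-up table):

    cursor.arraysize = 100          # hint: fetch 100 rows at a time
    cursor.execute("SELECT name, phone FROM address")
    while 1:
        rows = cursor.fetchmany()   # size defaults to cursor.arraysize
        if not rows:
            break
        for row in rows:
            print row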

> 
> > description
> >
> >      This read-only attribute is a tuple of 7-tuples. Each 7-tuple
> >      contains information describing each result column: (name,
> >      type_code, display_size, internal_size, precision, scale,
> >      null_ok). This attribute will be None for operations that do not
> >      return rows or if the cursor has not had an operation invoked via
> >      the execute() method yet.
> >
> >      The type_code is equal to one of the dbi type objects specified
> >      in the section below.
> >
> >      Note: this is a bit in flux. Generally, the first two items of
> >      the 7-tuple will always be present; the others may be database
> >      specific.
> 
> This is bad.  I suppose we are stuck with this for backwards
> compatibility.

I guess so, too :(

> 
> If I were designing this interface I would have description
> be a collection object that acted as both a sequence of
> column definitions and a mapping from column name to column
> definitions.  I would have the column definitions be objects
> that have methods for standard attributes like name, and type
> (and maybe nullability, scale and precision)
> as well as optional attributes for things like display size and
> internal size.

Especially those last two don't make much sense nowadays
(they probably did back when all output had to be formatted for
80-column displays/printers).

> I suppose that with some trickery, this could be handled in a mostly
> backward compatible way, by making column-definitions sequence objects
> too.

We could change the specification to: ... is a read-only
sequence of sequences, each describing one output column ...

I don't think that any code really depends on real tuples
for description - as long as a,b,c = cursor.description[:3]
works everything should be fine.
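
E.g. a minimal sequence-like wrapper would still satisfy such code
(sketch only):

    class ColumnDesc:
        # a working __getitem__ is all that indexing needs
        def __init__(self, values):
            self.values = values
        def __getitem__(self, i):
            return self.values[i]

    desc = ColumnDesc(('id', 'NUMBER', None, None, 10, 0, 0))
    name, type_code = desc[0], desc[1]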

> > close()
> >
> >      Close the cursor now (rather than whenever __del__ is
> >      called). The cursor will be unusable from this point forward; an
> >      exception will be raised if any operation is attempted with the
> >      cursor.
> >
> > execute(operation [,params])
> >
> >      Execute (prepare) a database operation (query or
> >      command). Parameters may be provided (as a sequence
> >      (e.g. tuple/list)) and will be bound to variables in the
> >      operation. Variables are specified in a database-specific
> >      notation (some DBs use ?,?,? to indicate parameters, others
> >      :1,:2,:3) that is based on the index in the parameter tuple
> >      (position-based rather than name-based).
> 
> The format of parameter references should be standardized.
> Maybe with something more Pythonic, like:
> 
>   %s, %d, %f
> 
> This might also allow type information to be captured to
> aid in binding variables.

That won't work, since the arguments passed in the input sequence
are handled by the database, not the interface. E.g. for ODBC, the
database may define the type it wants to receive for each parameter.
As an example: the ODBC driver for MySQL always wants to be passed
strings, regardless of the "real" type.
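
So the markers stay database-specific; the input sequence itself is
the portable part (made-up table):

    # ODBC-style markers:
    cursor.execute("INSERT INTO address (name, phone) VALUES (?, ?)",
                   ('Jim', '555-1234'))

    # Oracle-style markers -- same parameter sequence, different SQL:
    cursor.execute("INSERT INTO address (name, phone) VALUES (:1, :2)",
                   ('Jim', '555-1234'))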

> >      The parameters may also be specified as a sequence of sequences
> >      (e.g. a list of tuples) to insert multiple rows in a single
> >      operation.
> 
> Does this run the insert multiple times, or does it bind
> some sorts of arrays to input parameters?  Is this
> useful enough to include in this standard?  It feels like
> alot of extra burden for DBI interface developers.

The interface can implement this as a loop inserting one row
at a time, or as an array of rows which it passes to the database
in one go. The latter is usually faster for database interfaces
that rely heavily on network communication.
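
In code (again with made-up names):

    rows = [('Jim',  '555-1234'),
            ('Marc', '555-5678')]
    # one call inserts all rows; the interface may loop internally
    # or pass the whole array to the database at once
    cursor.execute("INSERT INTO address (name, phone) VALUES (?, ?)",
                   rows)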

> >      A reference to the operation will be retained by the cursor. If
> >      the same operation object is passed in again, then the cursor can
> >      optimize its behavior. This is most effective for algorithms
> >      where the same operation is used, but different parameters are
> >      bound to it (many times).
> 
> This sounds a bit too magical to me.  Does it apply when no arguments
> are presented?

For ODBC this saves doing a prepare for every execute (well, sometimes
at least). It also applies when no arguments are present.

>  I'd rather see an explicit prepare method, preferably
> on a connection object that returns a callable object, as in:
> 
>   f=aConnection.prepare(
>       "select * from mydata where id=%d and name=%s")
>   ...
>   x=f(1, 'foo')
>   ...
>   y=f(2, 'bar')
> 

Don't know if all DBs support preparing SQL statements. You can
already do this with cursors and a little magic (using the
"keep a reference" feature).

> 
> >      For maximum efficiency when reusing an operation, it is best to
> >      use the setinputsizes() method to specify the parameter types and
> >      sizes ahead of time. It is legal for a parameter to not match the
> >      predefined information; the implementation should compensate,
> >      possibly with a loss of efficiency.
> 
> I think that this could better be handled using more pythonic
> place holders.  I don't like having to specify sizes for strings,
> since I may want to use a type like 'long var binary' that effectively
> doesn't have an upper limit.

Well, I think this was meant to have the interface pre-allocate
memory for the data transfer. I don't think it's needed anymore,
though, since most memory allocators only reserve memory rather than
actually making it available (that happens on demand, signalled by
page faults).

> 
> >      Using SQL terminology, these are the possible result values from
> >      the execute() method:
> >
> >           If the statement is DDL (e.g. CREATE TABLE), then 1 is
> >           returned.
> 
> This seems a bit arbitrary to me.

Sure is, and it's hard to implement too. Some interfaces will have
to parse the SQL statement just to be compliant with these return
values...
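
For completeness, this is how a user would consume those return
values (operation being some SQL statement string):

    result = cursor.execute(operation)
    if result is None:
        # DQL (e.g. SELECT): rows must be fetched explicitly
        rows = cursor.fetchall()
    else:
        # DDL returns 1; DML returns the number of rows affected
        print result, "row(s) affected"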

> >           If the statement is DML (e.g. UPDATE or INSERT), then the
> >           number of rows affected is returned (0 or a positive
> >           integer).
> >
> >           If the statement is DQL (e.g. SELECT), None is returned,
> >           indicating that the statement is not really complete until
> >           you use one of the fetch methods.
> >
> > fetchone()
> >
> >      Fetch the next row of a query result, returning a single tuple,
> >      or None when no more data is available.
> >
> > fetchmany([size])
> >
> >      Fetch the next set of rows of a query result, returning as a list
> >      of tuples. An empty list is returned when no more rows are
> >      available. The number of rows to fetch is specified by the
> >      parameter. If it is None, then the cursor's arraysize determines
> >      the number of rows to be fetched.
> >
> >      Note there are performance considerations involved with the size
> >      parameter. For optimal performance, it is usually best to use the
> >      arraysize attribute. If the size parameter is used, then it is
> >      best for it to retain the same value from one fetchmany() call to
> >      the next.
> >
> > fetchall()
> >
> >      Fetch all (remaining) rows of a query result, returning them as a
> >      list of tuples. Note that the cursor's arraysize attribute can
> >      affect the performance of this operation.
> 
> For the record, I've never liked this approach.  When I've done this
> sort of thing before (For Ingres and Info, sorry, I can't share the
> code, it was done while at USGS), I had selects return "result"
> objects.  Result objects encapsulated cursors and have them sequence
> behavior. As in:
> 
>   rows=aConnection.execute('select * from blech')
>   # (Note, no explicit cursor objects)
>   for row in rows:
>      ... do something with the rows
> 
> Note that the "rows" object in this example is not a tuple
> or list, but an object that lazily gets rows from the
> result as needed.
> 
> Also note that the individual rows are not tuples, but
> objects that act as a sequence of values *and* as a mapping
> from column name to value.  This lets you do something like:
> 
>   rows=aConnection.execute('select name, id from blech')
>   for row in rows:
>      print "%(name)s, %(id)s" % row
> 
> In my Ingres and Info interfaces, I also had the
> rows have attributes (e.g. aRow.name), but then it's
> hard for rows to have generic methods, like 'keys' and
> 'items'.  I also provided access to meta data for rows,
> something like:
> 
>   row.__schema__

We could change that to ... return the result set as a sequence
of sequences ... As stated above, I think the only compatibility
issue is having a,b,c = sequence work (and all that's needed is
a working __getitem__ method).
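
A row class like the following would be enough for both styles
(sketch only):

    class Row:
        # acts as a sequence of values *and* as a mapping from
        # column name to value
        def __init__(self, names, values):
            self.names = list(names)
            self.values = values
        def __getitem__(self, key):
            if type(key) == type(0):    # integer index
                return self.values[key]
            return self.values[self.names.index(key)]

    row = Row(('name', 'id'), ('foo', 42))
    print row[0], row['id']
    print "%(name)s, %(id)s" % row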

> 
> > nextset()
> >
> >      If the database supports returning multiple result sets, this
> >      method will make the cursor skip to the next available set. If
> >      there are no more sets, the method returns None. Otherwise, it
> >      returns 1 and subsequent calls to the fetch methods will return
> >      rows from the next result set. Database interface modules that
> >      don't support this feature should always return None.
> 
> This feels a bit cumbersome to me.  What happens if you need
> to iterate over multiple results simultaneously?  I'd rather
> see an object for each result set and return a tuple of result
> sets if there are more than one.

Sorry, can't really comment on that one, since I have no experience
with it. Only one thing: introducing too many different new
types will put a real strain on interface implementors. We should
avoid that.

> 
> > setinputsizes(sizes)
> >
> >      Note: this method is not well-defined yet.  This can be used
> >      before a call to execute() to predefine memory areas for the
> >      operation's parameters. sizes is specified as a tuple -- one item
> >      for each input parameter. The item should be a Type object that
> >      corresponds to the input that will be used, or it should be an
> >      integer specifying the maximum length of a string parameter. If
> >      the item is None, then no predefined memory area will be reserved
> >      for that column (this is useful to avoid predefined areas for
> >      large inputs).
> >
> >      This method would be used before the execute() method is invoked.
> >
> >      Note that this method is optional and is merely provided for
> >      higher performance database interaction. Implementations are free
> >      to do nothing and users are free to not use it.
> 
> See above.
> 
> > setoutputsize(size [,col])
> >
> >      Note: this method is not well-defined yet.  Set a column buffer
> >      size for fetches of large columns (e.g. LONG). The column is
> >      specified as an index into the result tuple. Using a column of
> >      None will set the default size for all large columns in the
> >      cursor.
> >
> >      This method would be used before the execute() method is invoked.
> >
> >      Note that this method is optional and is merely provided for
> >      higher performance database interaction. Implementations are free
> >      to do nothing and users are free to not use it.
> 
> In the case of LONG columns, how is someone supposed to know the maximum
> size ahead of time?  Does anyone really want this?

Not me... (see above)

> 
> > DBI Helper Objects and Exceptions
> >
> > Many databases need to have the input in a particular format for
> > binding to an operation's input parameters. For example, if an input
> > is destined for a DATE column, then it must be bound to the database
> > in a particular string format. Similar problems exist for "Row ID"
> > columns or large binary items (e.g. blobs or RAW columns). This
> > presents problems for Python since the parameters to the execute()
> > method are untyped.
> 
> They don't have to be. See above.
> 
> > When the database module sees a Python string
> > object, it doesn't know if it should be bound as a simple CHAR column,
> > as a raw binary item, or as a DATE.
> >
> > To overcome this problem, the dbi interface module was created. This
> > module, which every database module must provide, specifies some basic
> > database interface types for working with databases. There are two
> > classes: dbiDate and dbiRaw. These are simple container classes that
> > wrap up a value. When passed to the database modules, the module can
> > then detect that the input parameter is intended as a DATE or a
> > RAW.
> 
> I suggest doing away with these through use of parameters like
> %r for raw and %t for date time, or whatever.

Not sure how you'd implement this without parsing the SQL string.

> > For symmetry, the database modules will return DATE and RAW
> > columns as instances of these classes.
> 
> I'd rather see strings come back for RAW and "Date" objects
> come back for dates.  I'd prefer to see the Date type be pluggable.

Of course, I'd like to see everybody use the DateTime[Delta]
types ;-) After all, they were created for exactly this reason... but
agreeing on a common interface should be enough.
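
For reference, this is how the dbi wrappers look in use (made-up
table):

    import dbi, time

    data = open('photo.jpg', 'rb').read()
    cursor.execute("INSERT INTO log (stamp, image) VALUES (?, ?)",
                   (dbi.dbiDate(time.time()),   # bind as DATE
                    dbi.dbiRaw(data)))          # bind as RAW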

> > A Cursor Object's description attribute returns information about each
> > of the result columns of a query. The type_code is defined to be equal
> > to one of five types exported by this module: STRING, RAW, NUMBER,
> > DATE, or ROWID.
> 
> There needs to be a distinction between ints and floats.

IMHO, we should simply use the standard Python types the way
they are meant: numbers with a fractional part (non-zero scale)
become floats, ones without become integers. Very long integers
(BIGINT) are output as longs.

> 
> > Note: The values returned in the description tuple must not
> > necessarily be the same as the defined types, i.e. while coltype ==
> > STRING will always work, coltype is STRING may fail.
> 
> Why?

Simple: to allow the interface implementor to use some extra
magic. E.g. the mxODBC module will give you a much more
fine-grained coltype description than the one defined by the dbi
module, so multiple coltype values will have to compare equal
to, say, NUMBER (floats, integers, decimals, ...).
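
The usual trick is a type object with a custom compare, roughly:

    class DBIType:
        # compares equal to any of the wrapped type codes
        def __init__(self, *values):
            self.values = values
        def __cmp__(self, other):
            if other in self.values:
                return 0                # equal
            return 1                    # not equal

    NUMBER = DBIType('FLOAT', 'INTEGER', 'DECIMAL')

    print 'FLOAT' == NUMBER     # true: compares equal to NUMBER
    print 'FLOAT' is NUMBER     # false: but is not the same object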

> 
> > The module exports the following functions and names:
> >
> > ...
> 
> If dbi exports a C API, it should do so through a Python
> CObject.  This should avoid weird linking problems (that
> the oracledb and ctsybase modules have).

Good idea.

-- 
Marc-Andre Lemburg
----------------------------------------------------------------------
             | Python Pages:  http://starship.skyport.net/~lemburg/  |
              -------------------------------------------------------