[DB-SIG] annotated 1.1 spec / feedback

M.-A. Lemburg mal@lemburg.com
Sat, 13 Mar 1999 21:56:28 +0100


Greg Stein wrote:
> 
> M.-A. Lemburg wrote:
> > > * the parameters should be positional, rather than keyword-based.
> > > keywords can be optional, but we should not require the params to be
> > > keyword-based.
> >
> > But keywords are so much simpler to maintain: the order doesn't
> > matter, only the wording. They also serve as nice way to document
> > what the parameter stands for, e.g.
> >
> > connect('degas','miro','kline')
> >
> > isn't too easy to comprehend without first looking at the interface's
> > docs, yet
> >
> > connect(host='degas',user='miro',password='kline')
> >
> > is.
> >
> > If you fear the programming overhead, I can post a few C macros
> > that make dealing with keywords in C really simple.
> 
> I believe we can only require keywords if we prepare an appendix to the
> specification. As a module implementor, I do not do keywords... I always
> use positional because they are much simpler for C-implemented modules.
> 
> If we do not have a simple reference, then I believe we cannot require
> keywords. Even then, I am hesitant to do so.
> 
> [and yes, I fully understand their benefit for the user of the module;
> let the user fix it in Python if it bugs them]

Actually, they are just as simple to implement as positional
parameters. Here is some code from an extension I wrote:

Py_C_Function_WithKeywords(
    mxBeeIndex_BeeStringIndex,
    "BeeStringIndex(filename,keysize,dupkeys=0,filemode=0,sectorsize=512)\n\n"
    )
{
    char *filename;
    int keysize;
    int sectorsize = 512;
    int dupkeys = 0;
    int filemode = 0;
    static char *keywords[] = {"filename","keysize","dupkeys",
                               "filemode","sectorsize",NULL};

    Py_KeywordsGet5Args(keywords,"si|iii",
                        filename,keysize,dupkeys,filemode,sectorsize);

    return (PyObject *)mxBeeIndex_New(filename,filemode,
				      keysize+1,sectorsize,
				      mxBeeIndex_CompareStrings,
				      mxBeeIndex_StringFromKey,
				      mxBeeIndex_KeyFromString,
				      dupkeys);
 onError:
    return NULL;
}

...plus:

static 
PyMethodDef Module_methods[] =
{   
    Py_MethodWithKeywordsListEntry("BeeStringIndex",mxBeeIndex_BeeStringIndex),
    Py_MethodWithKeywordsListEntry("BeeIntegerIndex",mxBeeIndex_BeeIntegerIndex),
    Py_MethodWithKeywordsListEntry("BeeFloatIndex",mxBeeIndex_BeeFloatIndex),
    {NULL,NULL} /* end of list */
};

The macros used are:

#define Py_KeywordsGet5Args(keywords,format,a1,a2,a3,a4,a5) {if (!PyArg_ParseTupleAndKeywords(args,kw,format,keywords,&a1,&a2,&a3,&a4,&a5)) goto onError;}
#define Py_C_Function_WithKeywords(fct,docstr) \
        static char fct##_docstring[] = docstr;\
        static PyObject *fct(PyObject *self, PyObject *args, PyObject *kw)
#define Py_MethodWithKeywordsListEntry(pyname,cname) {pyname,(PyCFunction)cname,METH_VARARGS | METH_KEYWORDS,cname##_docstring}


It's really no big deal...
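And Greg's "let the user fix it in Python" route is just as cheap on the
Python side. A thin wrapper could turn a positional-only connect() into a
keyword-friendly one; the parameter names and the stand-in connect below are
illustrative, not taken from any real module:

```python
def keywordify(positional_connect):
    """Wrap a positional-only connect(host, user, password) so that
    callers can use keyword arguments.  The parameter names here are
    illustrative, not mandated by any spec."""
    def connect(host=None, user=None, password=None):
        return positional_connect(host, user, password)
    return connect

# Stand-in for a C module's positional-only connect():
def _positional_connect(host, user, password):
    return (host, user, password)

connect = keywordify(_positional_connect)
```

With that in place, `connect(host='degas', user='miro', password='kline')`
works regardless of what the underlying C function accepts.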

> >...
> > > * maybe we should discuss punting dbi. module distribution may be easier
> > > without it. While it is handy for clients that use more than one DB
> > > module,
> >
> > Actually, it makes things harder for those cases (see above). mxODBC
> > supports opening connections to many different databases and each
> > has its own version of the dbi module because each subpackage uses
> > its own set of exception classes.
> 
> If we tossed the dbi module, then each database would incorporate its
> dbi into the module itself. I don't understand what the problem is here.
> 
> I'm definitely leaning towards axing dbi.

Me too. So the problem is gone anyway...

> > > I can easily envision a python module that exports the
> > > appropriate set for the DB being used.
> >
> > Right: stick it all into the database interface package.
> 
> I'd like to see:
> 
> import mysqldb
> ...
>   except mysqldb.IntegrityError:
> ...
>   if curs.description[0][1] == mysqldb.STRING:
> ...

Right.
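With dbi folded in, a module would simply define its exception hierarchy and
type codes at module level. A minimal sketch (the exception names follow the
1.1 draft; the value of the STRING type code is purely illustrative):

```python
# Sketch of a DB module with dbi folded in.  Exception names follow
# the 1.1 draft spec; the STRING value is illustrative only.
class Error(Exception):
    """Base class for all errors this module raises."""

class IntegrityError(Error):
    """Raised when a constraint (e.g. a unique key) is violated."""

STRING = 1   # illustrative type code for character columns
```

Client code then reads exactly as desired: catch `IntegrityError` from the
module, and compare the type-code element of a description entry against
the module-level `STRING` constant.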
 
> > > * cursor.rowcount is new, and is extraneous. The result of execute()
> > > should be used.
> >
> > The execute() method returns None in case there are rows available.
> > It does not return the number of rows in the result set.
> 
> I am not sure that all databases can provide you the rowcount after an
> execute(). You may need to fetch them first. Therefore, the .rowcount
> couldn't even be implemented. This means that the only real way to know
> how many rows there are is to fetch them -- rowcount could not be relied
> upon.

Ehm, is that a problem? If the database interface cannot determine
the exact row count, it should set it to -1 (I'll add a note about this).
Note that the 1.0 spec didn't give you the rowcount information in
any way for simple SELECT statements (only for DML statements).

> > > * I think we should clean up the description attribute if possible. The
> > > 7-tuple is/was very Oracle-specific.
> >
> > Ok, things like display_size and internal_size are not really
> > relevant, but why break existing code expecting these 7 entries (even
> > if it probably does not use all of them) ?
> 
> I believe we should alter the specification, but that databases can
> continue to return an appropriate number of fields.
> 
> For example: can we require two fields -- name and type. Beyond that, it
> is up to the database module. Implementors can choose to continue with
> the 7-tuple or bring it down to what they really support.

Hmm, but that would introduce database interface dependency: the
module can always provide some sensible default values for the
missing items, but the size and ordering of the tuple should not change
from one DB interface to another.
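Providing such defaults is cheap: a module that only really knows name and
type can pad the remaining five fields with None and keep the 7-tuple shape
stable across interfaces. A sketch (helper name is mine, not spec text):

```python
def make_description(columns):
    """Build a DB-API style cursor.description from (name, type_code)
    pairs, padding the five fields the module cannot supply
    (display_size, internal_size, precision, scale, null_ok) with None
    so the tuple shape stays fixed across DB interfaces."""
    return tuple((name, type_code, None, None, None, None, None)
                 for name, type_code in columns)
```

A consumer can then always unpack seven items, whether or not the backend
knows the sizes.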

> > > * The old description for execute() returning None was clearer. Note
> > > that a SELECT can return no columns/rows (where your description says >
> > > 0).
> >
> > The None return value is to indicate that data is available
> > to be fetched via the fetchXXX() methods. If no rows were produced,
> > something else should be returned, e.g. 1.
> 
> I do not believe that a database can always tell that rows were or were
> not produced. You may need to fetch them first. Therefore, returning 1
> is not possible for certain modules.
> 
> Okay. I just looked at DCOracle. The execute() method does just that...
> an execute. No fetching, so no way to tell what the row count is.
> 
> Admittedly, there may be something in the OCI interface, or in the V8
> Oracle interfaces to do so. Unless someone can point out that the
> Oracle interfaces support knowing the row count at execute() time, then
> we cannot rely on your .rowcount and return-value changes.
> 
> Even so, I do not like seeing a SELECT returning "1" *ever*. That just
> doesn't fit right.

How about this: the return value of .execute remains undefined.
If the database module can figure out how many rows were affected or
are available via fetchXXX() it sets the .rowcount attribute accordingly.
Otherwise, .rowcount is set to -1.

That way we clear up the specification and still provide a standard
way to report row count information.
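Client code under that definition treats -1 as "unknown" and falls back to
fetching. In sketch form (FakeCursor stands in for a driver that cannot
report a count at execute() time):

```python
class FakeCursor:
    # Stand-in for a cursor whose driver cannot determine the row
    # count at execute() time, per the relaxed .rowcount definition.
    rowcount = -1
    def fetchall(self):
        return [(1,), (2,)]

def count_rows(cursor):
    """Return the number of result rows, fetching them if the
    driver could not report a count (.rowcount == -1)."""
    if cursor.rowcount != -1:
        return cursor.rowcount
    return len(cursor.fetchall())   # unknown: count after fetching
```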

> > The reason I changed the description was that the previous one
> > wasn't easy to implement: it is not clear whether a statement
> > is DDL, DML or DQL without parsing it...
> 
> Untrue. It is easy to implement -- you can always tell from the first
> word in all cases.

I'm not sure I follow you there: the SQL dialects allow for
many different ways to express DDL, DML or DQL, each with its
own set of extensions...

Apart from that, I don't see the point of being able to
differentiate between those three categories: as far as I can
tell it does not gain you any knowledge, since you normally
know in advance what kind of statement you are executing.

From the programmer's view it is definitely easier to implement
the spec 1.1 return value. I have been in exactly this situation
when I implemented mxODBC: there is just no clean way of telling
DDL, DML and DQL apart because each database has its own set
of extensions and even the ODBC levels provide different sets
of SQL dialects.
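To make the fragility of first-word classification concrete, here is the
naive version of Greg's suggestion; the classifier itself and the category
labels are only a sketch:

```python
def statement_kind(sql):
    """Naive classification by first word, as suggested.  Vendor
    extensions, leading comments and ODBC escape clauses like
    '{call proc}' all defeat it."""
    word = sql.split(None, 1)[0].upper()
    if word == 'SELECT':
        return 'DQL'
    if word in ('INSERT', 'UPDATE', 'DELETE'):
        return 'DML'
    return 'DDL'

# A leading comment already misclassifies a query:
#   statement_kind("/* hint */ SELECT 1")  ->  'DDL'
```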
 
> > > * in general, I think the 1.1 change to the return value description
> > > didn't really clear things up. Basing the descriptions on returning
> > > columns/rows doesn't make much sense to me. Instead, I liked the
> > > description based on the *type* of statement that was performed. Under
> > > this style, a SELECT that doesn't produce columns/rows would return 1,
> > > but that is just wrong.
> >
> > See above. Without parsing there isn't much you can do as module
> > implementor to get the return value right. Maybe we should drop the
> > return value altogether: rowcount gives the number of rows in the
> > result set (even 0 for SELECTs that didn't produce any rows) or the
> > number of affected rows.
> 
> As I mentioned, I think the rowcount is not possible for all databases
> that use a pure execute/fetch model like Oracle.

With the relaxed definition of .rowcount there should be no
problem.

> > > * execute()'s description should change back to the terms "sequence" and
> > > "sequence of sequences" for the parameters. There is no reason to force
> > > them to be tuples or lists.
> >
> > I thought about this: there is no way for the method to tell whether
> > it was passed a sequence of sequences that is meant to be a
> > sequence of parameter sequences or just a single sequence with
> > sequence entries, e.g. ('abc',) would break under that definition.
> >
> > Is there really such a big need to be able to pass *any* sequence
> > as a parameter?
> 
> Yes.
> 
> What if you have a very large dataset which is encapsulated within a
> class instance? Let's say it responds properly to the sequence behavior,
> specifically so that it can be inserted without having to construct a
> big ol' list to do so. I also can easily see rows that are identified
> by class instances or other builtin types, rather than always being
> tuples or lists.
> 
> It certainly doesn't hurt the implementor to use PySequence_GetItem()
> rather than PyTuple_GetItem().
> 
> An alternative would be an "insert" method that specifically takes a
> sequences of rows (where each row is a sequence). This would eliminate
> the confusion between a sequence-of-sequence being a single row or a set
> of rows.
> 
> I believe the INSERT is the only case where a sequence input makes
> sense.

I think just about any DML statement fits the sequence of
sequences category, not just INSERT...

I'm not opposing the sequence argument, it's only that the
module implementor has no chance of telling whether he's seeing a
sequence of sequences or a plain old sequence of strings...

How about adding a .executemany() API that is intended for
sequence-of-sequences parameters only? (Or some other way of
stating that the parameters sequence is in fact a sequence of sequences.)
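The interface shape I have in mind looks roughly like this; the cursor
class is a stand-in and the parameter style is illustrative:

```python
class SketchCursor:
    """Stand-in showing the proposed interface shape only."""

    def execute(self, operation, parameters=()):
        # parameters is exactly ONE row, so ('abc',) is unambiguously
        # a single parameter sequence here.
        self.last = (operation, [tuple(parameters)])

    def executemany(self, operation, seq_of_parameters):
        # Each element is itself a parameter sequence -- no guessing
        # whether the caller meant one row or many.
        self.last = (operation, [tuple(p) for p in seq_of_parameters])
```

The ambiguity disappears because the caller, not the module, states which
level of nesting is meant.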

> > > * the part about cursor movement in fetchone() should be totally
> > > removed. fetchone() is about the CLIENT fetching a single row. It should
> > > NOT affect how the underlying module performs its fetches.
> >
> > That renders cursors useless... are you sure you want this?
> 
> Absolutely. Positively. There is no way that we can or should require
> single-row fetching.
> 
> Look at DCOracle. It uses the API nearly to its full extent (it doesn't
> use input or output sizes to optimize its binding, though). Requiring
> fetchone() to disable the array fetching would be a mistake, let alone
> the fact that when you execute() the array-fetch there is no way to know
> whether the person will be using fetchone or not!

So cursors are deemed non-portable... given how greatly cursor
implementations differ from DB to DB, that is probably a
defensible approach.

I'll remove the note from .fetchone().
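For the record, fetchone() and array fetching are not in conflict: the
module can buffer an array fetch internally and hand out single rows.
A sketch with a faked backend:

```python
class ArrayFetchCursor:
    """Sketch: fetchone() on top of an internal array fetch.
    The 'backend' is faked with a plain list."""

    arraysize = 3                       # rows pulled per round trip

    def __init__(self, rows):
        self._backend = list(rows)      # stand-in for the DB's buffer
        self._cache = []

    def _fetch_batch(self):
        batch = self._backend[:self.arraysize]
        self._backend = self._backend[self.arraysize:]
        return batch

    def fetchone(self):
        # The client sees single rows; the module still array-fetches.
        if not self._cache:
            self._cache = self._fetch_batch()
        return self._cache.pop(0) if self._cache else None
```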

> > > * the fetch* methods should say they return a sequence or a sequence of
> > > sequences. We should not force them to be tuples or lists of tuples.
> > > Some DB modules may want to have additional semantics or capabilities on
> > > their results. As long as it obeys sequence behavior, then the spec
> > > should be happy. The arraysize attribute's description also refers to a
> > > list of tuples, and should be updated.
> >
> > Just thought it would be a good idea to keep execute() parameter
> > input and fetchXXX() output in sync.
> 
> Right. And I think they should all sync up to be a
> sequence-of-sequences. I can see an implementor wanting to create a
> high-performance binding against a database that serves up values
> directly from the output buffers rather than jamming them into lists of
> tuples (with the requisite allocations).
> 
> Jim Fulton has argued for a long while that the module should be able to
> return a highly functional result set. As long as it obeys the
> sequence-of-sequence behavior, then he could now do so, yet still
> "follow the spec".

If we find a suitable way to implement this, fine. Just
stating that the parameters could be a sequence of sequences
is not enough though.
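As a data point, a "highly functional" row that still obeys plain sequence
behavior is easy to build; this sketch (my own, not from the spec) subclasses
tuple and adds attribute access by column name:

```python
class Row(tuple):
    """A result row that is a real sequence (a tuple) but also
    exposes columns as attributes by name."""

    def __new__(cls, names, values):
        self = tuple.__new__(cls, values)
        self._names = list(names)
        return self

    def __getattr__(self, name):
        # Only called when normal lookup fails, so indexing by
        # column name is safe here.
        try:
            return self[self._names.index(name)]
        except ValueError:
            raise AttributeError(name)
```

Anything iterating over it as a sequence-of-sequences keeps working, while
richer clients get `row.name` for free.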

The updated spec (1.1a8) is at:

	http://starship.skyport.net/~lemburg/DatabaseAPI-1.1.html

I've folded dbi into the module interface and added constructors
Date, Time and Timestamp (as well as corresponding type codes).
Note that only the constructor names are defined -- not whether
the result from calling these specifies a certain type or instance.
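One plausible reading, purely as an assumption on my part since the draft
fixes only the constructor names: have the constructors return ticks, in
the style of the old dbi module.

```python
import time

def Date(year, month, day):
    # ASSUMPTION: return Unix ticks; the 1.1 draft only fixes the
    # constructor NAMES, not the type of the result.
    return time.mktime((year, month, day, 0, 0, 0, 0, 0, -1))

def Timestamp(year, month, day, hour, minute, second):
    # Same assumption as Date() above.
    return time.mktime((year, month, day, hour, minute, second, 0, 0, -1))
```

A module preferring a richer date type could return instances instead, and
conforming client code would be none the wiser, which is exactly the point
of defining only the names.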

Hmm. Maybe it'll be 2.0 after all... punting dbi, new
interfaces, different semantics...

-- 
Marc-Andre Lemburg                               Y2000: 293 days left
---------------------------------------------------------------------
          : Python Pages >>> http://starship.skyport.net/~lemburg/  :
           ---------------------------------------------------------