[Python-Dev] argument parsing (was: just say no...)

Fri, 12 Nov 1999 16:49:34 -0800 (PST)

On Sat, 13 Nov 1999, Mark Hammond wrote:
>...
> Im inclined to agree that holding 2 internal buffers for the unicode
> object is not ideal.  However, I _am_ concerned with getting decent
> PyArg_ParseTuple and Py_BuildValue support, and if the cost is an
> extra buffer I will survive.  So lets look for solutions that dont
> require it, rather than holding it up as evil when no other solution
> is obvious.

I believe Py_BuildValue is pretty straightforward. Simply state that it is
allowed to perform conversions and place the converted object into the
resulting tuple (with appropriate refcounting).

In other words:

  tuple = Py_BuildValue("U", stringOb);

The stringOb will be converted to a Unicode object. The new Unicode object
will go into the tuple (with the tuple holding the only reference!). The
stringOb will NOT acquire any additional references.

[ "U" format may be wrong; it is here for example purposes ]
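
To make the refcount claim concrete, here is a rough sketch of the ownership
rule. It uses a toy refcounted struct in place of PyObject so it compiles
without the Python headers; Obj, obj_new, obj_decref and build_value_U are
all made-up names for illustration, not real API.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for PyObject (hypothetical, not the CPython API). */
typedef enum { KIND_STRING, KIND_UNICODE, KIND_TUPLE } Kind;

typedef struct Obj {
    int refcnt;
    Kind kind;
    struct Obj *item;           /* used only by one-element tuples */
} Obj;

static Obj *obj_new(Kind kind)
{
    Obj *o = calloc(1, sizeof *o);
    if (o != NULL) {
        o->refcnt = 1;          /* the creator owns the first reference */
        o->kind = kind;
    }
    return o;
}

static void obj_decref(Obj *o)
{
    assert(o->refcnt > 0);
    if (--o->refcnt == 0) {
        if (o->item != NULL)
            obj_decref(o->item);    /* a tuple releases what it holds */
        free(o);
    }
}

/* Models Py_BuildValue("U", s): convert s, then store the converted
 * object in a new tuple. Returns a new tuple reference, or NULL. */
static Obj *build_value_U(Obj *s)
{
    Obj *u, *tuple;

    assert(s->kind == KIND_STRING);
    u = obj_new(KIND_UNICODE);      /* the conversion yields a NEW object */
    if (u == NULL)
        return NULL;
    tuple = obj_new(KIND_TUPLE);
    if (tuple == NULL) {
        obj_decref(u);
        return NULL;
    }
    tuple->item = u;    /* the tuple takes over the only reference to u */
    return tuple;       /* s was never INCREF'd */
}
```

Dropping the tuple frees the converted object with it; the original string
is untouched throughout.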

Okay... now the PyArg_ParseTuple() is the *real* kicker.

>...
> Prob1:
>   name = SomeComObject.GetFileName() # A Unicode object
>   f = open(name)
> Prob2:
>   SomeComObject.SetFileName("foo.txt")

Both of these issues are due to PyArg_ParseTuple. In Prob1, you want a
string-like object which can be passed to the OS as an 8-bit string. In
Prob2, you want a string-like object which can be passed to the OS as a
Unicode string.

I see three options for PyArg_ParseTuple:

1) allow it to return NEW objects which must be DECREF'd.
   [ current policy only loans out references ]

   This option could be difficult in the presence of errors during the
   parse. For example, the current idiom is:

     if (!PyArg_ParseTuple(args, "..."))
        return NULL;

   If an object was produced, but then a later argument caused a failure,
   then who is responsible for freeing the object?

2) like option 1, but PyArg_ParseTuple is smart enough to NOT return any new
   objects when an error occurred.

   This basically answers the last question in option (1) -- ParseTuple is
   responsible.

3) Return loaned-out-references to objects which have been tested for
   convertibility. Helper functions perform the conversion and return a
   new reference, which the caller will then free.
   [ this is the model used in PyWin32 ]

   Code in PyWin32 typically looks like:

     if (!PyArg_ParseTuple(args, "O", &ob))
       return NULL;
     if ((unicodeOb = GiveMeUnicode(ob)) == NULL)
       return NULL;
     ...
     Py_DECREF(unicodeOb);

   [ GiveMeUnicode is descriptive here; I forget the name used in PyWin32 ]

   In a "real" situation, the ParseTuple format would be "U" and the
   object would be type-tested for PyStringType or PyUnicodeType.

   Note that GiveMeUnicode() would also do a type-test, but it can't
   produce a *specific* error like ParseTuple (e.g. "string/unicode object
   expected" vs "parameter 3 must be a string/unicode object")
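
A rough sketch of the option (3) flow, again with a toy refcounted struct
rather than real PyObjects so that it stands alone. give_me_unicode and
set_file_name are placeholder names, and the argument is passed in directly
where real code would get a borrowed reference from ParseTuple.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for PyObject (hypothetical, not the CPython API). */
typedef enum { KIND_STRING, KIND_UNICODE, KIND_INT } Kind;
typedef struct { int refcnt; Kind kind; } Obj;

static int live_objects = 0;    /* lets a caller check for leaks */

static Obj *obj_new(Kind kind)
{
    Obj *o = malloc(sizeof *o);
    if (o == NULL)
        return NULL;
    o->refcnt = 1;
    o->kind = kind;
    live_objects++;
    return o;
}

static void obj_decref(Obj *o)
{
    assert(o->refcnt > 0);
    if (--o->refcnt == 0) {
        live_objects--;
        free(o);
    }
}

/* Placeholder for GiveMeUnicode(): always returns a NEW reference,
 * or NULL if the object is neither a string nor a Unicode object. */
static Obj *give_me_unicode(Obj *ob)
{
    if (ob->kind == KIND_UNICODE) {
        ob->refcnt++;                   /* already Unicode: share it */
        return ob;
    }
    if (ob->kind == KIND_STRING)
        return obj_new(KIND_UNICODE);   /* converted copy */
    return NULL;    /* can only say "string/unicode object expected" */
}

/* Models an extension function body. arg is a loaned reference, as
 * ParseTuple would hand out. Returns 0 on success, -1 on error. */
static int set_file_name(Obj *arg)
{
    Obj *unicode_ob = give_me_unicode(arg);

    if (unicode_ob == NULL)
        return -1;
    /* ... pass unicode_ob to the OS here ... */
    obj_decref(unicode_ob);     /* the caller frees what the helper made */
    return 0;
}
```

The loaned argument keeps its refcount either way; only the helper's result
is ever owned by the caller.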

Are there more options? Anybody?

All three of these avoid the secondary buffer. The last is cleanest w.r.t.
keeping the existing "loaned references" behavior, but can get a bit
wordy when you need to convert a bunch of string arguments.

Option (2) adds a good amount of complexity to PyArg_ParseTuple -- it
would need to keep a "free list" in case an error occurred.
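
To give a feel for that "free list", here is a minimal sketch of a parser
that converts every argument and unwinds on failure. It is a toy model
(made-up Obj type, made-up parse_all_unicode, fixed MAX_ARGS limit), not a
proposal for the real ParseTuple internals.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for PyObject (hypothetical, not the CPython API). */
typedef enum { KIND_STRING, KIND_UNICODE, KIND_INT } Kind;
typedef struct { int refcnt; Kind kind; } Obj;

#define MAX_ARGS 8              /* arbitrary limit for the sketch */

static int live_objects = 0;    /* lets a caller check for leaks */

static Obj *obj_new(Kind kind)
{
    Obj *o = malloc(sizeof *o);
    if (o == NULL)
        return NULL;
    o->refcnt = 1;
    o->kind = kind;
    live_objects++;
    return o;
}

static void obj_decref(Obj *o)
{
    assert(o->refcnt > 0);
    if (--o->refcnt == 0) {
        live_objects--;
        free(o);
    }
}

/* Returns a NEW Unicode reference, or NULL if ob is not convertible. */
static Obj *give_me_unicode(Obj *ob)
{
    if (ob->kind == KIND_UNICODE) {
        ob->refcnt++;
        return ob;
    }
    if (ob->kind == KIND_STRING)
        return obj_new(KIND_UNICODE);
    return NULL;
}

/* Models option (2) for a format of all "U" codes. On success each
 * out[i] holds a NEW reference and 1 is returned. On failure every
 * object created so far is released, out is untouched, 0 is returned. */
static int parse_all_unicode(Obj **args, int nargs, Obj **out)
{
    Obj *created[MAX_ARGS];     /* the "free list" */
    int ncreated = 0, i;

    if (nargs > MAX_ARGS)
        return 0;
    for (i = 0; i < nargs; i++) {
        Obj *u = give_me_unicode(args[i]);
        if (u == NULL) {
            while (ncreated > 0)
                obj_decref(created[--ncreated]);
            return 0;
        }
        created[ncreated++] = u;
    }
    for (i = 0; i < nargs; i++)
        out[i] = created[i];    /* publish only after full success */
    return 1;
}
```

The caller's error path stays the plain "return NULL" idiom, at the price of
the bookkeeping inside the parser.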

Option (1) adds DECREF logic to callers to ensure they clean up. The
additional logic isn't much more than in the other two options (the only change is
adding DECREFs before returning NULL from the "if (!PyArg_ParseTuple..."
condition). Note that the caller would probably need to initialize each
object to NULL before calling ParseTuple.
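
Roughly, the caller-side cleanup for option (1) might look like the sketch
below. Same caveat as before: this is a toy model with made-up names
(parse_two_unicode, rename_file, obj_xdecref), only meant to show why the
NULL initialization matters.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for PyObject (hypothetical, not the CPython API). */
typedef enum { KIND_STRING, KIND_UNICODE, KIND_INT } Kind;
typedef struct { int refcnt; Kind kind; } Obj;

static int live_objects = 0;    /* lets a caller check for leaks */

static Obj *obj_new(Kind kind)
{
    Obj *o = malloc(sizeof *o);
    if (o == NULL)
        return NULL;
    o->refcnt = 1;
    o->kind = kind;
    live_objects++;
    return o;
}

static void obj_decref(Obj *o)
{
    assert(o->refcnt > 0);
    if (--o->refcnt == 0) {
        live_objects--;
        free(o);
    }
}

/* NULL-tolerant release, in the spirit of Py_XDECREF. */
static void obj_xdecref(Obj *o)
{
    if (o != NULL)
        obj_decref(o);
}

/* Returns a NEW Unicode reference, or NULL if ob is not convertible. */
static Obj *give_me_unicode(Obj *ob)
{
    if (ob->kind == KIND_UNICODE) {
        ob->refcnt++;
        return ob;
    }
    if (ob->kind == KIND_STRING)
        return obj_new(KIND_UNICODE);
    return NULL;
}

/* Models option (1): stores NEW references as it goes and does NOT
 * clean up after itself when a later argument fails. */
static int parse_two_unicode(Obj **args, Obj **a, Obj **b)
{
    if ((*a = give_me_unicode(args[0])) == NULL)
        return 0;
    if ((*b = give_me_unicode(args[1])) == NULL)
        return 0;               /* *a is still owned by the caller */
    return 1;
}

/* Caller under option (1). Returns 0 on success, -1 on error. */
static int rename_file(Obj **args)
{
    Obj *from = NULL, *to = NULL;   /* must start out NULL */

    if (!parse_two_unicode(args, &from, &to)) {
        obj_xdecref(from);      /* release whatever was produced */
        obj_xdecref(to);
        return -1;
    }
    /* ... use from and to here ... */
    obj_decref(from);
    obj_decref(to);
    return 0;
}
```

Without the NULL initialization, the error path would have no way to tell a
produced object from garbage on the stack.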

Personally, I prefer (3) as it makes it very clear that a new object has
been created and must be DECREF'd at some point. Note that GiveMeUnicode()
could also accept a second argument for the type of decoding to do (or
NULL meaning "UTF-8").
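
As an illustration of that second argument, here is a small sketch that
works on a plain C string instead of an object and decodes into an array of
code points. The signature, the "latin-1" name and the output buffer are
assumptions of mine for the example. The UTF-8 decoder only checks lead and
continuation bytes; it does not reject overlong forms or surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Each byte maps directly to one code point. Returns the number of
 * code points written, or -1 if out is too small. */
static int decode_latin1(const unsigned char *src, size_t len,
                         unsigned long *out, size_t cap)
{
    size_t i;

    if (len > cap)
        return -1;
    for (i = 0; i < len; i++)
        out[i] = src[i];
    return (int)len;
}

/* Minimal UTF-8 decoder for 1- to 4-byte sequences. Returns the number
 * of code points written, or -1 on malformed input or a full buffer. */
static int decode_utf8(const unsigned char *src, size_t len,
                       unsigned long *out, size_t cap)
{
    size_t i = 0, n = 0;

    while (i < len) {
        unsigned char c = src[i++];
        unsigned long cp;
        int extra;

        if (c < 0x80)                { cp = c;        extra = 0; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
        else
            return -1;          /* stray continuation or bad lead byte */
        while (extra-- > 0) {
            if (i >= len || (src[i] & 0xC0) != 0x80)
                return -1;      /* truncated or malformed sequence */
            cp = (cp << 6) | (src[i++] & 0x3F);
        }
        if (n >= cap)
            return -1;
        out[n++] = cp;
    }
    return (int)n;
}

/* Hypothetical two-argument GiveMeUnicode(): encoding may be NULL,
 * which means "UTF-8". Unknown encodings are an error (-1). */
static int give_me_unicode(const char *bytes, const char *encoding,
                           unsigned long *out, size_t cap)
{
    const unsigned char *src = (const unsigned char *)bytes;
    size_t len = strlen(bytes);

    if (encoding == NULL || strcmp(encoding, "UTF-8") == 0)
        return decode_utf8(src, len, out, cap);
    if (strcmp(encoding, "latin-1") == 0)
        return decode_latin1(src, len, out, cap);
    return -1;
}
```

The same two bytes come out as one code point under the NULL default and as
two under "latin-1", which is exactly why the caller needs a way to say
which decoding it wants.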

Oh: note there are equivalents of all options for going from
unicode-to-string; the above is all about string-to-unicode. However, the
tricky part of unicode-to-string is determining whether backwards
compatibility will be a requirement. i.e. does existing code that uses the
"t" format suddenly achieve the capability to accept a Unicode object?
This obviously causes problems in all three options: since a new reference
must be created to handle the situation, then who DECREF's it? The old
code certainly doesn't.
[ <IMO> I'm with Fredrik in saying "no, old code *doesn't* suddenly get
  the ability to accept a Unicode object." The Python code must use str() to
  do the encoding manually (until the old code is upgraded to one of the
  above three options). </IMO> ]

I think that's it for me. In the several years I've been thinking on this
problem, I haven't come up with anything but the above three. There may be
a whole new paradigm for argument parsing, but I haven't tried to think
about that one; I have just tried to fit in around ParseTuple.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/