[Python-Dev] Unicode support in getargs.c

M.-A. Lemburg mal@lemburg.com
Thu, 03 Jan 2002 11:34:17 +0100


"Martin v. Loewis" wrote:
> 
> > I have a number of MacOSX API's that expect Unicode buffers, passed as
> > "long count, UniChar *buffer".
> 
> Well, my first question would be: Are you sure that UniChar has the
> same underlying integral type as Py_UNICODE? If not, you lose.
> 
> So you may need to do even more conversion.

This should be the first thing to check. Also note that Python
has two different flavors of Unicode support: UCS-2 and UCS-4,
so you'll have to be careful about this too.
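The width mismatch Martin warns about can be handled mechanically. Below is a minimal plain-C sketch with hypothetical typedefs standing in for the two element types (MacOS UniChar is a 16-bit UTF-16 code unit, while a --enable-unicode=ucs4 Python build makes Py_UNICODE 32 bits wide); the function name is invented for illustration:

```c
#include <stdint.h>

/* Hypothetical stand-ins: MacOS UniChar is a 16-bit UTF-16 code unit;
   on a UCS-4 Python build, Py_UNICODE is 32 bits wide. The buffer then
   cannot be passed through directly and must be narrowed element by
   element. BMP-only sketch: code points above U+FFFF would need a
   surrogate pair, which this deliberately skips. */
typedef uint16_t UniChar;
typedef uint32_t ucs4_char;

/* Returns the number of elements converted, or -1 if a non-BMP code
   point was encountered. */
static long
narrow_ucs4_to_utf16(const ucs4_char *src, long n, UniChar *dst)
{
    long i;
    for (i = 0; i < n; i++) {
        if (src[i] > 0xFFFF)
            return -1;              /* would need a surrogate pair */
        dst[i] = (UniChar)src[i];
    }
    return n;
}
```

On a UCS-2 build the sizes agree and a straight pointer pass suffices; the loop is only needed on UCS-4 builds.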
 
> > I have the machinery in bgen to generate code for this, iff "u#" (or
> > something else) would work the same as "s#", i.e. it returns you a
> > pointer and a size, and it would work equally well for unicode
> > objects as for classic strings (after conversion).
> 
> I see. u# could be made to work for Unicode objects alone, but it
> would have to reject string objects.

Martin, I don't agree here: string objects could hold binary UCS-2/UCS-4 
data.

Jack, u# cannot auto-convert strings to Unicode, since that would
require allocating a temporary object, and getargs.c has no logic
to free that object after use.

es# has logic in place which lets you either copy the raw data to
a buffer you provide or have it allocate a buffer of the right
size for you. That's why I proposed extending it to support
Unicode raw data as well.

> > But as a general solution it doesn't look right: "How do I call a C
> > routine with a string parameter?" "Use the "s" format and you get the
> > string pointer to pass". "How do I call a C routine with a unicode string
> > parameter?"
> 
> For that, the answer is u. But you want the length also. So for that,
> the answer is u#. But your question is "How do I call a C routine with
> either a Unicode object or a string object, getting a reasonable
> Py_UNICODE* and the length?".
> 
> For that, I'd recommend to use O&, with a conversion function
> 
> static int Py_UnicodeOrString(PyObject *o, void *addr){
>   PyObject **result = (PyObject **)addr;
>   if (PyUnicode_Check(o)){
>     Py_INCREF(o); *result = o; return 1;
>   }
>   if (PyString_Check(o)){
>     *result = PyUnicode_FromObject(o);
>     return *result != NULL;
>   }
>   PyErr_SetString(PyExc_TypeError,"unicode object expected");
>   return 0;
> }

Martin, note that PyUnicode_FromObject() already does the Unicode
pass-through (even more: it makes sure that you get a true Unicode
object, not a subclass).
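In use, the O& recommendation might look like the following. This is a sketch against the Python 2.x C API under discussion (Py_UNICODE, PyUnicode_GET_SIZE and the PyString APIs are long gone from modern CPython, so this no longer compiles today); my_func and its argument handling are hypothetical:

```c
#include <Python.h>

/* Hypothetical method using the O& converter sketched above. */
static PyObject *
my_func(PyObject *self, PyObject *args)
{
    PyObject *uni = NULL;
    Py_UNICODE *p;
    int n;

    /* O& invokes Py_UnicodeOrString and stores the new reference
       it produces into uni. */
    if (!PyArg_ParseTuple(args, "O&", Py_UnicodeOrString, &uni))
        return NULL;

    p = PyUnicode_AS_UNICODE(uni);
    n = PyUnicode_GET_SIZE(uni);
    /* ... hand (n, p) to the MacOS API here ... */

    Py_DECREF(uni);          /* the single DECREF Martin mentions */
    Py_INCREF(Py_None);
    return Py_None;
}
```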
 
> > "Use O and PyUnicode_FromObject() and PyUnicode_AsUnicode and
> > make sure you get all your decrefs right and.....".
> 
> With the function above, this becomes
> 
> Use O&, passing the conversion function and the address of a
> PyObject* to receive the result, using PyUnicode_AS_UNICODE and
> PyUnicode_GET_SIZE, performing a single DECREF at the end [support
> for specifying an encoding is optional]
> 
> In this scenario, somebody *has* to deallocate memory, you cannot get
> around this. It is your choice whether this is Py_DECREF or PyMem_Free
> that you have to call (as with the "esomething" conversions); the
> DECREF is more efficient as it will not copy a Unicode object.
>
> > The "es#" is a very strange beast, and a similar "eu#" would help me a
> > little, but it has some serious drawbacks. Aside from it being completely
> different from the other converters (being a prefix operator instead of a
> > postfix one, and having a value-return argument) I would also have to
> > pre-allocate the buffer in advance, and that sort of defeats the purpose.
> 
> You don't. If you set the buffer to NULL before invoking getargs, you
> have to PyMem_Free it afterwards.

Right.

Let me see if I can summarize this:

Jack wants to get string and Unicode objects converted to Unicode 
automagically and then receive a pointer to a Py_UNICODE buffer and
a size. 

The current solution for this is to use the "O" parser,
fetch the object, pass it through PyUnicode_FromObject(), then
use PyUnicode_GET_SIZE() and PyUnicode_AS_UNICODE() to access
the Py_UNICODE buffer and finally to Py_DECREF() the object returned
by PyUnicode_FromObject().
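Spelled out, that current solution might look like this. Again a sketch against the Python 2.x C API of the time (not compilable on modern CPython); the function name take_text is invented:

```c
#include <Python.h>

/* Hypothetical method showing the "O" + PyUnicode_FromObject() pattern. */
static PyObject *
take_text(PyObject *self, PyObject *args)
{
    PyObject *obj, *uni;
    Py_UNICODE *p;
    int n;

    if (!PyArg_ParseTuple(args, "O", &obj))
        return NULL;

    /* Accepts both strings and Unicode objects; true Unicode objects
       pass through (with a new reference). */
    uni = PyUnicode_FromObject(obj);
    if (uni == NULL)
        return NULL;

    p = PyUnicode_AS_UNICODE(uni);
    n = PyUnicode_GET_SIZE(uni);
    /* ... use (n, p) here ... */

    Py_DECREF(uni);    /* release the reference from PyUnicode_FromObject */
    Py_INCREF(Py_None);
    return Py_None;
}
```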

What I proposed was to extend the "es#" parser marker with a new
modifier, "eu#", which does all of the above except that it copies
the Py_UNICODE data either to a buffer you provide or to a newly
allocated buffer, which you then have to PyMem_Free() after use.
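Usage of the proposed marker might then look as follows. This is entirely hypothetical -- "eu#" was a proposal, not an existing API, and its exact signature (for instance whether an encoding argument comes first, as it does for "es#") was not settled:

```c
/* Hypothetical "eu#" usage, by analogy with "es#": presetting the
   buffer to NULL makes the parser allocate, and the caller releases
   the buffer with PyMem_Free(). */
static PyObject *
proposed_usage(PyObject *self, PyObject *args)
{
    Py_UNICODE *buf = NULL;   /* NULL: let the parser allocate */
    int len = 0;

    if (!PyArg_ParseTuple(args, "eu#", &buf, &len))
        return NULL;
    /* ... hand (len, buf) to the MacOS API ... */
    PyMem_Free(buf);
    Py_INCREF(Py_None);
    return Py_None;
}
```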

How does this sound?

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                   http://www.egenix.com/files/python/