[Python-Dev] Unicode support in getargs.c

Martin v. Loewis martin@v.loewis.de
Thu, 3 Jan 2002 22:38:56 +0100


> > I see. u# could be made work for Unicode objects alone, but it would
> > have to reject string objects.
> 
> Martin, I don't agree here: string objects could hold binary UCS-2/UCS-4 
> data.

They could. Most likely, they don't. Explicit is better than implicit:
Anybody wishing to pass UCS-2 binary data to a function expecting
character strings should do

  function(unicode(data, "UCS-2BE")) # or LE if appropriate
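
To make the explicit-conversion idea concrete, here is a minimal sketch in
present-day Python 3 terms, where bytes.decode() takes the place of the
unicode() call above. The names takes_text and raw are hypothetical, and
"utf-16-be" is an assumed stand-in codec: it agrees with big-endian UCS-2
only for characters in the Basic Multilingual Plane.

```python
def takes_text(text):
    """Hypothetical function that expects a character string, not raw bytes."""
    if not isinstance(text, str):
        raise TypeError("character string expected")
    return len(text)


# "hi" as big-endian 16-bit code units (assumed sample data).
raw = b"\x00h\x00i"

# The caller decodes explicitly; the function never guesses at an encoding.
decoded = raw.decode("utf-16-be")
count = takes_text(decoded)
```

Passing raw directly would raise TypeError in this sketch, which is the
behaviour argued for here: the caller states the encoding, the callee does not
guess it.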

> es# has logic in place which allows either copying the raw data
> to a buffer you provide or have it allocate a buffer of the
> right size for you. That's why I proposed to extend it to support
> Unicode raw data as well.

Even though es# is cleanly defined, it is still undesirable to use,
IMO: it requires more copies of data than necessary. If explicit
memory management is required, it should be exposed through
Py_DECREF. That is easy to understand, and it allows immutable
objects to be shared, thus avoiding copies.


> > PyObject *Py_UnicodeOrString(PyObject *o, void *ignored){
> >   if (PyUnicode_Check(o)){
> >     Py_INCREF(o);return o;
> >   }
> >   if (PyString_Check(o)){
> >     return PyUnicode_FromObject(o);
> >   }
> >   PyErr_SetString(PyExc_TypeError,"unicode object expected");
> >   return NULL;
> > }
> 
> Martin, note that PyUnicode_FromObject() already does the Unicode
> pass-through (even more: it makes sure that you get a true Unicode
> object, not a subclass).

I noticed. However, I'd like Py_UnicodeOrString to fail if you are not
passing a character string (and I'd see no problem in accepting
Unicode subtypes without copying them). This is a minor point, though
- I might have written

PyObject *Py_UnicodeOrString(PyObject *o, void *ignored){
  return PyUnicode_FromObject(o);
}

as well.
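
The behaviour wanted from the converter (pass character strings through
uncopied, subtypes included; convert byte strings; reject everything else)
can be modelled at the Python level. This is only a rough Python 3 sketch,
not the C code itself: str stands in for the Unicode object, bytes for the
old string object, and the names unicode_or_string and TaggedStr as well as
the "ascii" default encoding are assumptions made for illustration.

```python
def unicode_or_string(obj, encoding="ascii"):
    """Python-level model of the Py_UnicodeOrString converter (hypothetical)."""
    if isinstance(obj, str):
        # Character strings, subclasses included, pass through uncopied.
        return obj
    if isinstance(obj, bytes):
        # Byte strings are decoded; "ascii" is an assumed default encoding.
        return obj.decode(encoding)
    # Anything else is not a character string and is rejected.
    raise TypeError("unicode object expected")


class TaggedStr(str):
    """Hypothetical str subclass used to show the pass-through."""


text = TaggedStr("abc")
same = unicode_or_string(text)        # the very same object, no copy
decoded = unicode_or_string(b"abc")   # decoded into a new str
```

In the sketch, same is the identical object that was passed in, which is the
sharing of immutable objects argued for earlier, and a non-string argument
such as an integer raises TypeError.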

> Jack wants to get string and Unicode objects converted to Unicode 
> automagically and then receive a pointer to a Py_UNICODE buffer and
> a size. 
> 
> The current solution for this is to use the "O" parser,
> fetch the object, pass it through PyUnicode_FromObject(), then
> use PyUnicode_GET_SIZE() and PyUnicode_AS_UNICODE() to access
> the Py_UNICODE buffer and finally to Py_DECREF() the object returned
> by PyUnicode_FromObject().

That is the solution, although I would claim that using the O& parser
is simpler and more flexible.

> What I proposed was to extend the "es#" parser marker with a new
> modifier: "eu#" which does all of the above except that it either
> copies the Py_UNICODE data to a buffer you provide or a newly
> allocated buffer which you then have to PyMem_Free() after usage.
> 
> How does this sound ?

Terrible. It copies a Unicode object without any need. It also adds to
the inflation of format specifiers for getargs; this inflation is
terrible in itself.

Regards,
Martin