[Cython] Py_UNICODE* string support

Sun Mar 3 14:40:54 CET 2013

On Sun, 03 Mar 2013 15:32:36 +0600, Stefan Behnel <stefan_ml at behnel.de> wrote:

> 1) I would like to get rid of UnicodeConst. A Py_UNICODE* is not different
> from any other C array, except that it can coerce to and from Unicode
> strings. So the representation of a literal should be a (properly reference
> counted) Python Unicode object, and users would be allowed to cast them
> to <Py_UNICODE*>, just as we support it for <char*> and bytes.

I understand the idea. Since Python unicode literals are implicitly
coercible to Py_UNICODE*, there appears to be no need for C-level
Py_UNICODE[] literals. Indeed, client code will look exactly (!) the same
whether they are supported or not.

Except when it comes to nogil. (For example, native callbacks are almost
guaranteed to be nogil.) Hiding Python operations in what appears to be
pure C-level code will break users' assumptions.
This is the #1 reason why I went for C-level literals. The #2 reason is
efficiency on Py3.3: C-level literals don't need conversions and don't call
any conversion APIs.

> 2) non-BMP literals should be supported by representing them as normal
> Unicode strings and creating the Py_UNICODE representation at need (i.e.
> explicitly through a cast, at runtime). Py_UNICODE[] literals are simply
> not portable.

Py_UNICODE[] literals can be made fully portable if non-BMP ones are
wrapped like this:

    #ifdef Py_UNICODE_WIDE
    static const Py_UNICODE k_xxx[] = { <UTF-32 array without surrogates>, 0 };
    #else
    static const Py_UNICODE k_xxx[] = { <UTF-16 array with surrogates>, 0 };
    #endif

Literals containing only BMP chars are already portable and don't need
this wrapping.
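
For illustration, here is a minimal self-contained sketch of that wrapping for a single non-BMP character, U+1F600. It is not the code generated by the pull request. So that it compiles without Python.h, DEMO_UNICODE_WIDE and demo_unicode are hypothetical stand-ins for Py_UNICODE_WIDE and Py_UNICODE, and k_smile, demo_units and demo_first_code_point are made-up names.

```c
#include <stddef.h>
#include <stdint.h>

/* DEMO_UNICODE_WIDE is a hypothetical stand-in for Py_UNICODE_WIDE,
   and demo_unicode for Py_UNICODE, so that no Python headers are needed. */
#ifdef DEMO_UNICODE_WIDE
typedef uint32_t demo_unicode;
/* Wide build: one UTF-32 code unit, no surrogates. */
static const demo_unicode k_smile[] = { 0x1F600, 0 };
#else
typedef uint16_t demo_unicode;
/* Narrow build: the same character as a UTF-16 surrogate pair. */
static const demo_unicode k_smile[] = { 0xD83D, 0xDE00, 0 };
#endif

/* Number of code units before the terminating zero. */
static size_t demo_units(const demo_unicode *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Decode the first code point, combining a surrogate pair on narrow builds. */
static uint32_t demo_first_code_point(const demo_unicode *s)
{
#ifndef DEMO_UNICODE_WIDE
    if (s[0] >= 0xD800 && s[0] <= 0xDBFF && s[1] >= 0xDC00 && s[1] <= 0xDFFF)
        return 0x10000 + ((uint32_t)(s[0] - 0xD800) << 10)
                       + (uint32_t)(s[1] - 0xDC00);
#endif
    return s[0];
}
```

Either branch yields the same code point when decoded; only the number of code units in the array differs between the two builds.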

> 3) __Pyx_Py_UNICODE_strlen() is ok, but only for the special case that
> all we have is a Py_UNICODE*. As long as we are dealing with Unicode string
> objects, that won't be needed, so len() should be constant time in the
> normal case instead of linear time.

len(Py_UNICODE*) simply mirrors len(char*). Its purpose is to provide a
platform-independent Py_UNICODE_strlen (which is Py3 only and deprecated
in 3.3).
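
As a rough sketch of what such a helper amounts to (not the actual __Pyx_Py_UNICODE_strlen source), it is a linear scan for the terminating zero, just like strlen() for char*. Here unit_t is a hypothetical stand-in for Py_UNICODE and sketch_unicode_strlen is a made-up name.

```c
#include <stddef.h>

/* unit_t is a hypothetical stand-in for Py_UNICODE (assumed 16-bit here). */
typedef unsigned short unit_t;

/* Count code units up to, but not including, the terminating zero.
   Linear time in the length of the array, like strlen() for char*. */
static size_t sketch_unicode_strlen(const unit_t *u)
{
    const unit_t *p = u;
    while (*p != 0)
        ++p;
    return (size_t)(p - u);
}
```

The result counts code units, not characters, so on a narrow build a non-BMP character stored as a surrogate pair contributes 2.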

> So, the basic idea would be to use Unicode strings and their (optional)
> internal representation as Py_UNICODE[] instead of making Py_UNICODE[] a
> first class data type. And then go from there and optimise certain things
> to use the unpacked array directly, so that users won't need to put
> explicit C-API calls into their code.

Please reconsider your decision wrt C-level literals.
I believe that nogil code and a bit of efficiency (on 3.3) justify their
existence. (char* strings do have C-level literals, and Py_UNICODE* is in
the same basket when it comes to Windows code.)
The code to support them is also small and well-contained.
I've updated my pull request to fully support non-BMP Py_UNICODE[]
literals.

If you are still not convinced, so be it, I'll drop C-level literal
support.

Best regards,
Nikita Nemkin

PS. I made a false claim in the previous mail. (Some of) Python's wchar_t
APIs do exist in Py2. But they won't manage the memory automatically anyway.