[issue10542] Py_UNICODE_NEXT and other macros for surrogates

Wed Aug 17 12:07:22 CEST 2011

Marc-Andre Lemburg <mal at egenix.com> added the comment:

STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner at haypocalc.com> added the comment:
> 
> Le 17/08/2011 07:04, Ezio Melotti a écrit :
>> As I said in msg142175 I think the Py_UNICODE_IS{HIGH|LOW|}SURROGATE and Py_UNICODE_JOIN_SURROGATES can be committed without trailing _ in 3.3 and with trailing _ in 2.7/3.2.  They should go in unicodeobject.h

Ezio used two different naming schemes in his email. Please always
use Py_UNICODE_... or _Py_UNICODE (not PyUNICODE_ or _PyUNICODE_).

> For Python 2.7 and 3.2, I would prefer to not touch a public header, and 
> so add the macros in unicodeobject.c.

Why would you want to touch Python 2.7 at all ?

>> and be public in 3.3+.
> 
> If you want to make my HIGH_SURROGATE and LOW_SURROGATE macros public, 
> they will use to substract 0x10000 themself (whereas my macros require 
> the ordinal to be preproceed).

This can be done by having two definitions of the macros: one set for
UCS2 builds and one for UCS4.

>>   * _Py_UNICODE_NEXT and _Py_UNICODE_PUT_NEXT are useful, so once we have agreed about the name they can go in.  They can be private in all the 3 branches and made public in 3.4 if they work well;
> 
> Note: I don't think that _Py_UNICODE*NEXT should go into Python 2.7 or 3.2.

Certainly not into Python 2.7. Adding macros in patch level releases is
also not such a good idea.

>>   * IS_NONBMP doesn't simplify much the code but makes it more readable.  ICU has U_IS_BMP, but in most of the cases we want to check for non-BMP, so if we add this macro it might be ok to check for non-BMP;
> 
> If you want to make it public, it's better to call it PyUNICODE_IS_BMP() 
> (check if the argument is in U+0000-U+FFFF).

Py_UNICODE_IS_BMP() please.

>>   * I'm not sure HIGH_SURROGATE/LOW_SURROGATE are useful with _Py_UNICODE_NEXT.  If they are they should get a better name because the current one is not clear about what they do.
> 
> They are still useful for UTF-16 encoders (to UTF-16-LE/BE and 16-bit 
> wchar_t*). We can keep HIGH_SURROGATE and LOW_SURROGATE private in 
> unicodeobject.c.
>
>> Unless someone disagrees I'll prepare a patch with PyUNICODE_IS_{HIGH_|LOW_|}SURROGATE and Py_UNICODE_JOIN_SURROGATES for unicodeobject.h, using them where necessary, using with Victor implementation and commit it (after a review).
> 
> Cool. I suppose that you mean PyUNICODE_JOIN_SURROGATES (not 
> Py_UNICODE_JOIN_SURROGATES). I used the verb "combine", taken from a 
> comment in unicodeobject.c. "combine" is maybe better than "join"?

No, Py_UNICODE_... please !

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com

________________________________________________________________________
2011-10-04: PyCon DE 2011, Leipzig, Germany                48 days to go

::: Try our new mxODBC.Connect Python Database Interface for free ! ::::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611
               http://www.egenix.com/company/contact/

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue10542>
_______________________________________