[issue10542] Py_UNICODE_NEXT and other macros for surrogates

Wed Dec 29 18:36:17 CET 2010

Alexander Belopolsky <belopolsky at users.sourceforge.net> added the comment:

On Wed, Dec 29, 2010 at 7:19 AM, Marc-Andre Lemburg
<report at bugs.python.org> wrote:
..
> * The macros still need some more attention to enhance their performance.
>
Although I made your suggested change from '-' to '&', I seriously
doubt that this would make any difference on modern CPUs.  Why do you
think these macros are performance critical?  Users with lots of
supplementary characters in their files are probably better off with a
wide build where Py_UNICODE_NEXT() is just *ptr++ and can hardly be
further optimized.  Higher performance algorithms are possible, but
those should probably do some loop unrolling and/or process more than
one character at a time.  At this point, however it is too soon to
worry about optimization before we even know where these macros will
be used.

> * For consistency, I'd choose names Py_UNICODE_READ_NEXT()
>  and Py_UNICODE_WRITE_NEXT() instead of Py_UNICODE_NEXT() and
>  Py_UNICODE_PUT_NEXT().
>

I would leave it for you and Raymond to reach a consensus.  My
understanding is that Raymond does not want "next" in the name, so
your suggestion still conflicts with that.  I would mildly prefer
GET/PUT over READ/WRITE because the latter suggests multiple
characters.

As discussed before, the macro prefix does not imply the return value.
 Compare this to Py_UNICODE_ISSPACE() and friends or pretty much any
other Py_UNICODE_ macro.   Note that I added a leading underscore to
Py_UNICODE_JOIN_SURROGATES and other new macros, so there is no
immediate pressure to get the names perfect.

> * The macros need to be carefully documented, both in unicodeobject.h
>  and the general docs.
>

I've added a description above _Py_UNICODE_*NEXT macros.  I would
really like to see these macros in private use for a while before they
are published for general audience.  I'll add a comment describing
_Py_UNICODE_JOIN_SURROGATES.  The remaining macros seem to be fairly
self-explanatory (unlike, say Py_UNICODE_ISDIGIT or Py_UNICODE_ISTITLE
which are not documented in unicodeobject.h.)

Explicit downcasts would probably make sense, for example *(ptr)++ =
(Py_UNICODE)ch instead of *(ptr)++ = ch, but I don't think we need
explicit casts say in Py_UCS4 code = (ch) - 0x10000; where they can
mask coding errors.

I also looked for the use of casts elsewhere in unicodeobject.h and
the following does not look right:

#define Py_UNICODE_ISSPACE(ch) \
    ((ch) < 128U ? _Py_ascii_whitespace[(ch)] : _PyUnicode_IsWhitespace(ch))

It looks like this won't work right if ch is a signed char.

> * Same for your _Py_UNICODE_NEXT() to make sure that the return
>  value is indeed a Py_UNICODE value.
>

The return value of _Py_UNICODE_NEXT()  is *not* Py_UNICODE, it is
Py_UCS4 and as far as I can see, every conditional branch in narrow
case has an explicit cast.  In the wide case, I don't think we want an
explicit cast because ptr should already be Py_UCS4* and if it is not,
it may be a coding error that we don't want to mask.

> * In general, we should probably be clear on the allowed input
>  and define the output types in the documentation.

I agree.  I'll add a note that ptr and end should be Py_UNICODE*.  I
am not sure what we should say about ch argument.  If we add casts,
the macro will accept anything, but we should probably document it as
expecting Py_UCS4.

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue10542>
_______________________________________