[issue10542] Py_UNICODE_NEXT and other macros for surrogates
report at bugs.python.org
Sat Nov 27 23:03:20 CET 2010
Marc-Andre Lemburg <mal at egenix.com> added the comment:
I like the idea and thanks for putting work into this.
* when using macro variables, always put the variables in parens in the expansion; this avoids precedence issues, weird syntax errors, etc. - even if it may not be necessary
* a function would be cleaner, but since this code is very performance sensitive, I'd opt for the macro version, unless someone can prove that a function would be just as fast in benchmarks
* the macros should be documented in the unicodeobject.h header file and clearly mention that ptr and end should be side-effect free and that ptr must an lvalue
* please use the faster bitmask operators for joining surrogates, i.e. ucs4 = ((((high & 0x03FF) << 10) | (low & 0x03FF)) + 0x00010000);
* the Py_UNICODE_JOIN_SURROGATES() macro should use Py_UCS4 as prefix since it returns Py_UCS4 values, i.e. Py_UCS4_JOIN_SURROGATES()
* same for the Py_UNICODE_NEXT() macro, i.e. Py_UCS4_NEXT()
* in order to make the macro easier to understand, please rename it to Py_UCS4_READ_CODE_POINT(); that's a little more typing, but still a lot less than without the macro :-)
* this version should be slightly faster and is also easier to read:
#define Py_UCS4_READ_CODE_POINT(ptr, end) \
((Py_UNICODE_ISHIGHSURROGATE((ptr)) && \
(ptr) < (end) && \
Py_UNICODE_ISLOWSURROGATE((ptr))) ? \
Py_UNICODE_JOIN_SURROGATES((ptr)++, (ptr)++) : \
I haven't tested it, but you get the idea.
BTW: You only focus on UCS2 builds. Please also make sure that these changes work on UCS4 builds, e.g. Py_UCS2_READ_CODE_POINT() will also work on UCS4 builds and join code points there.
Note that UCS4 builds currently don't join surrogates, so a high and low surrogates appear as two code points, which they are, but given the experience with UCS2 builds, may not be what the user expects. So for the purpose of consistency we should be careful with auto-joining surrogates in UCS2.
It does make sence for ord() and the various string methods, but should be done with care in other cases.
In any case, we should clearly document where these macros are used and warn about the implications of using them in the wrong places.
Python tracker <report at bugs.python.org>
More information about the Python-bugs-list