[issue10521] str methods don't accept non-BMP fillchar on a narrow Unicode build

Ezio Melotti report at bugs.python.org
Thu Nov 25 07:28:50 CET 2010


Ezio Melotti <ezio.melotti at gmail.com> added the comment:

I think that methods like str.isalpha can and should be fixed. Since _PyUnicode_IsAlpha now accepts a Py_UCS4, the body of unicode_isalpha can be changed to convert both ordinary (BMP) characters and surrogate pairs to a Py_UCS4 before calling Py_UNICODE_ISALPHA.
The attached patch is a proof of concept of this approach and returns True for '\N{OLD ITALIC LETTER A}'.isalpha() on a narrow build.
It still has a number of issues that should be addressed (check for narrow builds, check for lone surrogates, check for high surrogate at the end of a string, fix compiler warnings ...) but it should be good enough as a PoC.
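
As an illustration of the approach described above (not the attached patch itself), here is a minimal standalone sketch in C. It assumes uint16_t as a stand-in for Py_UNICODE on a narrow build and uint32_t as a stand-in for Py_UCS4. The names demo_isalpha and utf16_isalpha are hypothetical, and demo_isalpha is only a toy replacement for Py_UNICODE_ISALPHA that knows ASCII letters and a few Old Italic letters.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy predicate standing in for Py_UNICODE_ISALPHA.
   It only recognizes ASCII letters and U+10300..U+1031E
   (U+10300 is OLD ITALIC LETTER A). */
static int demo_isalpha(uint32_t ch)
{
    if ((ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z'))
        return 1;
    return ch >= 0x10300 && ch <= 0x1031E;
}

/* Returns 1 if the string is non-empty and every code point is
   alphabetic according to demo_isalpha, 0 otherwise. */
static int utf16_isalpha(const uint16_t *s, size_t len)
{
    size_t i = 0;

    if (len == 0)
        return 0;

    while (i < len) {
        uint32_t ch = s[i++];

        /* High surrogate followed by a low surrogate: combine the
           pair into a single code point above U+FFFF. */
        if (ch >= 0xD800 && ch <= 0xDBFF &&
            i < len && s[i] >= 0xDC00 && s[i] <= 0xDFFF) {
            ch = 0x10000 + ((ch - 0xD800) << 10) + (s[i] - 0xDC00);
            i++;
        }
        /* A lone surrogate (including a high surrogate at the end of
           the string) is passed through unchanged and rejected by
           the predicate. */
        if (!demo_isalpha(ch))
            return 0;
    }
    return 1;
}
```

With this sketch, the two code units 0xD800 0xDF00 (the UTF-16 form of U+10300) are combined before the check, while a lone 0xD800 is treated as non-alphabetic.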

I would also suggest introducing a set of macros to handle surrogates (e.g. to detect and combine them) and using them in all the functions that need to work with surrogates.
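
A rough sketch of what such a set of macros might look like follows. The DEMO_ names are hypothetical and chosen only for illustration; they are not taken from the patch or from the CPython headers. The arithmetic is the standard UTF-16 surrogate encoding.

```c
#include <stdint.h>

/* Detection: code units in the surrogate ranges. */
#define DEMO_IS_SURROGATE(ch)      (0xD800 <= (ch) && (ch) <= 0xDFFF)
#define DEMO_IS_HIGH_SURROGATE(ch) (0xD800 <= (ch) && (ch) <= 0xDBFF)
#define DEMO_IS_LOW_SURROGATE(ch)  (0xDC00 <= (ch) && (ch) <= 0xDFFF)

/* Combination: build one code point from a high/low pair.
   The caller is expected to have checked both halves first. */
#define DEMO_JOIN_SURROGATES(high, low)                   \
    (0x10000 + ((((uint32_t)(high)) & 0x03FF) << 10) +    \
     (((uint32_t)(low)) & 0x03FF))

/* Splitting: the two halves for a code point above U+FFFF. */
#define DEMO_HIGH_SURROGATE(ch) \
    (0xD800 + ((((uint32_t)(ch)) - 0x10000) >> 10))
#define DEMO_LOW_SURROGATE(ch) \
    (0xDC00 + ((((uint32_t)(ch)) - 0x10000) & 0x03FF))
```

Keeping detection and combination separate would let callers decide how to handle lone surrogates before joining a pair.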

----------
keywords: +patch
Added file: http://bugs.python.org/file19809/issue10521-isalpha.diff

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue10521>
_______________________________________

