[issue12729] Python lib re cannot handle Unicode properly due to narrow/wide bug
Terry J. Reedy
report at bugs.python.org
Mon Aug 15 02:26:53 CEST 2011
Terry J. Reedy <tjreedy at udel.edu> added the comment:
Python's narrow builds are, in a sense, 'between' UCS-2 and UTF-16. They support non-BMP chars but only partially, because, BY DESIGN*, indexing and len are by code units, not codepoints. They are documented as being UCS-2 because that is what M-A Lemburg, the original designer and writer of Python's unicode type and the unicode-capable re module, wants them to be called. The link to msg142037, which is one of 50+ messages in the thread (and many or most of the others disagree), pretty well explains his viewpoint. The positive side is that we deliver more than we promise. The negative side is that not promising what perhaps we should allows us not to deliver what perhaps we should.
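To illustrate the code-unit versus codepoint distinction, here is a small sketch. It assumes a Python 3.3+ interpreter, where narrow builds no longer exist, so it only simulates the narrow-build view by encoding to UTF-16. The helper name utf16_code_units is made up for this example.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints.

    This mimics how a narrow build stores a string internally.
    """
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

# One non-BMP character (U+10000) between two BMP characters.
s = "a\U00010000b"
units = utf16_code_units(s)

print(len(s))                     # length in codepoints
print(len(units))                 # length in code units, what a narrow build reports
print([hex(u) for u in units])    # the non-BMP char shows up as a surrogate pair
```

On a narrow build, len and indexing would follow the code-unit list, so s[1] would be a lone high surrogate rather than the whole character.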
*While I think this design decision may have been OK a decade ago for a first implementation of an *optional* text type, I do not think it is OK going forward for revised implementations of what is now *the* text type. I think narrow builds can and should be revised and upgraded to index, slice, and measure by codepoints. Here is my current idea:
If the code unit stream contains any non-BMP characters (i.e., surrogate pairs of 16-bit code units), construct a sequence of *indexes* of such characters (pairs). The fixed length of the string in codepoints is n-k, where n is the number of code units (the current length) and k is the length of the auxiliary sequence, which is also the number of pairs. For indexing, look up the character index in the list of indexes by binary search and increment the codepoint index by the index of the index found to get the corresponding code unit index. (I have omitted the details needed to avoid off-by-1 errors.)
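The idea above can be sketched roughly as follows. This is one possible reading, not the promised prototype: it assumes the auxiliary sequence holds the *codepoint* indexes of the non-BMP characters, and the names NarrowStr, from_text, units, and pairs are invented for illustration. Only integer indexing and len are shown; slicing is left out.

```python
import struct
from bisect import bisect_left

class NarrowStr:
    """Codepoint-indexed view over a sequence of UTF-16 code units."""

    def __init__(self, units):
        self.units = list(units)
        self.pairs = []  # codepoint indexes of characters stored as surrogate pairs
        cp = i = 0
        n = len(self.units)
        while i < n:
            u = self.units[i]
            is_pair = (0xD800 <= u <= 0xDBFF and i + 1 < n
                       and 0xDC00 <= self.units[i + 1] <= 0xDFFF)
            if is_pair:
                self.pairs.append(cp)
            i += 2 if is_pair else 1
            cp += 1

    @classmethod
    def from_text(cls, text):
        # Simulate narrow-build storage by encoding to UTF-16 code units.
        data = text.encode("utf-16-le")
        return cls(struct.unpack("<%dH" % (len(data) // 2), data))

    def __len__(self):
        # n - k: code units minus number of surrogate pairs.
        return len(self.units) - len(self.pairs)

    def __getitem__(self, index):
        if not isinstance(index, int):
            raise TypeError("only integer indexes are supported in this sketch")
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError("index out of range")
        # j = number of pairs strictly before this codepoint, found in O(log k).
        j = bisect_left(self.pairs, index)
        u = index + j  # each earlier pair shifts the code unit index by one
        if j < len(self.pairs) and self.pairs[j] == index:
            hi, lo = self.units[u], self.units[u + 1]
            return chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))
        return chr(self.units[u])

text = "a\U00010000b\U0001F600c"
s = NarrowStr.from_text(text)
print(len(s.units), len(s.pairs), len(s))
print([s[i] for i in range(len(s))] == list(text))
```

Using bisect_left here counts only the pairs strictly before the requested codepoint, which is where the off-by-1 risk lies. When pairs is empty, the lookup reduces to a direct index into units.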
This would make indexing O(log(k)) when there are surrogates. If that is really a problem because k is a substantial fraction of a 'large' n, then one should use a wide build. By using a separate internal class, there would be no time or space penalty for all-BMP text. I will work on a prototype in Python.
PS: The OSCON link in msg142036 currently gives me a 404 not found.
Python tracker <report at bugs.python.org>