
On 7/2/20 10:19 AM, Victor Stinner wrote:
Do you mean UTF-16 and UTF-32? UTF-16 supports the whole Unicode character set but uses the annoying surrogate pairs for characters outside the BMP.
Minor quibble: UTF-16 handles all of the CURRENTLY defined Unicode set, and there is currently a promise not to extend Unicode past that, but at some point they may need to break that promise.

UTF-8, as previously defined (and as it could be again), easily handles U+00000000 to U+7FFFFFFF. UTF-16 can handle U+00000000 to U+0010FFFF via the surrogate pairs, and stops there. To extend past that would require some form of heroics, which is the reason that U+0010FFFF is currently defined as the highest possible code point: allowing a higher value would break UTF-16, and there currently isn't a desire to do so. At some point in the distant future, we may run out of 'valid' code points, and this promise will need to be broken.

UTF-16 grew out of a need to fix what has become UCS-2, which was the encoding used for earlier Unicode standards, before the need for code points above U+0000FFFF (the end of what is now the BMP) was seen.

-- 
Richard Damon
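
For anyone who wants to see where the U+0010FFFF ceiling comes from, here is a minimal Python sketch of the surrogate-pair arithmetic. The helper name `to_surrogates` is made up for this illustration and is not a standard library function; only `chr()` and `str.encode()` are real built-ins here.

```python
def to_surrogates(cp):
    """Split a code point above the BMP into a UTF-16 (high, low) surrogate pair.

    Hypothetical helper written for illustration only.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = cp - 0x10000            # 20-bit value: 0x00000..0xFFFFF
    high = 0xD800 + (offset >> 10)   # top 10 bits land in U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits land in U+DC00..U+DFFF
    return high, low


# The last available pair is (0xDBFF, 0xDFFF), which maps to U+10FFFF.
# Both halves are at the top of their ranges, so nothing higher fits.
print([hex(unit) for unit in to_surrogates(0x10FFFF)])

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
print(chr(0x10FFFF).encode("utf-16-be").hex())
```

Each surrogate carries 10 bits, so a pair covers 2**20 code points starting at U+10000, which ends exactly at U+10FFFF. Python enforces the same limit: `chr(0x110000)` raises `ValueError`.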