On Sep 14, 2004, at 2:54 AM, Terry Reedy wrote:
> This is why I am not especially enamored of Unicode and the prospect of Python becoming married to it. It is heavily weighted in favor of efficiently representing Chinese and inefficiently representing English. To give English equivalent treatment, the 20,000 or so most common words, roots, prefixes, and suffixes would each get its own codepoint.
Of course, it is perfectly possible for the Python Unicode implementation to represent some Unicode strings with only 8 bits per character. There is no conceptual reason it could not represent (u'a' * 8) with 8 bytes plus class header overhead. That is simply an implementation detail and has nothing to do with Unicode itself. It would also be possible to use UTF-8 string storage, although this has the tradeoff that indexing an element takes time linear in its position instead of constant time.

James
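
P.S. To make the storage and indexing tradeoff concrete, here is a rough Python sketch. It is only an illustration, not how any Python implementation actually stores strings, and the helper name utf8_char_at is made up for this example. The first part compares how many bytes the same eight-character string needs at different fixed widths; the second part shows why finding the i-th character in UTF-8 data means scanning from the start.

```python
# Storage cost of u'a' * 8 at different fixed widths per character.
text = u'a' * 8
bytes_8bit = len(text.encode('latin-1'))     # 1 byte per character
bytes_16bit = len(text.encode('utf-16-le'))  # 2 bytes per character here
bytes_32bit = len(text.encode('utf-32-le'))  # 4 bytes per character


def utf8_char_at(data, index):
    """Return the character at `index` in UTF-8 encoded `data`.

    Hypothetical helper: characters have variable width in UTF-8, so we
    cannot jump straight to an offset and must walk the bytes instead.
    """
    if index < 0:
        raise IndexError('negative index not supported in this sketch')
    seen = -1      # index of the most recent character whose lead byte we passed
    start = None   # byte offset where the requested character begins
    for pos, byte in enumerate(bytearray(data)):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts
        # a new character.
        if byte & 0xC0 != 0x80:
            seen += 1
            if seen == index:
                start = pos
            elif seen == index + 1:
                return data[start:pos].decode('utf-8')
    if start is None:
        raise IndexError('string index out of range')
    return data[start:].decode('utf-8')


# Mixed-width sample: 1-byte, 2-byte, 3-byte, and 1-byte characters.
sample = u'a\u00e9\u4e2db'
encoded = sample.encode('utf-8')
```

With a fixed-width layout, character i lives at byte offset i * width, so lookup is constant time; utf8_char_at(encoded, i) has to inspect every byte before the one it wants.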