[Python-Dev] len(chr(i)) = 2?

Thu Nov 25 05:37:33 CET 2010

On Wed, Nov 24, 2010 at 9:17 PM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
..
>  > I note that an opinion has been raised on this thread that
>  > if we want compressed internal representation for strings, we should
>  > use UTF-8.  I tend to agree, but UTF-8 has been repeatedly rejected as
>  > too hard to implement.  What makes UTF-16 easier than UTF-8?  Only the
>  > fact that you can ignore bugs longer, in my view.
>
> That's mostly true.  My guess is that we can probably ignore those
> bugs for as long as it takes someone to write the higher-level
> libraries that James suggests and MAL has actually proposed and
> started a PEP for.
>

As far as I can tell, that PEP generated a grand total of one comment in
nine years.  This may or may not be indicative of how far away we are
from seeing it implemented.  :-)

As for the UTF-8 vs. UCS-2/4 debate, I have an idea that may be even
more far-fetched.  Once upon a time, Python Unicode strings supported
the buffer protocol and would lazily fill an internal buffer with bytes
in the default encoding.  In 3.x the default encoding has been fixed as
UTF-8 and buffer protocol support was removed from strings, but the
internal buffer caching the (now UTF-8) encoded representation remained.
Maybe we can now implement the defenc logic in reverse.  Recall that
strings are stored as UCS-2/4 sequences, but once a buffer is requested
in 2.x Python code, or a char* is obtained via
_PyUnicode_AsStringAndSize() at the C level in 3.x, an internal buffer
is filled with UTF-8 bytes and defenc is set to point to that buffer.
So the idea is for strings to store their data as a UTF-8 buffer
pointed to by defenc upon construction.  If an application uses string
indexing, UTF-8-only strings will lazily fill their UCS-2/4 buffer.
Proper, Unicode-aware algorithms such as grapheme, word or line
iteration, or simple operations such as concatenation, search or
substitution, would operate directly on defenc buffers.  Presumably
over time fewer and fewer applications would use code unit indexing
that requires a UCS-2/4 buffer, and eventually Python strings can stop
supporting indexing altogether, just like they stopped supporting the
buffer protocol in 3.x.
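
To make the idea a bit more concrete, here is a rough pure-Python sketch
of the reversed logic.  Everything in it is hypothetical: the class name
LazyStr, the attributes _utf8 and _codepoints, and the helper _from_utf8
are made up for illustration, and an ordinary str stands in for the
UCS-2/4 buffer.  It is not how the C implementation works, just a model
of which operations would need which buffer.

```python
class LazyStr:
    """Hypothetical string type: UTF-8 is the primary storage and the
    fixed-width buffer is filled only when indexing is requested."""

    def __init__(self, text):
        # Plays the role of the buffer pointed to by defenc.
        self._utf8 = text.encode("utf-8")
        # Plays the role of the UCS-2/4 buffer; filled lazily.
        self._codepoints = None

    @classmethod
    def _from_utf8(cls, data):
        # Internal helper (assumption): wrap bytes already known to be
        # valid UTF-8 without decoding them.
        obj = cls.__new__(cls)
        obj._utf8 = data
        obj._codepoints = None
        return obj

    def __add__(self, other):
        # Concatenating two valid UTF-8 sequences gives valid UTF-8,
        # so no decoding is needed.
        return LazyStr._from_utf8(self._utf8 + other._utf8)

    def __contains__(self, other):
        # A complete UTF-8 encoded needle can only match at a character
        # boundary, so a plain byte search is sufficient.
        return other._utf8 in self._utf8

    def __getitem__(self, index):
        # Indexing is the only operation here that needs the
        # fixed-width view, so this is where it gets built.
        if self._codepoints is None:
            self._codepoints = self._utf8.decode("utf-8")
        return self._codepoints[index]

    def __str__(self):
        return self._utf8.decode("utf-8")
```

With this model, concatenation and search never touch _codepoints; only
the first s[i] pays for the decode.  For example, after
s = LazyStr("na\u00efve ") + LazyStr("\U0001F600") the attribute
s._codepoints is still None, and it is filled only once s[2] is
evaluated.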