Cult-like behaviour [was Re: Kindness]

Marko Rauhamaa marko at pacujo.net
Mon Jul 16 15:40:13 EDT 2018


Terry Reedy <tjreedy at udel.edu>:

> On 7/15/2018 5:28 PM, Marko Rauhamaa wrote:
>> if your new system used Python3's UTF-32 strings as a foundation,
>
> Since 3.3, Python's strings are not (always) UTF-32 strings.

You are right. Python's strings are a superset of UTF-32. More
accurately, a Python string is a sequence of code points that may also
contain lone surrogate code points (U+D800..U+DFFF), which UTF-32
itself does not allow.
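
To illustrate the surrogate point, here is a minimal sketch. It assumes
CPython 3.3 or later; the variable names are only for illustration. A
lone surrogate is accepted as a str element, but the strict UTF-32 codec
refuses to encode it.

```python
# A lone surrogate is a legal element of a Python str...
lone = "\ud800"
print(len(lone))

# ...but it is not a legal UTF-32 scalar value, so strict encoding fails.
try:
    lone.encode("utf-32")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)

# An ordinary code point encodes to exactly four bytes in UTF-32.
ordinary = "a".encode("utf-32-le")
print(len(ordinary))
```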

> Nor are they always UCS-2 (or partly UTF-16) strings. Nor are they
> always Latin-1 or ASCII strings. Python's Flexible String
> Representation uses the narrowest possible internal code for any
> particular string. This is all transparent to the user except for
> memory size.

How CPython chooses to represent its strings internally is not what I'm
talking about.

>> UTF-32, after all, is a variable-width encoding.
>
> Nope.  It is a fixed-width (32 bits, 4 bytes) encoding.
>
> Perhaps you should ask more questions before pontificating.

You mean each code point is one code point wide. But that's a rather
irrelevant thing to state. The main point is that UTF-32 (aka Unicode)
uses one or more code points to represent what people would consider an
individual character.

The letter "a" is encoded as a single code point, but 🇬🇧 (Flag, United
Kingdom) is two code points wide and 🏴 (Flag, England) is seven (!)
code points wide, not to forget 🧖‍♂️ (Man in Steamy Room) with four code
points. <URL: https://unicode.org/emoji/charts/full-emoji-list.html>
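
The counts above can be checked from Python, where len() counts code
points. This is only a sketch: the strings are spelled out with escapes
so that nothing depends on the mail client preserving the emoji, and
the variable names are mine.

```python
# Regional indicators G + B
uk = "\U0001F1EC\U0001F1E7"

# Black flag + tag characters "gbeng" + cancel tag
england = ("\U0001F3F4"
           "\U000E0067\U000E0062\U000E0065\U000E006E\U000E0067"
           "\U000E007F")

# Person in steamy room + zero-width joiner + male sign + variation selector
steamy = "\U0001F9D6\u200D\u2642\uFE0F"

for name, s in [("a", "a"), ("uk", uk), ("england", england), ("steamy", steamy)]:
    # len() counts code points; UTF-32 spends four bytes on each of them
    print(name, len(s), len(s.encode("utf-32-le")))
```

Each of these displays as one character where the font supports it, yet
none of the last three is a single code point.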

And of course, regular West-European letters can be represented by
multiple code points.
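
For example, "é" can be a single precomposed code point or an "e"
followed by a combining acute accent. A small sketch using the standard
unicodedata module (the variable names are mine):

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" + COMBINING ACUTE ACCENT

# Same character to a reader, different code point sequences
print(len(composed), len(decomposed))
print(composed == decomposed)

# Normalization maps one form onto the other
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, nfd == decomposed)
```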

Code points are about as interesting as individual bytes in UTF-8.


Marko

