Cult-like behaviour [was Re: Kindness]
rosuav at gmail.com
Mon Jul 16 16:07:33 EDT 2018
On Tue, Jul 17, 2018 at 5:40 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
> Terry Reedy <tjreedy at udel.edu>:
>> On 7/15/2018 5:28 PM, Marko Rauhamaa wrote:
>>> if your new system used Python3's UTF-32 strings as a foundation,
>> Since 3.3, Python's strings are not (always) UTF-32 strings.
> You are right. Python's strings are a superset of UTF-32. More
> accurately, Python's strings are UTF-32 plus surrogate characters.
>> Nor are they always UCS-2 (or partly UTF-16) strings. Nor are they
>> always Latin-1 or ASCII strings. Python's Flexible String
>> Representation uses the narrowest possible internal code for any
>> particular string. This is all transparent to the user except for
>> memory size.
> How CPython chooses to represent its strings internally is not what I'm
> talking about.
Then don't talk about UTF-32, which is a representation format.
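
The memory-size effect of the flexible representation can be seen with a
rough sketch like the one below. It assumes CPython 3.3 or later; the
exact byte counts differ between versions and platforms, so only their
ordering is meaningful. The sample characters are my own choice.

```python
import sys

# Three 1000-character strings whose widest code point needs
# 1, 2 and 4 bytes per character in CPython's internal storage.
latin1 = "a" * 1000
bmp = "\u0151" * 1000          # outside Latin-1, inside the BMP
astral = "\U0001F600" * 1000   # outside the BMP

sizes = [sys.getsizeof(s) for s in (latin1, bmp, astral)]
print(sizes)

# All three strings have the same length in code points.
print(len(latin1), len(bmp), len(astral))
```

The lengths are identical; only the storage differs, which is why this is
an implementation detail rather than a property of the string type.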
>>> UTF-32, after all, is a variable-width encoding.
>> Nope. It is a fixed-width (32 bits, 4 bytes) encoding.
>> Perhaps you should ask more questions before pontificating.
> You mean each code point is one code point wide. But that's rather an
> irrelevant thing to state. The main point is that UTF-32 (aka Unicode)
> uses one or more code points to represent what people would consider an
> individual character.
No, each code point is one code unit wide. It's not irrelevant.
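
A quick way to check this, as a sketch with sample characters of my own
choosing: encode single code points as UTF-32 and as UTF-8 and compare
the byte counts. The "utf-32-le" codec is used so that no byte order mark
is prepended.

```python
samples = ["a", "\u0151", "\U0001F600"]

for ch in samples:
    utf32 = ch.encode("utf-32-le")   # explicit byte order, so no BOM
    utf8 = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}  utf-32: {len(utf32)} bytes  utf-8: {len(utf8)} bytes")

# Width in bytes of each code point under the two encodings.
utf32_widths = [len(ch.encode("utf-32-le")) for ch in samples]
utf8_widths = [len(ch.encode("utf-8")) for ch in samples]
```

Every code point takes one 4-byte code unit in UTF-32, while the UTF-8
width varies with the code point.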
> The letter "a" is encoded as a single code point, but 🇬🇧 (Flag, United
> Kingdom) is two code points wide and 🏴 (Flag, England) is seven (!)
> code points wide, not to forget 🧖♂️ (Man in Steamy Room) with four code
> points. <URL: https://unicode.org/emoji/charts/full-emoji-list.html>
> And of course, regular West-European letters can be represented by
> multiple code points.
> Code points are about as interesting as individual bytes in UTF-8.
Individual bytes in UTF-8 do not have individual meaning. Individual
code points do, with the partial exception of the flag characters
(which are pretty poorly supported anyway). Otherwise, every code
point is either a base character with general meaning, or a combining
character (or variation selector) that represents a specific change.
They can be composed in different ways. For example:
U+006F U+0301 "ó" LATIN SMALL LETTER O WITH ACUTE
U+006F U+030B "ő" LATIN SMALL LETTER O WITH DOUBLE ACUTE
U+0075 U+0301 "ú" LATIN SMALL LETTER U WITH ACUTE
U+0075 U+030B "ű" LATIN SMALL LETTER U WITH DOUBLE ACUTE
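
To see that each code point in such a sequence carries its own meaning,
here is a small sketch using the standard unicodedata module. It looks at
only one of the four pairs above; the variable names are mine.

```python
import unicodedata

decomposed = "o\u030b"   # U+006F followed by U+030B

# Each code point has its own name and its own combining class
# (0 for a base character, non-zero for a combining mark).
for cp in decomposed:
    print(f"U+{ord(cp):04X}", unicodedata.name(cp), unicodedata.combining(cp))

# NFC normalization composes the pair into one precomposed code point.
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed), f"U+{ord(composed):04X}",
      unicodedata.name(composed))
```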
The UTF-8 representations of the combined forms of these characters are:
C3 B3 "ó"
C5 91 "ő"
C3 BA "ú"
C5 B1 "ű"
What does byte value C5 mean? What does 91 mean? None of these has
meaning on its own. The only way you can interpret them is as a full
set. In contrast, each code point has meaning by itself: it is either
a base character or a combining character.
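
The same contrast can be checked from Python. This is only a sketch, and
decodes_alone is a helper name I made up for the illustration: it tries
to decode each byte of a UTF-8 sequence by itself.

```python
composed = "\u0151"                  # precomposed o with double acute
encoded = composed.encode("utf-8")
print(" ".join(f"{b:02X}" for b in encoded))

def decodes_alone(byte_value):
    """Return True if this single byte is valid UTF-8 by itself."""
    try:
        bytes([byte_value]).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Neither the lead byte nor the continuation byte decodes in isolation.
results = [decodes_alone(b) for b in encoded]
print(results)
```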
So, no, individual code points are very interesting.