Unicode is not UTF-32 [was Re: Cult-like behaviour]
steve+comp.lang.python at pearwood.info
Mon Jul 16 20:58:09 EDT 2018
On Mon, 16 Jul 2018 22:40:13 +0300, Marko Rauhamaa wrote:
> Terry Reedy <tjreedy at udel.edu>:
>> On 7/15/2018 5:28 PM, Marko Rauhamaa wrote:
>>> if your new system used Python3's UTF-32 strings as a foundation,
>> Since 3.3, Python's strings are not (always) UTF-32 strings.
> You are right. Python's strings are a superset of UTF-32. More
> accurately, Python's strings are UTF-32 plus surrogate characters.
The first thing you are doing wrong is conflating the semantics of the
data type with one possible implementation of that data type. UTF-32 is
an implementation, not semantics: it specifies how to represent Unicode
code points as bytes in memory, not what Unicode code points are.
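To illustrate the difference concretely (a minimal sketch using only the
standard library): UTF-32 is just one byte-level spelling of the same
abstract sequence of code points.

```python
# A string is a sequence of code points; an encoding maps those code
# points to bytes. UTF-32 spells every code point as exactly 4 bytes.
s = "A\U0001F600"              # 2 code points: U+0041 and U+1F600

data = s.encode("utf-32-le")   # little-endian, no BOM
print(len(s))                  # 2  (code points -- the semantics)
print(len(data))               # 8  (bytes -- the implementation)
print(data.hex())              # 4100000000f60100
```

The same string encodes to 5 bytes in UTF-8 and 6 in UTF-16-LE; the
string itself is none of these.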
Python 3 strings are sequences of abstract characters ("code points")
with no mandatory implementation. In CPython, some string objects are
encoded in Latin-1. Some are encoded in UTF-16. Some are encoded in
UTF-32. Some implementations (MicroPython) use UTF-8.
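You can watch CPython's Flexible String Representation (PEP 393) pick
the narrowest encoding that fits. The exact byte counts below are
CPython implementation details, so this sketch only compares sizes
rather than asserting absolute numbers:

```python
import sys

# CPython stores each string in the narrowest form that can hold its
# widest code point: 1 byte/char (Latin-1 range), 2 (BMP), or 4 (astral).
ascii_s  = "a" * 1000          # code points <= U+00FF
bmp_s    = "\u0394" * 1000     # GREEK CAPITAL DELTA, <= U+FFFF
astral_s = "\U0001F600" * 1000 # emoji beyond the BMP

# All three are 1000 code points long -- identical semantics...
print(len(ascii_s) == len(bmp_s) == len(astral_s) == 1000)   # True

# ...but progressively wider internal storage:
print(sys.getsizeof(ascii_s)
      < sys.getsizeof(bmp_s)
      < sys.getsizeof(astral_s))                             # True
```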
Your second error is a more minor point: it isn't clear (at least not to
me) that "Unicode plus surrogates" is a superset of Unicode. Surrogates
are part of Unicode. The only extension here is that Python strings are
not necessarily well-formed surrogate-free Unicode strings, but they're
still Unicode strings.
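A quick demonstration of that ill-formedness: a Python str can hold a
lone surrogate, but such a string cannot be encoded to UTF-8 without an
error handler. The surrogateescape handler exploits exactly this to
round-trip arbitrary bytes through str:

```python
# A lone high surrogate is a legal code point in a Python 3 str...
s = "\ud800"
print(len(s))                  # 1

# ...but it is not well-formed Unicode text:
try:
    s.encode("utf-8")
except UnicodeEncodeError as e:
    print("not encodable:", e.reason)

# surrogateescape smuggles undecodable bytes in as low surrogates,
# and turns them back into the original bytes on encoding:
raw = b"\xff"
smuggled = raw.decode("utf-8", "surrogateescape")
print(smuggled == "\udcff")                                  # True
print(smuggled.encode("utf-8", "surrogateescape") == raw)    # True
```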
>> Nor are they always UCS-2 (or partly UTF-16) strings. Nor are they
>> always Latin-1 or ASCII strings. Python's Flexible String
>> Representation uses the narrowest possible internal code for any
>> particular string. This is all transparent to the user except for
>> memory size.
> How CPython chooses to represent its strings internally is not what I'm
> talking about.
Then why do you repeatedly talk about the internal storage representation?
UTF-32 is not a character set, it is an encoding. It specifies how to
implement a sequence of Unicode abstract characters.
>>> UTF-32, after all, is a variable-width encoding.
>> Nope. It is a fixed-width (32 bits, 4 bytes) encoding.
>> Perhaps you should ask more questions before pontificating.
> You mean each code point is one code point wide. But that's rather an
> irrelevant thing to state.
No, he means that each code point is one code unit wide.
> The main point is that UTF-32 (aka Unicode)
UTF-32 is not a synonym for Unicode. Many legacy encodings don't
distinguish between the character set and the mapping between bytes and
characters, but Unicode is not one of those.
> uses one or more code points to represent what people would consider an
> individual character.
That's a reasonable observation to make, but it's not what fixed- and
variable-width refer to.
So does ASCII, and in both cases it is irrelevant: the term of art
defines fixed- and variable-width in terms of *code points*, not
human-meaningful characters. "Character" is context- and language-
dependent and frequently ambiguous. "LL" or "CH" (for example) could be
a single character or a pair of characters, depending on context and
language.
Even in ASCII English, something as large as "ough" might be considered
a single unit of language, which some people might choose to call a
character. (But not a single letter, naturally.) If you don't like that
example, "qu" is probably a better one: aside from acronyms and loan
words, every modern English word follows a Q with a U.
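For what it's worth, the gap between code points and human-perceived
characters is easy to demonstrate in Python itself. This sketch uses
combining characters, but the same point applies to digraphs and emoji
sequences:

```python
import unicodedata

# One user-perceived character, two different code point sequences:
composed   = "\u00e9"      # 'e with acute' as a single precomposed code point
decomposed = "e\u0301"     # 'e' followed by COMBINING ACUTE ACCENT

print(len(composed), len(decomposed))   # 1 2
print(composed == decomposed)           # False: different code point sequences

# Unicode normalization maps between the two forms:
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
```

String length, indexing, and equality in Python all operate on code
points, which is why normalization matters when comparing text.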
> Code points are about as interesting as individual bytes in UTF-8.
That's your opinion. I see no justification for it.
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson