Cult-like behaviour [was Re: Kindness]

Chris Angelico rosuav at gmail.com
Mon Jul 16 14:40:15 EDT 2018


On Tue, Jul 17, 2018 at 4:15 AM, Ian Kelly <ian.g.kelly at gmail.com> wrote:
> On Mon, Jul 16, 2018 at 12:02 PM Terry Reedy <tjreedy at udel.edu> wrote:
>>
>> On 7/15/2018 5:28 PM, Marko Rauhamaa wrote:
>>
>> > if your new system used Python3's UTF-32 strings as a foundation,
>>
>> Since 3.3, Python's strings are not (always) UTF-32 strings.  Nor are
>> they always UCS-2 (or partly UTF-16) strings.  Nor are they always
>> Latin-1 or ASCII strings.  Python's Flexible String Representation uses
>> the narrowest possible internal code for any particular string.  This is
>> all transparent to the user except for memory size.
>>
>> In 3.2 and before, Python's Unicode strings were either wide (UTF-32) or
>> narrow (UCS-2 + surrogates or UTF-16 minus full compliance).  The
>> difference was sometimes not transparent, and code that worked on one
>> build could fail on the other.  Since 3.3, string code should work the
>> same on any machines running the same Python version.
>>
>> > UTF-32, after all, is a variable-width encoding.
>>
>> Nope.  It is a fixed-width (32 bits, 4 bytes) encoding.
>
> Although it only really uses 21 (actually, more like 20.087) of those
> bits. Given that and the similar naming, it's easy to see how people
> sometimes confuse its structure with UTF-8.

Yes, but that's on par with ASCII text putting seven bits' worth of
information into an eight-bit byte. UTF-32 still assigns four bytes
per codepoint, even though you could represent any Unicode character
with just 21 bits (or, as you say, a smidgen over twenty bits).
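
To make the arithmetic concrete, here is a minimal sketch. The helper
name utf32_width_report is made up for illustration, and "utf-32-le" is
used only so that no byte order mark inflates the byte counts:

```python
import math

MAX_CODE_POINT = 0x10FFFF  # highest valid Unicode code point


def utf32_width_report(text):
    """Return (code points, UTF-32 bytes, UTF-8 bytes) for text."""
    return (
        len(text),
        len(text.encode("utf-32-le")),  # always 4 bytes per code point
        len(text.encode("utf-8")),      # 1 to 4 bytes per code point
    )


# Whole bits needed to store the largest code point.
bits_needed = MAX_CODE_POINT.bit_length()

# Information content of the 0x110000 possible code point values.
bits_of_information = math.log2(MAX_CODE_POINT + 1)

for sample in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(utf32_width_report(sample))
```

The UTF-32 count stays at four bytes for every sample while the UTF-8
count varies, which is the fixed-width versus variable-width distinction
being discussed.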

(Nobody's yet proposed a UTF-24, to my knowledge, even though it would
technically work. I suspect that either UTF-32 or UTF-8 would be
superior in any situation where UTF-24 might have been used.)
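
For the curious, a rough sketch of what such a thing might look like.
To be clear, "UTF-24" is not a real encoding; the function names and the
big-endian three-byte layout below are assumptions invented for this
example:

```python
def utf24_encode(text):
    """Pack each code point into exactly three big-endian bytes (hypothetical)."""
    return b"".join(ord(ch).to_bytes(3, "big") for ch in text)


def utf24_decode(data):
    """Inverse of utf24_encode; rejects input that is not a multiple of 3 bytes."""
    if len(data) % 3:
        raise ValueError("length must be a multiple of 3 bytes")
    return "".join(
        chr(int.from_bytes(data[i:i + 3], "big"))
        for i in range(0, len(data), 3)
    )


message = "na\u00efve \U0001F600"
packed = utf24_encode(message)
print(len(message), len(packed), utf24_decode(packed) == message)
```

It round-trips, but it takes three bytes even for plain ASCII and gives
up the four-byte alignment of UTF-32, so it is hard to see where it
would beat both of the existing options.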

ChrisA

