Grapheme clusters, a.k.a. real characters
Terry Reedy
tjreedy at udel.edu
Fri Jul 14 17:12:10 EDT 2017
On 7/14/2017 10:30 AM, Michael Torrie wrote:
> On 07/14/2017 07:31 AM, Marko Rauhamaa wrote:
>> Of course, UTF-8 in a bytes object doesn't make the situation any
>> better, but does it make it any worse?
>
>>
>> As it stands, we have
>>
>> è --[encode>-- Unicode --[reencode>-- UTF-8
>>
>> Why is one encoding format better than the other?

All digital data are ultimately bits, usually collected together in
groups of 8, called bytes. The point of Python 3 is that text should
normally be instances of a text class, separate from the raw bytes
class, with a defined internal encoding. The actual internal encoding
is secondary, and it changed in 3.3.
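
To make the text/bytes separation concrete, here is a minimal sketch
(the variable names are only illustrative). A str exposes code points,
bytes appear only when one explicitly encodes, and the result of
decoding does not depend on which external encoding was used in
between.

```python
# A str holds code points; its internal storage is not visible here.
s = "\u00e8"                      # 'è' as a single code point

# Bytes only appear when we ask for a specific external encoding.
utf8 = s.encode("utf-8")          # two bytes for this character
utf16 = s.encode("utf-16-le")     # also two bytes, but different ones

len_text = len(s)                 # counts code points
len_utf8 = len(utf8)              # counts bytes

# Decoding either byte string gives back the same text object value.
same_text = utf8.decode("utf-8") == utf16.decode("utf-16-le") == s
```

None of this says how CPython stores the str internally, which is why
the 3.3 change was invisible to code written this way.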

Python ints are encoded as bytes, and so are floats and everything
else. When one prints a float, one certainly does not see a
representation of the raw bytes in the float object. Instead, one sees
a representation of the value it represents. There is a proposal to
change the internal encoding of int, at least on 64-bit machines, which
are now standard. However, because print(87987282738472387429748) prints
87987282738472387429748 and not the internal bytes, the change in the
internal bytes will not affect the user view of ints.

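
A small sketch of the same point for numbers. The byte strings below
are just two possible external encodings obtained through int.to_bytes
and struct.pack; they are not claimed to be CPython's internal layout
for int, and the names are illustrative.

```python
import struct

n = 87987282738472387429748
shown = str(n)                    # what print(n) displays

# Two different byte encodings of the same integer value.
nbytes = (n.bit_length() + 7) // 8
big = n.to_bytes(nbytes, "big")
little = n.to_bytes(nbytes, "little")

# Both decode back to the same value, so the user view is unchanged.
same_value = int.from_bytes(big, "big") == int.from_bytes(little, "little") == n

# For a float, the IEEE 754 binary64 bytes look nothing like the printed value.
x = 0.1
raw = struct.pack("<d", x)
print(x)          # 0.1
print(raw.hex())  # 9a9999999999b93f
```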
> This is precisely the logic behind Google using UTF-8 for strings in Go,
> rather than having some O(1) abstract type like Python has. And many
> other languages do the same. The argument is that because of the very
> issues that you mention, having O(1) lookup in a string isn't that
> important, since looking up a particular index in a unicode string is
> rarely the right thing to do, so UTF-8 is just fine as a native,
> in-memory type.

Does Go use bytes for text, like most people did in Python 2, or a
separate text string class that hides the internal encoding format and
implementation? In other words, if you do the equivalent of print(s)
where s is a text string with a mixture of Greek, Cyrillic, Hindi,
Chinese, Japanese, and Korean chars, do you see the characters, or some
representation of the internal bytes?
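
For comparison, this is roughly what the two possible answers look like
on the Python 3 side. It is only a sketch with a made-up sample string,
and it says nothing about what Go actually does.

```python
# Sample text mixing Greek, Cyrillic, Hindi, Chinese, Japanese, Korean.
# Written with escapes so the source file itself stays ASCII.
s = ("\u03b1\u03b2\u03b3 "        # Greek
     "\u0430\u0431\u0432 "        # Cyrillic
     "\u0939\u093f\u0928\u094d\u0926\u0940 "  # Hindi
     "\u4e2d\u6587 "              # Chinese
     "\u65e5\u672c\u8a9e "        # Japanese
     "\ud55c\uad6d\uc5b4")        # Korean

text_view = str(s)                # what print(s) shows: the characters
data = s.encode("utf-8")          # an explicit external encoding
bytes_view = repr(data)           # what print(data) shows: b'\xce\xb1...'

print(bytes_view)                 # ASCII-only, safe on any terminal
```

With a text class one gets text_view from print(s); with bytes used as
text one gets something like bytes_view unless the terminal happens to
decode the bytes the same way they were encoded.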
--
Terry Jan Reedy