Grapheme clusters, a.k.a. real characters
Terry Reedy
tjreedy at udel.edu
Sat Jul 15 02:27:22 EDT 2017
On 7/14/2017 9:20 PM, Steve D'Aprano wrote:
> On Sat, 15 Jul 2017 07:12 am, Terry Reedy wrote:
>
>> Does Go use bytes for text, like most people did in Python 2, or a separate
>> text string class that hides the internal encoding format and
>> implementation? In other words, if you do the equivalent of print(s)
>> where s is a text string with a mixture of Greek, Cyrillic, Hindi,
>> Chinese, Japanese, and Korean chars, do you see the characters, or some
>> representation of the internal bytes?
>
> The answer is, it's complicated.
>
> Go has two types for text: "strings" and "runes".
>
> Strings are equivalent to Python 3 byte-strings, except that the language is
> biased towards assuming they are UTF-8 instead of Python 3's decision to assume
> they are ASCII. In other words, if you display a Python 3 byte-string, it will
> display bytes that represent ASCII characters as ASCII, and everything else
> escaped as a hex byte:
>
> py> b'\x41\xcf\x80\x5a'
> b'A\xcf\x80Z'
>
> Go does the same, except it will display anything which is legal UTF-8 (which
> may be 1, 2, 3, or 4 bytes) as a Unicode character (actually a code point),
> assuming your environment is capable of displaying that character; otherwise
> you'll just see a square or some other artifact.
>
> So if Python used the same rules as Go, the above byte-string would display as:
>
> b'AπZ'
>
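
To make the comparison concrete, here is a minimal Python sketch of the display rule described above: treat the bytes as UTF-8 where that is legal, and mark anything else with a replacement character. The helper name show_like_go is made up for illustration, and errors="replace" is only my stand-in for whatever Go or the terminal really does with invalid bytes.

```python
def show_like_go(data: bytes) -> str:
    """Hypothetical helper: decode as UTF-8, marking invalid bytes with U+FFFD."""
    return data.decode("utf-8", errors="replace")


raw = b"\x41\xcf\x80\x5a"
text = show_like_go(raw)

print(raw)   # Python 3 escapes the two non-ASCII bytes
print(text)  # the same four bytes read as UTF-8 text
```

Whether the second line shows a Greek pi or a box still depends on the terminal and font.
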
> Most of the time, when processing strings, Go treats them as arbitrary bytes,
> although Go comes with libraries that help make it easier to work with them as
> UTF-8 byte strings.
>
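
The difference between working with bytes and working with decoded text can be sketched in Python with the same four bytes. This only illustrates the byte view versus the code point view; it does not claim to mirror any particular Go library call.

```python
raw = "A\u03c0Z".encode("utf-8")  # the four bytes 41 cf 80 5a
text = raw.decode("utf-8")        # three code points: A, pi, Z

byte_count = len(raw)             # length counted in bytes
code_point_count = len(text)      # length counted in code points
second_byte = raw[1]              # first byte of the two-byte sequence for pi
second_char = text[1]             # the whole code point U+03C0

print(byte_count, code_point_count)
print(hex(second_byte), hex(ord(second_char)))
```

Indexing the byte view lands in the middle of a character, which is why the byte length and the character count disagree as soon as anything non-ASCII appears.
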
> Runes, on the other hand, are a strict superset of Unicode code points. A rune
> is a single 32-bit integer (an alias for int32), so a slice of runes is like
> UTF-32 except not limited to the Unicode range of \U00000000 through
> \U0010FFFF. A rune will accept any 32-bit value, not just a valid code point.
>
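
For comparison, Python 3 text strings stay inside the Unicode range mentioned above. A small sketch that checks which integers chr() accepts as code points; the helper name is_python_code_point is hypothetical.

```python
import sys


def is_python_code_point(value: int) -> bool:
    """Hypothetical helper: True if chr() accepts value as a code point."""
    try:
        chr(value)
    except (ValueError, OverflowError):
        # Out-of-range integers are rejected with one of these exceptions.
        return False
    return True


print(hex(sys.maxunicode))
print(is_python_code_point(0x10FFFF))
print(is_python_code_point(0x110000))
```

So any integer above \U0010FFFF is something a Python str simply cannot hold.
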
> I presume that runes which fall within the UTF-32 range will be displayed as the
> Unicode character where possible, and those which fall outside of that range as
> some sort of hex display.
>
> So Go strings are like Python byte strings, biased towards UTF-8 but with no
> guarantees made, and Go runes are a superset of Python text strings.
>
> Does that answer your question sufficiently?
>
> https://blog.golang.org/strings
Yes, thank you.
--
Terry Jan Reedy