Grapheme clusters, a.k.a.real characters
Marko Rauhamaa
marko at pacujo.net
Sat Jul 15 03:50:54 EDT 2017
Steve D'Aprano <steve+python at pearwood.info>:
> On Sat, 15 Jul 2017 04:10 am, Marko Rauhamaa wrote:
>> Python3's strings don't give me any better random access than UTF-8.
>
> Say what? Of course they do.
>
> Python 3 strings (since 3.3) are a compact form of UTF-32. Without loss of
> generality, we can say that each string is an array of four-byte code units.
Yes, and a UTF-8 byte array gives me random access to the UTF-8
single-byte code units.
Neither gives me random access to the "Grapheme clusters, a.k.a.real
characters". For example, the HFS+ file system uses a variant of NFD for
filenames, meaning both UTF-32 and UTF-8 give you random access to
pure ASCII filenames only.
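
To make that concrete, here is a small sketch. It assumes standard NFD
(via unicodedata) as a stand-in for the HFS+ variant, and the filename is
made up for illustration:

```python
import unicodedata

# Hypothetical filename; standard NFD is used here as an approximation
# of the decomposed form HFS+ stores.
name_nfc = "caf\u00e9.txt"                          # 'é' as one code point
name_nfd = unicodedata.normalize("NFD", name_nfc)   # 'e' followed by U+0301

print(len(name_nfc))                    # code points in the composed form
print(len(name_nfd))                    # code points in the decomposed form
print(len(name_nfd.encode("utf-8")))    # UTF-8 code units (bytes)

# Indexing by code point lands on pieces of a character, not the character:
print(repr(name_nfd[3]), unicodedata.name(name_nfd[4]))
```

In the decomposed form, index 3 is a bare 'e' and index 4 is the combining
accent, so neither index gives you the "é" the user sees.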
> UTF-8 is not: it is a variable-width encoding,
UTF-32 is a variable-width encoding as well. For example, "baby: medium
skin tone" is U+1F476 U+1F3FD:
<URL: http://unicode.org/emoji/charts/full-emoji-list.html#1f476_1f3fd>
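
A quick check in Python 3 shows the same thing for that emoji (a sketch,
using only the two code points named above):

```python
# "baby: medium skin tone" = U+1F476 (baby) + U+1F3FD (skin tone modifier)
baby = "\U0001F476\U0001F3FD"

print(len(baby))                        # number of code points
print(len(baby.encode("utf-8")))        # number of UTF-8 bytes
print(len(baby.encode("utf-32-le")))    # number of UTF-32 bytes (no BOM)
print([hex(ord(c)) for c in baby])      # the individual code points
```

One visible character takes two code points, so baby[0] and baby[1] are
each only a fragment of it.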
> Go ignores this problem by simply not offering random access to code
> points in strings.
Random access to code points is as uninteresting as random access to
UTF-8 bytes.
I might want random access to the "Grapheme clusters, a.k.a.real
characters". As you have pointed out, that wish is impossible to grant
unambiguously.
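
For illustration only, here is a deliberately simplified sketch of what
such access would take. rough_clusters is a hypothetical helper, not a
standard-library function, and it only attaches combining marks to the
preceding code point. It ignores emoji modifiers, ZWJ sequences, Hangul
jamo and everything else the real segmentation rules cover:

```python
import unicodedata

def rough_clusters(text):
    """Group each code point with the combining marks that follow it.

    Simplified approximation: NOT a full grapheme cluster segmentation.
    """
    clusters = []
    for ch in text:
        # combining() returns a nonzero class for combining marks
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

s = unicodedata.normalize("NFD", "caf\u00e9")
clusters = rough_clusters(s)
print(len(s), len(clusters))        # code points vs. rough clusters
print(clusters[3] == "e\u0301")     # last cluster is 'e' plus the accent
```

Even in this toy version, indexing into clusters is only possible after a
linear pass over the whole string.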
Marko