Grapheme clusters, a.k.a. real characters
Ben Finney
ben+python at benfinney.id.au
Thu Jul 13 22:18:20 EDT 2017
Steve D'Aprano <steve+python at pearwood.info> writes:
> From time to time, people discover that Python's string algorithms work on code
> points rather than "real characters", which can lead to anomalies like the
> following:
>
> s = 'xäex'
> s = unicodedata.normalize('NFD', s)
> print(s)
> print(s[::-1])
>
>
> which results in:
>
> xäex
> xëax
> If you're interested in this issue
Note that it depends on the difference between two apparently identical
strings::
    >>> import unicodedata
    >>> s1 = 'xäex'
    >>> s2 = unicodedata.normalize('NFD', s1)
    >>> s1, s2
    ('xäex', 'xäex')
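Because the two values render the same, it can help to compare them directly and then again after normalising both to one form. This is only a minimal sketch: ``same_text`` is a hypothetical helper name of mine, and the choice of NFC as the common form is an assumption; any single normalisation form applied to both sides would do for this comparison.

```python
import unicodedata


def same_text(a, b, form="NFC"):
    """Compare two strings after normalising both to the same form.

    Hypothetical helper for illustration; 'form' is any form name
    accepted by unicodedata.normalize.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


s1 = "x\u00e4ex"                        # precomposed a-with-diaeresis
s2 = unicodedata.normalize("NFD", s1)   # 'a' followed by COMBINING DIAERESIS

print(s1 == s2)           # False: the code point sequences differ
print(same_text(s1, s2))  # True: equal once both are in the same form
```

The escapes are used in the literal only so the example does not depend on how an editor or mail client stores the character.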
The strings are different, and the items you get when iterating them are
different::
    >>> len(s1), len(s2)
    (4, 5)
    >>> [unicodedata.name(c) for c in s1]
    ['LATIN SMALL LETTER X',
     'LATIN SMALL LETTER A WITH DIAERESIS',
     'LATIN SMALL LETTER E',
     'LATIN SMALL LETTER X']
    >>> [unicodedata.name(c) for c in s2]
    ['LATIN SMALL LETTER X',
     'LATIN SMALL LETTER A',
     'COMBINING DIAERESIS',
     'LATIN SMALL LETTER E',
     'LATIN SMALL LETTER X']
which explains why they're different when reversed::
    >>> [unicodedata.name(c) for c in reversed(s1)]
    ['LATIN SMALL LETTER X',
     'LATIN SMALL LETTER E',
     'LATIN SMALL LETTER A WITH DIAERESIS',
     'LATIN SMALL LETTER X']
    >>> "".join(reversed(s1))
    'xeäx'
    >>> [unicodedata.name(c) for c in reversed(s2)]
    ['LATIN SMALL LETTER X',
     'LATIN SMALL LETTER E',
     'COMBINING DIAERESIS',
     'LATIN SMALL LETTER A',
     'LATIN SMALL LETTER X']
    >>> "".join(reversed(s2))
    'xëax'
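In the decomposed string the combining mark ends up after the "e" instead of the "a", so it attaches to the wrong letter. One rough way around that is to keep each combining mark glued to the code point before it and reverse the groups. The sketch below is an approximation, not full grapheme cluster segmentation as defined by Unicode: it relies only on ``unicodedata.combining``, and the names ``approx_clusters`` and ``reverse_clusters`` are mine, made up for illustration. It will not handle things like emoji sequences or Hangul jamo.

```python
import unicodedata


def approx_clusters(text):
    """Group each code point with the combining marks that follow it.

    Approximation only: a code point with a non-zero canonical
    combining class is appended to the previous group.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to its base
        else:
            clusters.append(ch)     # start a new group
    return clusters


def reverse_clusters(text):
    """Reverse the order of the groups, not of the code points."""
    return "".join(reversed(approx_clusters(text)))


s1 = "x\u00e4ex"
s2 = unicodedata.normalize("NFD", s1)

print(len(approx_clusters(s2)))   # 4 groups from 5 code points
print(reverse_clusters(s2) == unicodedata.normalize("NFD", "xe\u00e4x"))  # True
```

With this, the diaeresis stays on the "a" after reversal, which is what a reader would expect from reversing the "real characters".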