[Python-ideas] Add Unicode-aware str.reverse() function?
tjreedy at udel.edu
Sat Sep 8 09:38:17 EDT 2018
On 9/8/2018 7:33 AM, Paddy3118 wrote:
> I wrote a blog post
> a decade ago on extending a Rosetta Code task example
> <http://rosettacode.org/wiki/Reverse_a_string> to handle the correct
> reversal of strings with combining characters.
The problem statement gives one Latin string example: "as⃝df̅"
(combining circle between 's' and 'd') should be "f̅ds⃝a", not "̅fd⃝sa".
Note that Thunderbird combines the overbar '\u0305' with 'f' but does
*not* combine the ⃝ '\u20dd' with anything, because ⃝ does not have the
>>> import unicodedata
Firefox garbles the problem statement by putting the following char, not
the preceding char, inside the circle.
What is the 'correct reversal' of '\u0301' or '\u0301a'?
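
For reference, a minimal check of how the unicodedata module classifies the two marks in the example. The constant names are only illustrative and do not come from the blog post; the values printed are those of the Unicode Character Database shipped with Python.

```python
import unicodedata

# Illustrative names, not from the original post.
OVERBAR = "\u0305"  # COMBINING OVERLINE
CIRCLE = "\u20dd"   # COMBINING ENCLOSING CIRCLE

for ch in (OVERBAR, CIRCLE):
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),   # general category: Mn, Me, ...
        unicodedata.combining(ch),  # canonical combining class, 0 if none
    )
# U+0305 COMBINING OVERLINE Mn 230
# U+20DD COMBINING ENCLOSING CIRCLE Me 0
```

So a test based only on unicodedata.combining() sees the overbar as a combining character but not the enclosing circle.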
> On checking my blog statistics today I found that it still had a
> readership and revisited the code
> (and updated it to Python 3.6).
Your code raises IndexError on the strings above. If the intended
domain of your function is 'all Python strings' (sequences of Unicode
codepoints), it is buggy. If the intended domain is some 'properly
formed' subset of strings, then IndexError should be caught and replaced
with ValueError('string starts with combining character').
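
A minimal sketch of the second option, assuming the restricted domain is 'strings that do not start with a combining character'. The function name reverse_with_combining is hypothetical and this is not the blog's code; it only illustrates reporting the bad input as ValueError rather than letting IndexError escape.

```python
import unicodedata


def reverse_with_combining(text):
    """Reverse text, keeping each combining char after its base char.

    Hypothetical helper: 'combining' here means only that
    unicodedata.combining() returns a non-zero class.
    """
    clusters = []
    for ch in text:
        if unicodedata.combining(ch):
            if not clusters:
                # No preceding base character to attach to.
                raise ValueError('string starts with combining character')
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return ''.join(reversed(clusters))
```

With this sketch, reverse_with_combining('asdf\u0305') returns 'f\u0305dsa', while '\u0301' and '\u0301a' both raise ValueError.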
Your code uses another Latin string, Åström, as an example because
it gives the 'incorrect' answer "̅fd⃝sa" for the reverse of "as⃝df̅".
> I found that amongst the nearly 200 languages that complete the RC
> task, there were a smattering of languages that correctly handled
> reversing strings having Unicode combining characters,
Python, for one, can do so, at least for Latin chars.
> I would like to propose that Python add a Unicode-aware *str.reverse
A Python string is a sequence of Unicode codepoints. String methods
operate on the string as such. We intentionally leave higher level
methods to third parties. One reason is the problem of getting such
things 'right' for all strings. What do we do with a leading combining
char? Do combining characters always combine with the preceding char,
as your code assumes? Do all languages treat all combining characters
the same? (Pretty sure not.) Does .combining() encompass all order
dependencies that should be considered in a higher level reverse function?
(According to the page you reference, no.)
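
To make the last question concrete, here is a rough sketch that attaches any character whose general category starts with 'M' (Mn, Mc, Me) to the preceding character, instead of relying on .combining(). The names is_mark and reverse_marks are hypothetical. This is one possible policy, not a complete answer: it is not the grapheme cluster segmentation of Unicode Standard Annex #29, and it simply leaves a leading mark as its own unit.

```python
import unicodedata


def is_mark(ch):
    # Hypothetical predicate: general category Mn, Mc or Me.
    return unicodedata.category(ch).startswith("M")


def reverse_marks(text):
    """Reverse text, keeping marks after the character they follow."""
    clusters = []
    for ch in text:
        if is_mark(ch) and clusters:
            clusters[-1] += ch
        else:
            # Base character, or a leading mark with nothing before it.
            clusters.append(ch)
    return "".join(reversed(clusters))
```

Under this policy 'as\u20dddf\u0305' reverses to 'f\u0305ds\u20dda', but '\u0301a' reverses to 'a\u0301', where the accent now attaches to the 'a'. Whether that is 'right' is exactly the kind of question that is hard to settle for all strings.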
Terry Jan Reedy