[Python-ideas] Add Unicode-aware str.reverse() function?

Terry Reedy tjreedy at udel.edu
Sat Sep 8 09:38:17 EDT 2018


On 9/8/2018 7:33 AM, Paddy3118 wrote:
> I wrote a blog post 
> <http://paddy3118.blogspot.com/2009/07/case-of-disappearing-over-bar.html> 
> nearly a decade ago on extending a Rosetta Code task example 
> <http://rosettacode.org/wiki/Reverse_a_string> to handle the correct 
> reversal of strings with combining characters.

The problem statement gives one Latin string example: "as⃝df̅" 
(combining circle between 's' and 'd') should be "f̅ds⃝a", not "̅fd⃝sa". 
  Note that Thunderbird combines the overbar '\u0305' with 'f' but does 
*not* combine the ⃝ '\u20dd' with anything, because ⃝ does not have the 
'combining' property.

 >>> import unicodedata
 >>> unicodedata.combining('\u20dd')
0
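
A minimal sketch, for comparison, of what unicodedata reports for the two 
marks in the example.  combining() returns the canonical combining class, 
which is separate from the general category.  The helper name describe is 
only an illustration, not part of any library.

```python
import unicodedata

def describe(ch):
    """Return (name, general category, canonical combining class) for one code point."""
    return (unicodedata.name(ch), unicodedata.category(ch), unicodedata.combining(ch))

overline = describe('\u0305')   # the overbar used on 'f'
circle = describe('\u20dd')     # the enclosing circle used after 's'

print(overline)
print(circle)
```

The overline is a nonspacing mark (category Mn) with combining class 230, 
while the enclosing circle is an enclosing mark (category Me) with 
combining class 0, so a test based only on combining() treats the circle 
as an ordinary character.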

Firefox garbles the problem statement by putting the following char, not 
the preceding char, inside the circle.

What is the 'correct reversal' of '\u0301' or '\u0301a'?

> On checking my blog statistics today I found that it still had a 
> readership and revisited the code 
> <http://rosettacode.org/wiki/Reverse_a_string#Python:_Unicode_reversal> 
> (and updated it to Python3.6)..

Your code raises IndexError on the strings above.  If the intended 
domain of your function is 'all Python strings' (sequences of unicode 
codepoints), it is buggy.  If the intended domain is some 'properly 
formed' subset of strings, then IndexError should be caught and replaced 
with ValueError('string starts with combining character').
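
A rough sketch of the second option, under stated assumptions: this is not 
the Rosetta Code implementation, the name reverse_clusters is hypothetical, 
and a code point counts as a mark only when unicodedata.combining() is 
nonzero and is assumed to attach to the preceding code point.

```python
import unicodedata

def reverse_clusters(s):
    """Reverse s, keeping each base character together with the marks after it.

    Assumption: a 'mark' is any code point with a nonzero canonical
    combining class, and it always belongs to the preceding code point.
    """
    clusters = []
    for ch in s:
        if unicodedata.combining(ch):
            if not clusters:
                # No base character to attach to: reject instead of IndexError.
                raise ValueError('string starts with combining character')
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return ''.join(reversed(clusters))

# The overline stays on 'f'; the circle (combining class 0) is left on its own.
result = reverse_clusters('as\u20dddf\u0305')
print([hex(ord(c)) for c in result])
```

With this definition a lone combining acute accent, or one followed by 
'a', raises ValueError, and the enclosing circle still ends up separated 
from 's', which is the limitation described above.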

Your code uses another Latin string, Åström, as its example because the
code gives the 'incorrect' answer "̅fd⃝sa" for the reverse of "as⃝df̅".

> I found that amongst the nearly 200 languages that complete the RC 
> task, there were a smattering of languages that correctly handled 
> reversing strings having Unicode combining characters,

Python is one language that can do so, at least for Latin chars.

> I would like to propose that Python add a Unicode-aware *str.reverse* 
> method.

A Python string is a sequence of Unicode codepoints.  String methods 
operate on the string as such.  We intentionally leave higher level 
methods to third parties.  One reason is the problem of getting such 
things 'right' for all strings.  What do we do with a leading combining 
char?  Do combining characters always combine with the preceding char, 
as your code assumes?  Do all languages treat all combining characters 
the same?  (Pretty sure not.)  Does .combining() encompass all order 
dependencies that should be considered in a higher level reverse function? 
(According to the page you reference, no.)
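
A small sketch of both points, with the zero width joiner chosen here as 
one illustrative case and not taken from the referenced page: slicing 
reverses codepoints, and some order-dependent characters have combining 
class 0.

```python
import unicodedata

# A str is a sequence of codepoints, so slicing reverses codepoints.
s = 'df\u0305'                      # 'd', 'f', COMBINING OVERLINE
code_point_reversed = s[::-1]       # the overline now comes first
print([hex(ord(c)) for c in code_point_reversed])

# ZERO WIDTH JOINER affects how its neighbours render, yet combining()
# reports 0 for it, so a combining()-based test does not see it.
ZWJ = '\u200d'
zwj_info = (unicodedata.category(ZWJ), unicodedata.combining(ZWJ))
print(zwj_info)
```

A higher level reverse would need more than combining() to decide which 
codepoints must stay together.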


-- 
Terry Jan Reedy



