Add Unicode-aware str.reverse() function?

I wrote a blog post <http://paddy3118.blogspot.com/2009/07/case-of-disappearing-over-bar.html> nearly a decade ago on extending a Rosetta Code task example <http://rosettacode.org/wiki/Reverse_a_string> to handle the correct reversal of strings with combining characters. On checking my blog statistics today I found that it still had a readership, so I revisited the code <http://rosettacode.org/wiki/Reverse_a_string#Python:_Unicode_reversal> and updated it to Python 3.6. I found that among the nearly 200 languages that complete the RC task, there was a smattering of languages that correctly handled reversing strings containing Unicode combining characters, including Perl 6 <http://rosettacode.org/wiki/Reverse_a_string#Perl_6>, which uses flip. I would like to propose that Python add a Unicode-aware *str.reverse* method. The problem is, I'm a Brit who only speaks English and only very rarely dips into Unicode. *I don't know how useful this would be!* Cheers, Paddy.
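
For readers who have not met the problem, here is a minimal sketch of what goes wrong with plain slicing. It uses the example string from the Rosetta Code task, written with escapes so the code points are unambiguous; the variable names are only illustrative.

```python
import unicodedata

# a, s, COMBINING ENCLOSING CIRCLE, d, f, COMBINING OVERLINE
s = "as\u20dddf\u0305"

# Slicing reverses code points, not user-perceived characters,
# so each mark ends up in front of the letter it belonged to.
naive = s[::-1]

for ch in naive:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

After the naive reversal the overline is the first code point in the string, with no base character before it to attach to.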

On 08.09.2018 13:33, Paddy3118 wrote:
I've been using Unicode for quite a while and so far have never needed to reverse a string in real life. This sometimes comes up as a coding challenge and perhaps as an exercise in language classes, but I can hardly imagine a use case where we'd need a builtin method for this. -- Marc-Andre Lemburg

On 9/8/2018 7:33 AM, Paddy3118 wrote:
The problem statement gives one Latin string example: "as⃝df̅" (combining circle between 's' and 'd') should be "f̅ds⃝a", not "̅fd⃝sa". Note that Thunderbird combines the overbar '\u0305' with 'f' but does *not* combine the ⃝ '\u20dd' with anything, because ⃝ does not have the 'combining' property.
Firefox garbles the problem statement by putting the following char, not the preceding char, inside the circle. What is the 'correct reversal' of '\u0301' or '\u0301a'?
Your code raises IndexError on the strings above. If the intended domain of your function is 'all Python strings' (sequences of Unicode codepoints), it is buggy. If the intended domain is some 'properly formed' subset of strings, then IndexError should be caught and replaced with ValueError('string starts with combining character'). Your code uses another Latin string, Åström, as an example because it gives the 'incorrect' answer "̅fd⃝sa" for the reverse of "as⃝df̅".
Python is at least one language that can do so, at least for Latin chars.
> I would like to propose that Python add a Unicode-aware *str.reverse* method.
A Python string is a sequence of Unicode codepoints. String methods operate on the string as such. We intentionally leave higher-level methods to third parties. One reason is the problem of getting such things 'right' for all strings. What do we do with a leading combining char? Do combining characters always combine with the preceding char, as your code assumes? Do all languages treat all combining characters the same? (Pretty sure not.) Does .combining() encompass all order dependencies that should be considered in a higher-level reverse function? (According to the page you reference, no.) -- Terry Jan Reedy
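
The questions above can be made concrete with a short sketch. It is not the Rosetta Code implementation. The helper names is_mark and reverse_with_marks are hypothetical, and treating every code point in general category M* as attaching to the preceding character is an assumption that will not hold for all scripts. It does show the difference between unicodedata.combining() and the general category, and it raises the suggested ValueError for a leading mark.

```python
import unicodedata

def is_mark(ch):
    """True for general categories Mn, Mc and Me (an assumed definition of
    'sticks to the preceding character', not a complete one)."""
    return unicodedata.category(ch).startswith("M")

def reverse_with_marks(text):
    """Reverse text, keeping each run of marks after its base character."""
    if text and is_mark(text[0]):
        raise ValueError("string starts with combining character")
    clusters = []
    for ch in text:
        if is_mark(ch):
            clusters[-1] += ch   # attach to the preceding cluster
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))

# Canonical combining class and general category of the two marks discussed.
print(unicodedata.combining("\u0305"), unicodedata.category("\u0305"))
print(unicodedata.combining("\u20dd"), unicodedata.category("\u20dd"))
```

The enclosing circle has canonical combining class 0, so a test based only on unicodedata.combining() treats it as an ordinary character, while a test on the general category keeps it with the 's'.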

Terry Reedy wrote:
I've already mentioned Yannis Haralambous (in this thread). He's something of an expert on these matters, and also the author of Fonts & Encodings: From Advanced Typography to Unicode and Everything in Between http://shop.oreilly.com/product/9780596102425.do He's likely to know how to get things right for users for all (or at least many) languages and strings. I've let him know about this discussion. -- Jonathan

Thanks for your replies. After reading them, and although I seem to have a brain freeze at the moment and cannot think of an algorithm, I think it plausible, just in the ASCII world, for someone to want to iterate through the characters of a string in reverse order, maybe to zip with another existing iterable that would otherwise need to be reversed. If it shifted from ASCII to Unicode, then letters with their combining characters would have to be reversed as a single character, but also iterated over as a single Unicode "character": another problem! As I thought: I scratch the surface of Unicode and find a deep chasm awaits.
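
A small sketch of the ASCII-only case described here, with made-up data: reversed() already works on a str, so the characters can be paired with another iterable without building a reversed copy of either.

```python
word = "stressed"
weights = [1, 2, 3, 4, 5, 6, 7, 8]   # hypothetical second iterable

# reversed() walks the string back to front, one code point at a time.
pairs = list(zip(reversed(word), weights))
print(pairs[:3])
```

This is fine for ASCII, but reversed() steps over individual code points, so a letter and its combining marks would be split apart in exactly the way described above.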

On Sat, Sep 08, 2018 at 04:33:07AM -0700, Paddy3118 wrote:
I wouldn't care too much about a dedicated "reverse" method that handled combining characters. I think that's just a special case of iterating over graphemes. If we can iterate over graphemes, then reversing becomes trivial:
    ''.join(reversed(mystring.graphemes()))
The Unicode Consortium offers an algorithm for identifying grapheme clusters in text strings, and there are at least three requests on the tracker (one closed, two open).
https://bugs.python.org/issue30717
https://bugs.python.org/issue18406
https://bugs.python.org/issue12733
-- Steve
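
Since str has no graphemes() method today, the one-liner above can only be tried with a stand-in. The sketch below is such a stand-in under a loud assumption: it groups a code point with any following marks (general category M*) and is NOT the Unicode Consortium's segmentation algorithm, so it will mis-split things like Hangul jamo sequences, emoji joined with ZWJ, and regional-indicator flags. The function names are hypothetical.

```python
import unicodedata

def graphemes(text):
    """Yield approximate grapheme clusters: a code point plus following marks.

    Simplified stand-in only; not the full Unicode segmentation rules.
    """
    cluster = ""
    for ch in text:
        if cluster and unicodedata.category(ch).startswith("M"):
            cluster += ch            # mark joins the current cluster
        else:
            if cluster:
                yield cluster
            cluster = ch             # start a new cluster
    if cluster:
        yield cluster

def reverse_graphemes(text):
    # reversed() needs a sequence, so materialise the generator first.
    return "".join(reversed(list(graphemes(text))))

print(list(graphemes("as\u20dddf\u0305")))
```

Unlike a strict reversal, this version lets a leading mark form a cluster of its own rather than raising an error. That is one possible answer to the leading-combining-character question, not necessarily the right one.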

participants (5)
- Jonathan Fine
- M.-A. Lemburg
- Paddy3118
- Steven D'Aprano
- Terry Reedy