Fwd: Add Unicode-aware str.reverse() function?

Op za 8 sep. 2018 13:33 schreef Paddy3118 <paddy3118@gmail.com>:
To be honest, quite apart from the Unicode issue, I never had a need to reverse a string in real code. .ytilibigel edepmi ot sdnet yllareneg tI Stephan

Stephan Houben wrote:
Sometimes we have to write 'backwards' to improve legibility. Odd though that may sound. Some languages are written from left to right. Some from right to left. And some ancient writing alternates, line to line.
https://en.wikipedia.org/wiki/Right-to-left
https://www.andiamo.co.uk/resources/right-left-languages
https://en.wikipedia.org/wiki/Boustrophedon
Users of modern rendering systems, such as in modern browsers, don't have to worry about this. This is because the renderer will handle LTR and RTL switches based on the language attribute. (Always, text should be encoded in reading order.) But those implementing a bidirectional rendering system might have to worry about such things.
So what does that have to do with us, Python developers and users? According to the web: Arabic, Hebrew, Persian, and Urdu are the most widespread RTL writing systems in modern times.
To provide legible localised (translated) help messages at the interactive Python interpreter, the system somewhere will have to correctly reverse Unicode strings, either before or after processing combining characters. There are about 422 million Arabic speakers, 110 million Persian, 5 million Hebrew and 100 million Urdu. Definitely worth doing, in my opinion. Otherwise the help message will look TO THEM like this:
daer ot drah yrev si siht
instead of
this is very hard to read
-- Jonathan

On 08.09.2018 15:00, Jonathan Fine wrote:
Most likely yes, but they would not render RTL text by first switching the direction and then printing it LTR again. Please also note that switching from LTR to RTL and back again is possible within a Unicode string, so applying str.reverse() would actually make things worse and not better :-)
Processing in Unicode is always left to right, even if the resulting text may actually be rendered right to left or top to bottom. See UAX #9 for more details: http://www.unicode.org/reports/tr9/
Here's a document outlining how to render scripts which are LTR, RTL or TTB: https://www.w3.org/International/questions/qa-scripts
-- Marc-Andre Lemburg
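
To make the point about direction switches inside one string concrete, here is a minimal sketch using the standard library's unicodedata.bidirectional(). The sample string and the helper name bidi_classes are made up for illustration; this only inspects per-character bidi classes and is not an implementation of the UAX #9 algorithm.

```python
import unicodedata


def bidi_classes(text):
    """Return (character, bidi class) pairs in logical (storage) order."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


# Hypothetical sample: Latin letters, two Hebrew letters (alef, bet), Latin again.
mixed = "ab \u05d0\u05d1 cd"

# 'L' marks left-to-right characters, 'R' right-to-left, 'WS' whitespace.
classes = [cls for _, cls in bidi_classes(mixed)]
print(classes)

# A plain reversal also flips the Latin runs, which were already in the
# right order, so the mixed string ends up worse off than before.
reversed_mixed = mixed[::-1]
print(ascii(reversed_mixed))
```

The string is stored in reading order throughout; deciding which runs are drawn right to left is left to whatever renders it.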

M.-A. Lemburg wrote:
http://www.unicode.org/reports/tr9/ https://www.w3.org/International/questions/qa-scripts
Your reminder of the difficulties in Unicode, and the URLs are much appreciated. In particular, the keyword 'while' in Arabic should be written Left-To-Right, even though the ambient text is Right-To-Left. I've found these URLs, which suggest that there's still a problem to be solved.
https://www.linkedin.com/pulse/fix-rtl-right-left-support-persian-arabic-tex...
https://askubuntu.com/questions/983480/showing-text-file-content-right-to-le...
https://github.com/behdad/bicon
My understanding is that at present it's not straightforward to provide legible localised text at the Python console, when the locale language is Arabic, Persian, Hebrew or Urdu. (And another problem is to allow copy and paste of such text.) If it is straightforward to provide RTL localisation at the Python interpreter, I'd very much appreciate being pointed to such a solution.
-- Jonathan

On Sat, Sep 8, 2018 at 11:41 PM, Jonathan Fine <jfine2358@gmail.com> wrote:
Generally, problems with RTL text are *display* problems, and are not solved by reversing strings. I've hardly ever needed to reverse a string, and when it does happen, it's generally for the sake of *parsing*. You reverse a string, parse it from left to right, then reverse the result, in order to do a "parse from the right" operation. Never done that in Python, because the situations where that's necessary are (a) rare, and (b) generally suitable for a regex anyway. Does anyone have a really complex parsing job that absolutely cannot be done with a regex, and benefits from reversal? ChrisA
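
As a rough illustration of the "parse from the right" idea, the sketch below extracts the last colon-separated field three ways: by reverse/parse/reverse, with str.rpartition(), and with a regex. The function names and the sample record are invented for this example, and it assumes a single-character separator.

```python
import re


def last_field_by_reversal(line, sep=":"):
    """Reverse, parse from the left, then reverse the result."""
    head, _, _ = line[::-1].partition(sep)
    return head[::-1]


def last_field_direct(line, sep=":"):
    """Same result without any reversal, using str.rpartition()."""
    return line.rpartition(sep)[2]


def last_field_regex(line):
    """Same result with a regex anchored at the end of the string."""
    return re.search(r"[^:]*$", line).group()


record = "a:b:c"  # hypothetical input
print(last_field_by_reversal(record))
print(last_field_direct(record))
print(last_field_regex(record))
```

For simple jobs like this the right-anchored string methods (rpartition, rsplit, rfind) or an end-anchored regex make the reversal unnecessary.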

Chris Angelico wrote:
Generally, problems with RTL text are *display* problems, and are not solved by reversing strings.
I very much agree with this statement, with one exception. If you wish to display RTL text on a LTR display, then a suitable reversing of strings is probably part of the solution.
Using Google translate and a Python console I get
English: integer
Arabic: عدد صحيح # Copy and paste from Google translate into gmail
Python: عدد صحيح
The Python output shown here is obtained by
1. Copy and paste from Google translate into Python console.
2. And copied back again into gmail
Notice how it looks just the same as the direct translation. So what's the problem? In the Python console, it doesn't look right. Here's what you should get.
>>> 'عدد صحيح'
'\xd8\xb9\xd8\xaf\xd8\xaf \xd8\xb5\xd8\xad\xd9\x8a\xd8\xad'
See, the Arabic is exactly the same. But when I paste the Arabic string into the Python console, I get something that looks quite different, and sort of backwards. The problem, I think, is that the Python console is outputting the Arabic glyphs from left to right. By the way, I get the same problem in the bash shell.
-- Jonathan
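
One way to check that the stored string is intact regardless of how a console draws it is to look at code points and bytes rather than glyphs. This is a small Python 3 sketch; the escaped string is assumed to be the same Arabic phrase as in the message, and the byte string shown there appears to be its UTF-8 encoding as printed by a Python 2 console.

```python
import unicodedata

# The Arabic phrase from the message, written with escapes so that the
# source code itself does not depend on the editor's bidi rendering.
text = "\u0639\u062f\u062f \u0635\u062d\u064a\u062d"

encoded = text.encode("utf-8")
print(encoded)

# List the characters in logical (storage) order, one per line.
for ch in text:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "?"))
```

If the bytes match, the data is the same and any difference is in how the terminal lays out the glyphs.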

On Sun, Sep 9, 2018 at 6:08 AM, Jonathan Fine <jfine2358@gmail.com> wrote:
That assumes that there is such a thing as an "LTR display". I disagree. :) There are "buggy displays" and there are "flawed displays" and there are "simplistic and naive displays", any or all of which could be limited in what they're able to render (and for the record, there's nothing inherently wrong with a simplistic display); but for those, Arabic text simply won't display correctly. RTL text is just one such problem (other examples include the way that different characters affect each other - an Arabic word is not the same as the abuttal of its individual characters - and the correct wrapping of text that uses joiners and spacers), and perfect Unicode display is *hard*. Improving a rendering engine or console so it's capable of correct RTL display is outside the scope of Python code, generally. ChrisA

Chris Angelico wrote:
Improving a rendering engine or console so it's capable of correct RTL display is outside the scope of Python code, generally.
I agree with you, generally. But there are over 600 million people who speak a RTL language. About 12% of the world's population. I'd like Python's command line console to work for them. It may be worth making a special effort, and breaking a general rule, here. But we'd have to think carefully about it, and have expert help. I'm beginning to think that, as well as (instead of?) IDLE, a browser-based Python command line console might be a good idea. For example, I'm getting reasonable results from using https://brython.info/tests/editor.html?lang=en Perhaps RTL and LTR problems by themselves are not sufficient reason to make a browser-based IDLE. But they should be a significant influence. Something to think about. By the way, IDLE has the same problem. -- Jonathan

On Sun, Sep 9, 2018 at 6:55 AM, Jonathan Fine <jfine2358@gmail.com> wrote:
That's fine - but adding methods to Python won't change it. It's a console change, not a language change.
Have you tried out ipython / Jupyter? It might be what you're looking for. (I haven't tried it on this.) ChrisA

On 9/8/18 4:55 PM, Jonathan Fine wrote:
I would say that this shows that the problem isn't a need for a Unicode-aware string reverse, as that won't handle the problem (and is in some ways the easiest part of the problem). The issue is that the string is quite likely a combination of LTR and RTL codes, so you perhaps want a function to convert a Unicode string and process it so the requested glyphs are now all in LTR order (perhaps even adding the override codes to the string so that a display that DOES know how to handle RTL text knows it isn't supposed to change the order). Unicode is complicated, and one big question is how much support for its complexity should be built into the language and the basic types. Currently the support is fairly basic (mostly just for code points). It could make sense to have a Unicode package that knows a lot more of the complexity of Unicode, doing things like extracting a code point package that represents a full glyph, knowing all the combining rules, and maybe processing directional rendering like the above problem. -- Richard Damon
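
For the "easiest part" mentioned above, here is a minimal sketch of a reversal that keeps combining marks attached to their base character, using unicodedata.combining(). The function name reverse_keeping_marks is hypothetical. It is only an approximation: it does not implement full grapheme cluster segmentation (UAX #29), so joiner sequences and similar cases are not handled, and it does nothing about bidirectional text.

```python
import unicodedata


def reverse_keeping_marks(text):
    """Reverse text, keeping each combining mark after its base character."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            # Non-zero combining class: attach to the preceding cluster.
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


# 'cafe' followed by COMBINING ACUTE ACCENT (U+0301) on the final 'e'.
word = "cafe\u0301"

naive = word[::-1]                   # accent ends up first, detached from 'e'
aware = reverse_keeping_marks(word)  # accent stays on the 'e'

print(ascii(naive))
print(ascii(aware))
```

Even this small case shows why slicing with [::-1] works on code points, not on what a reader would call characters.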

Stephan Houben wrote:
To be honest, quite apart from the Unicode issue, I never had a need to reverse a string in real code.
Yeah, seems to me it would only be useful if you were working on some kind of word game such as a palindrome generator, or if your string represents something other than natural language text (in which case all the tricky Unicode stuff probably doesn't apply anyway). For such a rare requirement, maybe a module on PyPI would be a better solution than adding a string method. -- Greg

participants (6)
- Chris Angelico
- Greg Ewing
- Jonathan Fine
- M.-A. Lemburg
- Richard Damon
- Stephan Houben