On Thursday, 2 December 2010 at 13:14 -0500, Alexander Belopolsky wrote:
I don't understand why you think Arabic or Hebrew text is any different from Western text. Surely right-to-left isn't more conceptually complicated than left-to-right, is it?
No, but a mix of LTR and RTL is certainly more difficult than either of the two. I invite you to digest Unicode Standard Annex #9 before we continue this discussion.
“This annex describes specifications for the *positioning* of characters flowing from right to left” (emphasis mine)
Looks like something for implementors of rendering engines, which python-dev is not AFAICT.
The same users may want to be able to cut and paste their decimals as well. More importantly, however, legacy formats may not have support for mixed-direction text and may require that "John is 41" be stored as "41 si nhoJ". A Unicode converter would then turn it into "[RTL]John is 14", which will still display as "41 si nhoJ", but int(s[-2:]) will return 14, not 41.
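For illustration, the scenario above can be sketched as follows. This is a minimal sketch, not an actual converter: it assumes that "[RTL]" stands for U+202E (RIGHT-TO-LEFT OVERRIDE), and the names `legacy_visual` and `logical` are hypothetical.

```python
# Assumption: "[RTL]" in the message is represented by U+202E
# (RIGHT-TO-LEFT OVERRIDE). The original does not say which control
# character a real converter would emit.
RLO = "\u202e"

# What the legacy format stores, in visual (display) order.
legacy_visual = "41 si nhoJ"

# A hypothetical converter reverses the characters and prefixes the
# override so that the text still *displays* as "41 si nhoJ".
logical = RLO + legacy_visual[::-1]

# In memory the last two characters are now "14", so slicing the
# string and calling int() yields 14, not the 41 the reader sees.
value = int(logical[-2:])
print(repr(logical), value)
```

The point of the sketch is only that slicing works on the stored order of code points, not on the displayed order.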
The legacy format argument looks like a red herring to me. When converting from one format to another, it is the programmer's job to do his/her job right.
If we've got it right for Arabic, it is by chance and not by design. This still leaves us with 41 other types of digits for at least 30 different languages.
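As a rough way to see how many sets of decimal digits a given Python build knows about, one can count the characters in category Nd whose decimal value is 0. This is only a sketch: the result depends on the Unicode version bundled with the interpreter, so it need not match the figure quoted above, and `zeros` is a hypothetical name.

```python
import sys
import unicodedata

# Collect every "zero" digit: category Nd (decimal digit) with decimal
# value 0. Each one marks the start of a distinct set of ten digits.
zeros = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Nd"
    and unicodedata.decimal(chr(cp), -1) == 0
]

# The count varies with the Unicode database version in use.
print(unicodedata.unidata_version, len(zeros))
for z in zeros[:5]:
    print(f"U+{ord(z):04X}", unicodedata.name(z))
```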
So why do you trust the Unicode standard on other things and not on this one?
What other things?
Everything which the Unicode database stores and that we already rely on.
As far as I understand, the only str method that was designed to comply with Unicode recommendations was str.isidentifier().
I don't think so. str.split() and str.splitlines() are also defined in conformance to the SPEC, AFAIK. They certainly try to. And, outside of str itself, the re module tries to follow Unicode categories as well (for example, "\d" should match non-ASCII digits).
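A short check of the behaviours mentioned above, as seen in Python 3. This is a sketch of observable behaviour, not a statement about how closely the implementation tracks any particular Unicode annex; the variable names are illustrative only.

```python
import re

# ARABIC-INDIC DIGIT FOUR followed by ARABIC-INDIC DIGIT ONE.
arabic_indic = "\u0664\u0661"

# For str patterns, \d matches any Unicode decimal digit by default...
unicode_match = re.fullmatch(r"\d+", arabic_indic)
# ...unless re.ASCII restricts it to [0-9].
ascii_match = re.fullmatch(r"\d+", arabic_indic, re.ASCII)

# int() accepts the same non-ASCII decimal digits.
parsed = int(arabic_indic)

# str.split() with no argument splits on Unicode whitespace
# (here U+2003 EM SPACE).
words = "a\u2003b".split()

# str.splitlines() recognises Unicode line boundaries
# (here U+2028 LINE SEPARATOR).
lines = "a\u2028b".splitlines()

print(bool(unicode_match), ascii_match, parsed, words, lines)
```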