On Thu, Dec 2, 2010 at 8:36 AM, Antoine Pitrou firstname.lastname@example.org wrote:
> On Wed, 1 Dec 2010 22:28:49 -0500 Alexander Belopolsky email@example.com wrote:
>> This matches my limited research on this topic as well. However, I am not sure that when these codes are embedded in Arabic text, their logical order always matches their display order.
> That shouldn't matter, since unicode text follows logical order. The display order is up to the graphical representation library.
I am not so sure. On my Mac, U+200F (RIGHT-TO-LEFT MARK) affects 0-9 and Arabic-Indic decimals differently:
I replaced Arabic-Indic decimals with 0-9 in the output to demonstrate the point. Cut-n-paste does not work well in the presence of RTL directives.
and U+202E (RIGHT-TO-LEFT OVERRIDE) reverses the display order for both:
(Again, the output display is simulated, not copied.) I don't know if explicit RTL directives are ever used in Arabic texts, but it is quite possible that texts converted from older formats would use them for efficiency.
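One way to see why the two kinds of digits can behave differently next to U+200F is to look at the bidirectional classes recorded in the Unicode database. The sketch below uses only the standard unicodedata module. It inspects character properties and the logical (in-memory) order that int() sees, and says nothing about how any particular terminal will render the text.

```python
import unicodedata

ascii_digits = '12'
arabic_indic = '\u0661\u0662'  # ARABIC-INDIC DIGIT ONE, TWO in logical order

# Print the bidirectional class of each digit and of the two RTL controls.
for ch in ascii_digits + arabic_indic + '\u200f\u202e':
    print('U+%04X' % ord(ch), unicodedata.name(ch), unicodedata.bidirectional(ch))

# int() consumes code points in logical order, whatever the display order is.
assert int(ascii_digits) == int(arabic_indic) == 12
```

ASCII digits carry class EN (European Number) and Arabic-Indic digits carry AN (Arabic Number), so a renderer that follows the bidirectional algorithm is allowed to place them differently.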
Note that my point is not to find the correct answer here, but to demonstrate that we as a group don't have the expertise to get parsing of Arabic text right. If we've got it right for Arabic, it is by chance and not by design. This still leaves us with 41 other types of digits for at least 30 different languages. Nobody will ever assume that Python builtins are suitable for use with all these variants. This "feature" is only good for nefarious purposes such as hiding extra digits in innocent-looking files or smuggling binary data through naive interfaces.
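For anyone who wants to see how many digit sets the running interpreter knows about, here is a rough sketch. It assumes that each set can be identified by its zero, that is, a character with category Nd and decimal value 0. The helper name decimal_zeros is made up for this message, and the count depends on the Unicode version bundled with the interpreter.

```python
import sys
import unicodedata

def decimal_zeros():
    """Return every character with category Nd and decimal value 0."""
    zeros = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == 'Nd' and unicodedata.decimal(ch, None) == 0:
            zeros.append(ch)
    return zeros

zeros = decimal_zeros()
print(len(zeros), 'decimal digit sets in Unicode', unicodedata.unidata_version)

# List each set by the name of its zero.
for zero in zeros:
    print('U+%04X' % ord(zero), unicodedata.name(zero))
```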
PS: BTW, shouldn't int('\u0661\u0662\u06DD') be valid? Or is it int('\u06DD\u0661\u0662')?
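A quick way to check what a given interpreter does with the two candidates is sketched below. The helper try_int is hypothetical and exists only for this message. I would expect both forms containing U+06DD (ARABIC END OF AYAH) to be rejected, since that character is not a decimal digit, but the loop simply reports whatever the running interpreter does.

```python
import unicodedata

END_OF_AYAH = '\u06DD'
print(unicodedata.name(END_OF_AYAH), unicodedata.category(END_OF_AYAH))

def try_int(text):
    """Return int(text), or None if int() raises ValueError."""
    try:
        return int(text)
    except ValueError:
        return None

# Both orderings from the PS, plus the bare digits for comparison.
for candidate in ('\u0661\u0662\u06DD', '\u06DD\u0661\u0662', '\u0661\u0662'):
    print(ascii(candidate), '->', try_int(candidate))
```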