[Python-Dev] Python and the Unicode Character Database

Thu Dec 2 17:41:11 CET 2010

On Thu, Dec 2, 2010 at 8:36 AM, Antoine Pitrou <solipsis at pitrou.net> wrote:
> On Wed, 1 Dec 2010 22:28:49 -0500
> Alexander Belopolsky <alexander.belopolsky at gmail.com> wrote:
..
>> This matches my limited research on this topic as well.  However, I am
>> not sure that when these codes are embedded in Arabic text, their
>> logical order always matches their display order.
>
> That shouldn't matter, since unicode text follows logical order. The
> display order is up to the graphical representation library.
>

I am not so sure.  On my Mac, U+200F (RIGHT-TO-LEFT MARK) affects 0-9
and Arabic-Indic decimals differently:

>>> print('\u200F123')
‏123
>>> print('\u200F\u0661\u0662\u0663')
231

In the output above I replaced the Arabic-Indic digits with 0-9 to
make the display order visible.  Copy-and-paste does not work well in
the presence of RTL directives.

and U+202E (RIGHT-TO-LEFT OVERRIDE) reverses the display order for both:

>>> print('\u202E123')
321
>>> print('\u202E\u0661\u0662\u0663')
321

(Again, the output display is simulated, not copied.)  I don't know if
explicit RTL directives are ever used in Arabic texts, but it is quite
possible that texts converted from older formats would use them for
efficiency.
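
For anyone who wants to poke at this without fighting a bidi-aware
terminal, here is a rough sketch that looks at the stored (logical)
order and at the bidi classes recorded in the Unicode Character
Database.  The helper name logical_digits is made up for illustration;
the unicodedata calls are the standard ones.  It says nothing about
what a particular renderer will actually draw.

```python
import unicodedata

RLM = '\u200F'  # RIGHT-TO-LEFT MARK
RLO = '\u202E'  # RIGHT-TO-LEFT OVERRIDE
ARABIC_INDIC_123 = '\u0661\u0662\u0663'


def logical_digits(text):
    """Return the decimal digit values of text in stored (logical) order."""
    return [unicodedata.decimal(ch) for ch in text if ch.isdecimal()]


# Bidi classes from the Unicode Character Database.
print(unicodedata.bidirectional('1'))       # EN (European number)
print(unicodedata.bidirectional('\u0661'))  # AN (Arabic number)
print(unicodedata.bidirectional(RLM))       # R
print(unicodedata.bidirectional(RLO))       # RLO

# The directives affect how a renderer may draw the string,
# not the order in which the digits are stored.
print(logical_digits(RLM + '123'))             # [1, 2, 3]
print(logical_digits(RLM + ARABIC_INDIC_123))  # [1, 2, 3]
print(logical_digits(RLO + ARABIC_INDIC_123))  # [1, 2, 3]
```

The different bidi classes (EN versus AN) are at least consistent with
the two kinds of digits being laid out differently after U+200F.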

Note that my point is not to find the correct answer here, but to
demonstrate that we as a group don't have the expertise to get parsing
of Arabic text right.  If we've got it right for Arabic, it is by
chance and not by design.  This still leaves us with 41 other types of
digits for at least 30 different languages.  Nobody will ever assume
that Python builtins are suitable for use with all these variants.
This "feature" is only good for nefarious purposes such as hiding
extra digits in innocent-looking files or smuggling binary data
through naive interfaces.
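
To illustrate the concern, the sketch below shows int() accepting
digits from another script, and even a mix of scripts in one literal,
next to a stricter parser.  parse_ascii_int is a hypothetical helper
written for this message, not an existing builtin or a proposal for a
specific API.

```python
def parse_ascii_int(text):
    """Hypothetical strict parser: optional sign followed by ASCII 0-9 only."""
    stripped = text.strip()
    digits = stripped[1:] if stripped[:1] in ('+', '-') else stripped
    if not digits or not all(ch in '0123456789' for ch in digits):
        raise ValueError('not an ASCII decimal integer: %r' % (text,))
    return int(stripped)


DEVANAGARI_123 = '\u0967\u0968\u0969'  # DEVANAGARI DIGIT ONE, TWO, THREE
MIXED_123 = '1\u0662\u0969'            # ASCII, Arabic-Indic, Devanagari

print(int(DEVANAGARI_123))  # 123
print(int(MIXED_123))       # 123

print(parse_ascii_int(' -42 '))  # -42
try:
    parse_ascii_int(MIXED_123)
except ValueError as exc:
    print('rejected:', exc)
```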

PS: BTW, shouldn't int('\u0661\u0662\u06DD') be valid?  Or is it
int('\u06DD\u0661\u0662')?
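
For reference, a quick way to check what the interpreter at hand does
with U+06DD.  int_accepts is a throwaway helper made up for this
sketch.  As far as I can tell, CPython rejects both orderings because
U+06DD is a format character rather than a decimal digit; whether
that is the right behaviour is exactly the question.

```python
import unicodedata

END_OF_AYAH = '\u06DD'


def int_accepts(text):
    """Return True if int() parses text, False if it raises ValueError."""
    try:
        int(text)
    except ValueError:
        return False
    return True


print(unicodedata.name(END_OF_AYAH))      # ARABIC END OF AYAH
print(unicodedata.category(END_OF_AYAH))  # Cf (format character)

for candidate in ('\u0661\u0662' + END_OF_AYAH, END_OF_AYAH + '\u0661\u0662'):
    print(ascii(candidate), int_accepts(candidate))
```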