[Python-Dev] Python and the Unicode Character Database

Sat Dec 4 21:29:41 CET 2010

On Fri, Dec 3, 2010 at 12:10 AM, Alexander Belopolsky
<alexander.belopolsky at gmail.com> wrote:
..
> I don't think the decimal module should support non-European decimal
> digits.  The only place where it can make some sense is in int()
> because here we have a fighting chance of producing a reasonable
> definition.   The motivating use case is conversion of numerical data
> extracted from text using simple '\d+'  regex matches.
>

It turns out, this use case does not quite work in Python either:

>>> re.compile(r'\s+(\d+)\s+').match(' \u2081\u2082\u2083   ').group(1)
'₁₂₃'
>>> int(_)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'decimal' codec can't encode character '\u2081' in
position 0: invalid decimal Unicode string

This may actually be a bug in Python's regex implementation, because
the Unicode standard seems to recommend that '\d' be interpreted as
gc = Decimal_Number (Nd):

http://unicode.org/reports/tr18/#Compatibility_Properties
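
To make the distinction concrete, here is a small sketch that looks up
the relevant properties with unicodedata.  The helper name describe()
is made up for illustration.  SUBSCRIPT ONE is in category No and has
a digit value but no decimal value, while ARABIC-INDIC DIGIT ONE is in
Nd and has both, so under the TR18 reading only the latter should
match '\d':

```python
import unicodedata

def describe(ch):
    """Return (general category, digit value, decimal value) for one character.

    Hypothetical helper for illustration; None means "no such value".
    """
    return (
        unicodedata.category(ch),
        unicodedata.digit(ch, None),
        unicodedata.decimal(ch, None),
    )

print(describe('\u2081'))  # SUBSCRIPT ONE
print(describe('\u0661'))  # ARABIC-INDIC DIGIT ONE
```

The first call gives ('No', 1, None) and the second ('Nd', 1, 1).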

I actually wonder if Python's re module can claim to provide even
Basic Unicode Support.

http://unicode.org/reports/tr18/#Basic_Unicode_Support

> Here is how I would do it:
>
> 1.  String x of non-European decimal digits is only accepted in
> int(x), but not by int(x, 0) or int(x, 10).
> 2.  If x contains one or more non-European digits, then
>
>    (a)  all digits must be from the same block:
>
>      def basepoint(c):
>          return ord(c) - unicodedata.digit(c)
>      all(basepoint(c) == basepoint(x[0]) for c in x) -> True
>
>     (b) and '+' or '-' sign is not allowed.
>
> 3. A character c is a digit if it matches '\d' regex.  I think this
> means unicodedata.category(c) -> 'Nd'.
>
> Condition 2(b) is important because there is no clear way to define
> what is acceptable as '+' or '-' using Unicode character properties
> and not all number systems even support local form of negation.  (It
> is also YAGNI.)
>
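
For what it is worth, rules 2 and 3 of the quoted proposal could be
sketched roughly as below.  This is only an illustration, not what
int() does: parse_decimal_digits() is a made-up name, and for
simplicity it rejects signs for every input, whereas the proposal
only forbids them when non-European digits are present.

```python
import unicodedata

def basepoint(c):
    # Code point of the zero digit in the same run of ten digits as c.
    return ord(c) - unicodedata.digit(c)

def parse_decimal_digits(x):
    """Hypothetical helper: convert a string of Nd digits to an int.

    Assumption: signs and whitespace are always rejected here.
    """
    if not x:
        raise ValueError("empty string")
    # Rule 3 (and 2(b) as a side effect): only category Nd is accepted,
    # so '+', '-', and characters such as subscript digits are refused.
    for c in x:
        if unicodedata.category(c) != 'Nd':
            raise ValueError("not a decimal digit: %r" % c)
    # Rule 2(a): all digits must share the same basepoint.
    base = basepoint(x[0])
    if any(basepoint(c) != base for c in x):
        raise ValueError("digits from different blocks: %r" % x)
    value = 0
    for c in x:
        value = value * 10 + unicodedata.digit(c)
    return value

print(parse_decimal_digits('\u0661\u0662\u0663'))  # ARABIC-INDIC digits 1, 2, 3
```

With this sketch a string mixing '1' and '\u0662' raises ValueError,
as does the subscript string from the regex example above.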