[Python-Dev] Python and the Unicode Character Database

Mon Nov 29 20:23:28 CET 2010

On 11/29/2010 10:19 AM, M.-A. Lemburg wrote:
> Nick Coghlan wrote:
>> On Mon, Nov 29, 2010 at 9:02 PM, M.-A. Lemburg <mal at egenix.com> wrote:
>>> If we would go down that road, we would also have to disable other
>>> Unicode features based on locale, e.g. whether to apply non-ASCII
>>> case mappings, what to consider whitespace, etc.
>>>
>>> We don't do that for a good reason: Unicode is supposed to be
>>> universal and not limited to a single locale.
>>
>> Because parsing numbers is about more than just the characters used
>> for the individual digits. There are additional semantics associated
>> with digit ordering (for any number) and decimal separators and
>> exponential notation (for floating point numbers) and those vary by
>> locale. We deliberately chose to make the builtin numeric parsers
>> unaware of all of those things, and assuming that we can simply parse
>> other digits as if they were their ASCII equivalents and otherwise
>> assume a C locale seems questionable.
>
> Sure, and those additional semantics are locale dependent, even
> between ASCII-only locales. However, that does not apply to the
> basic building blocks, the decimal digits themselves.
>
>> If the existing semantics can be adequately defined, documented and
>> defended, then retaining them would be fine. However, the language
>> reference needs to define the behaviour properly so that other
>> implementations know what they need to support and what can be chalked
>> up as being just an implementation accident of CPython. (As a point in
>> the plus column, both decimal.Decimal and fractions.Fraction were able
>> to handle the '١٢٣٤.٥٦' example in a manner consistent with the int
>> and float handling)
>
> The support is built into the C API, so there's not really much
> surprise there.
>
> Regarding documentation, we'd just have to add that the digits of a
> number may be any Unicode code points in the category "Nd".
>
> See http://www.unicode.org/versions/Unicode5.2.0/ch04.pdf, section
> 4.6 for details....
>
> """
> Decimal digits form a large subcategory of numbers consisting of those digits that can be
> used to form decimal-radix numbers. They include script-specific digits, but exclude
> characters such as Roman numerals and Greek acrophonic numerals. (Note that <1, 5> = 15 =
> fifteen, but <I, V> = IV = four.) Decimal digits also exclude the compatibility subscript or
> superscript digits to prevent simplistic parsers from misinterpreting their values in context.
> """
>
> int(), float() and long() (in Python 2) are such simplistic
> parsers.

Since you are the knowledgeable advocate of the current behavior, perhaps
you could open an issue and propose a doc patch, even if not .rst formatted.
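
For reference, a minimal sketch of the behavior under discussion, assuming
Python 3 (where long() no longer exists). The helper names
is_nd_digit_string and try_int are hypothetical and only serve the
illustration; they are not part of any proposed patch.

```python
import unicodedata

# Arabic-Indic digits for 1234 and 56, written as escapes to avoid
# right-to-left rendering issues. The separator is an ASCII '.'.
ARABIC_INT = "\u0661\u0662\u0663\u0664"
ARABIC_FLOAT = ARABIC_INT + "." + "\u0665\u0666"


def is_nd_digit_string(text):
    """Hypothetical helper: True if every character is in category Nd."""
    return bool(text) and all(unicodedata.category(ch) == "Nd" for ch in text)


def try_int(text):
    """Hypothetical helper: int(text), or None if int() rejects the string."""
    try:
        return int(text)
    except ValueError:
        return None


# int() and float() accept decimal digits (category Nd) from any script.
parsed_int = int(ARABIC_INT)
parsed_float = float(ARABIC_FLOAT)

# Characters outside Nd are rejected, matching the Unicode text quoted above.
superscript_two = "\u00b2"  # SUPERSCRIPT TWO, category No
roman_four = "\u2163"       # ROMAN NUMERAL FOUR, category Nl

print(parsed_int, parsed_float)
print(try_int(superscript_two), try_int(roman_four))
```

Note that only the digits themselves are generalized here: the decimal
point is still the ASCII '.', which is consistent with the point that
separators and other locale-dependent notation are not handled.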

-- 
Terry Jan Reedy