On 11/29/2010 10:19 AM, M.-A. Lemburg wrote:
Nick Coghlan wrote:
On Mon, Nov 29, 2010 at 9:02 PM, M.-A. Lemburg <mal@egenix.com> wrote:
If we went down that road, we would also have to disable other Unicode features based on locale, e.g. whether to apply non-ASCII case mappings, what to consider whitespace, etc.
We don't do that for a good reason: Unicode is supposed to be universal and not limited to a single locale.
Because parsing numbers is about more than just the characters used for the individual digits. There are additional semantics associated with digit ordering (for any number) and decimal separators and exponential notation (for floating point numbers) and those vary by locale. We deliberately chose to make the builtin numeric parsers unaware of all of those things, and assuming that we can simply parse other digits as if they were their ASCII equivalents and otherwise assume a C locale seems questionable.
Sure, and those additional semantics are locale dependent, even between ASCII-only locales. However, that does not apply to the basic building blocks, the decimal digits themselves.
If the existing semantics can be adequately defined, documented and defended, then retaining them would be fine. However, the language reference needs to define the behaviour properly so that other implementations know what they need to support and what can be chalked up as being just an implementation accident of CPython. (As a point in the plus column, both decimal.Decimal and fractions.Fraction were able to handle the '١٢٣٤.٥٦' example in a manner consistent with the int and float handling)
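For illustration, the behaviour mentioned in the parenthesis can be sketched roughly as follows. This assumes a Python 3 interpreter and writes the Arabic-Indic digits as escapes with an ASCII '.' as the separator; the variable names are only for the example.

```python
from decimal import Decimal

# ARABIC-INDIC DIGITS ONE..SIX (U+0661..U+0666), with an ASCII '.' separator.
# Escapes are used so that bidirectional rendering cannot reorder the literal.
integer_text = "\u0661\u0662\u0663\u0664"          # reads as 1234
decimal_text = "\u0661\u0662\u0663\u0664.\u0665\u0666"  # reads as 1234.56

as_int = int(integer_text)
as_float = float(decimal_text)
as_decimal = Decimal(decimal_text)

print(as_int, as_float, as_decimal)
```

Only the digits are interpreted here; the separator is still the C-locale '.', which is the split between "digits" and "locale conventions" being argued about above.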
The support is built into the C API, so there's not really much surprise there.
Regarding documentation, we'd just have to add that numbers may be made up of Unicode code points in the category "Nd".
See http://www.unicode.org/versions/Unicode5.2.0/ch04.pdf, section 4.6 for details:
""" Decimal digits form a large subcategory of numbers consisting of those digits that can be used to form decimal-radix numbers. They include script-specific digits, but exclude characters such as Roman numerals and Greek acrophonic numerals. (Note that <1, 5> = 15 = fifteen, but <I, V> = IV = four.) Decimal digits also exclude the compatibility subscript or superscript digits to prevent simplistic parsers from misinterpreting their values in context. """
int(), float() and long() (in Python 2) are such simplistic parsers.
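A small sketch of what the "Nd" rule means in practice, assuming Python 3 and the standard unicodedata module. The helper accepted_by_int and the sample characters are made up for this example; they are not part of any API.

```python
import unicodedata


def accepted_by_int(ch):
    """Return True if int() parses the single character ch."""
    try:
        int(ch)
    except ValueError:
        return False
    return True


# One decimal digit (Nd) and two numeric characters outside Nd.
samples = {
    "ARABIC-INDIC DIGIT THREE": "\u0663",  # category Nd
    "SUPERSCRIPT TWO": "\u00b2",           # category No
    "ROMAN NUMERAL FOUR": "\u2163",        # category Nl
}

for name, ch in samples.items():
    # Print the general category next to whether int() takes the character.
    print(name, unicodedata.category(ch), accepted_by_int(ch))
```

The superscript and the Roman numeral both carry numeric values in the Unicode database, but since they are not in "Nd", int() rejects them, which matches the exclusions listed in the quoted section.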
Since you are the knowledgeable advocate of the current behavior, perhaps you could open an issue and propose a doc patch, even if not .rst formatted.