[Python-Dev] Python and the Unicode Character Database

M.-A. Lemburg mal at egenix.com
Fri Dec 3 00:01:24 CET 2010


Terry Reedy wrote:
> On 11/29/2010 10:19 AM, M.-A. Lemburg wrote:
>> Nick Coghlan wrote:
>>> On Mon, Nov 29, 2010 at 9:02 PM, M.-A. Lemburg <mal at egenix.com> wrote:
>>>> If we went down that road, we would also have to disable other
>>>> Unicode features based on locale, e.g. whether to apply non-ASCII
>>>> case mappings, what to consider whitespace, etc.
>>>>
>>>> We don't do that for a good reason: Unicode is supposed to be
>>>> universal and not limited to a single locale.
>>>
>>> Because parsing numbers is about more than just the characters used
>>> for the individual digits. There are additional semantics associated
>>> with digit ordering (for any number) and decimal separators and
>>> exponential notation (for floating point numbers) and those vary by
>>> locale. We deliberately chose to make the builtin numeric parsers
>>> unaware of all of those things, and assuming that we can simply parse
>>> other digits as if they were their ASCII equivalents and otherwise
>>> assume a C locale seems questionable.
>>
>> Sure, and those additional semantics are locale dependent, even
>> between ASCII-only locales. However, that does not apply to the
>> basic building blocks, the decimal digits themselves.
>>
>>> If the existing semantics can be adequately defined, documented and
>>> defended, then retaining them would be fine. However, the language
>>> reference needs to define the behaviour properly so that other
>>> implementations know what they need to support and what can be chalked
>>> up as being just an implementation accident of CPython. (As a point in
>>> the plus column, both decimal.Decimal and fractions.Fraction were able
>>> to handle the '١٢٣٤.٥٦' example in a manner consistent with the int
>>> and float handling)
>>
>> The support is built into the C API, so there's not really much
>> surprise there.
>>
>> Regarding documentation, we'd just have to add that numbers may
>> be made up of Unicode code points in the category "Nd".
>>
>> See http://www.unicode.org/versions/Unicode5.2.0/ch04.pdf, section
>> 4.6 for details....
>>
>> """
>> Decimal digits form a large subcategory of numbers consisting of those
>> digits that can be used to form decimal-radix numbers. They include
>> script-specific digits, but exclude characters such as Roman numerals
>> and Greek acrophonic numerals. (Note that <1, 5> = 15 = fifteen, but
>> <I, V> = IV = four.) Decimal digits also exclude the compatibility
>> subscript or superscript digits to prevent simplistic parsers from
>> misinterpreting their values in context.
>> """
>>
>> int(), float() and long() (in Python 2) are such simplistic
>> parsers.
> 
> Since you are the knowledgeable advocate of the current behavior, perhaps
> you could open an issue and propose a doc patch, even if not .rst
> formatted.

Good suggestion. I tried to collect as much context as possible:

http://bugs.python.org/issue10610

I'll leave the rst-magic to someone else, but will certainly help
if you have more questions about the details.
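
For readers who want to see the behaviour under discussion, here is a
minimal sketch, assuming Python 3 and the Unicode database shipped with
it. The helper names is_nd() and try_int() are made up for this
illustration and are not part of any API. It parses the Arabic-Indic
example from the thread and checks that superscript digits and Roman
numerals fall outside category "Nd":

```python
import unicodedata
from decimal import Decimal

# The '١٢٣٤.٥٦' example from the thread: Arabic-Indic digits, ASCII period.
ARABIC_INDIC = "\u0661\u0662\u0663\u0664.\u0665\u0666"


def is_nd(ch):
    """Return True if ch is in Unicode general category Nd (hypothetical helper)."""
    return unicodedata.category(ch) == "Nd"


def try_int(text):
    """Return int(text), or None if int() rejects the string (hypothetical helper)."""
    try:
        return int(text)
    except ValueError:
        return None


# The builtin parsers accept any Nd code point as a decimal digit.
print(int(ARABIC_INDIC[:4]))
print(float(ARABIC_INDIC))
print(Decimal(ARABIC_INDIC))

# ARABIC-INDIC DIGIT ONE is Nd; SUPERSCRIPT TWO (No) and
# ROMAN NUMERAL FOUR (Nl) are not.
print(is_nd("\u0661"), is_nd("\u00b2"), is_nd("\u2163"))

# A superscript digit is not accepted by int().
print(try_int("\u00b2"))
```

Locale-dependent parts such as decimal separators or digit grouping are
not handled here, which matches the point made above that only the
digits themselves are treated as universal.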

-- 
Marc-Andre Lemburg
eGenix.com

