[Python-Dev] Python and the Unicode Character Database

M.-A. Lemburg mal at egenix.com
Mon Nov 29 16:19:19 CET 2010


Nick Coghlan wrote:
> On Mon, Nov 29, 2010 at 9:02 PM, M.-A. Lemburg <mal at egenix.com> wrote:
>> If we would go down that road, we would also have to disable other
>> Unicode features based on locale, e.g. whether to apply non-ASCII
>> case mappings, what to consider whitespace, etc.
>>
>> We don't do that for a good reason: Unicode is supposed to be
>> universal and not limited to a single locale.
> 
> Because parsing numbers is about more than just the characters used
> for the individual digits. There are additional semantics associated
> with digit ordering (for any number) and decimal separators and
> exponential notation (for floating point numbers) and those vary by
> locale. We deliberately chose to make the builtin numeric parsers
> unaware of all of those things, and assuming that we can simply parse
> other digits as if they were their ASCII equivalents and otherwise
> assume a C locale seems questionable.

Sure, and those additional semantics are locale-dependent, even
between ASCII-only locales. However, that does not apply to the
basic building blocks, the decimal digits themselves.
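
A minimal sketch of that distinction, assuming Python 3 (the helper
name parses_as_float is made up for illustration): the digits are
accepted whatever their script, while locale-specific separators are
rejected whether or not they are ASCII.

```python
# Python 3 sketch: digits are universal, separators are not.
arabic_indic = "\u0661\u0662\u0663\u0664"   # ARABIC-INDIC DIGITS ONE..FOUR

as_int = int(arabic_indic)                         # parsed like "1234"
as_float = float(arabic_indic + ".\u0665\u0666")   # ASCII '.' as separator


def parses_as_float(text):
    """Hypothetical helper: True if float() accepts text, else False."""
    try:
        float(text)
    except ValueError:
        return False
    return True


# U+066B ARABIC DECIMAL SEPARATOR is a locale convention, not a digit.
arabic_separator_ok = parses_as_float("\u0661\u0662\u066b\u0665")
# The same holds between ASCII-only locales: a decimal comma is rejected.
decimal_comma_ok = parses_as_float("1,5")
```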

> If the existing semantics can be adequately defined, documented and
> defended, then retaining them would be fine. However, the language
> reference needs to define the behaviour properly so that other
> implementations know what they need to support and what can be chalked
> up as being just an implementation accident of CPython. (As a point in
> the plus column, both decimal.Decimal and fractions.Fraction were able
> to handle the '١٢٣٤.٥٦' example in a manner consistent with the int
> and float handling)

The support is built into the C API, so there's not really much
surprise there.

Regarding documentation, we'd just have to add that the digits of a
number may be any Unicode code points in the category "Nd".

See http://www.unicode.org/versions/Unicode5.2.0/ch04.pdf, section
4.6 for details:

"""
Decimal digits form a large subcategory of numbers consisting of those digits that can be
used to form decimal-radix numbers. They include script-specific digits, but exclude
characters such as Roman numerals and Greek acrophonic numerals. (Note that <1, 5> = 15 =
fifteen, but <I, V> = IV = four.) Decimal digits also exclude the compatibility subscript or
superscript digits to prevent simplistic parsers from misinterpreting their values in context.
"""
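
As a rough illustration of what "made up of code points in the
category Nd" means, here is a sketch using the unicodedata module.
The function decimal_digits_value is a hypothetical helper, not the
code CPython actually uses; it handles neither signs nor whitespace,
and it returns 0 for an empty string where int() would raise.

```python
import unicodedata


def decimal_digits_value(text):
    """Hypothetical helper: value of a string made only of "Nd" code points."""
    value = 0
    for ch in text:
        # Only general category "Nd" (decimal digit) is accepted here.
        if unicodedata.category(ch) != "Nd":
            raise ValueError("not a decimal digit: %r" % ch)
        # unicodedata.decimal() maps any Nd code point to 0..9.
        value = value * 10 + unicodedata.decimal(ch)
    return value


devanagari = "\u0967\u096b"   # DEVANAGARI DIGIT ONE, DIGIT FIVE
fullwidth = "\uff11\uff15"    # FULLWIDTH DIGIT ONE, DIGIT FIVE
```

For these inputs the helper and the builtin int() should agree.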

int(), float() and long() (in Python 2) are such simplistic
parsers.
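
A short sketch of the excluded cases, assuming Python 3 (int_accepts
is a made-up helper name): superscript digits and Roman numerals
carry numeric values in the database, but they are not in category
"Nd", so int() refuses them.

```python
import unicodedata

superscript_two = "\u00b2"   # SUPERSCRIPT TWO, general category "No"
roman_four = "\u2163"        # ROMAN NUMERAL FOUR, general category "Nl"


def int_accepts(text):
    """Hypothetical helper: True if int() parses text, else False."""
    try:
        int(text)
    except ValueError:
        return False
    return True


categories = [unicodedata.category(ch) for ch in (superscript_two, roman_four)]

# Both characters have a numeric value, yet neither is a decimal digit.
roman_value = unicodedata.numeric(roman_four)
superscript_is_digit = superscript_two.isdigit()       # Digit property
superscript_is_decimal = superscript_two.isdecimal()   # Nd only
```

str.isdecimal() is the check that matches the "Nd" rule; str.isdigit()
is broader and also accepts the superscript.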

-- 
Marc-Andre Lemburg
eGenix.com


