[New-bugs-announce] [issue10581] Review and document string format accepted in numeric data type constructors
report at bugs.python.org
Mon Nov 29 18:55:58 CET 2010
New submission from Alexander Belopolsky <belopolsky at users.sourceforge.net>:
I am opening a new report to continue work on the issues raised in #10557 that are either feature requests or documentation bugs.
The rest is my reply to the relevant portions of Marc's comment at msg122785.
On Mon, Nov 29, 2010 at 4:41 AM, Marc-Andre Lemburg <report at bugs.python.org> wrote:
> Alexander Belopolsky wrote:
>> Alexander Belopolsky <belopolsky at users.sourceforge.net> added the comment:
>> After a bit of svn archeology, it does appear that Arabic-Indic
>> digits' support was deliberate at least in the sense that the
>> feature was tested for when the code was first committed. See r15000.
> As I mentioned on python-dev (http://mail.python.org/pipermail/python-dev/2010-November/106077.html)
> this support was added intentionally.
>> The test migrated from file to file over the last 10 years, but it
>> is still present in test_float.py:
>> self.assertEqual(float(b" \u0663.\u0661\u0664 ".decode('raw-unicode-escape')), 3.14)
>> (It should probably be now rewritten using a string literal.)
>> For the future, I note that starting with Unicode 6.0.0,
>> the Unicode Consortium promises that
>> Characters with the property value Numeric_Type=de (Decimal) only
>> occur in contiguous ranges of 10 characters, with ascending numeric
>> values from 0 to 9 (Numeric_Value=0..9).
>> This makes it very easy to check a numeric string does not contain
>> a mix of digits from different scripts.
> I'm not sure why you'd want to check for such ranges.
In order to disallow a mix of, say, Arabic-Indic and Bengali digits. Such combinations cannot be defended as valid numbers in any script.
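As a rough illustration of how such a check could look (a sketch only, not code from the tracker or from CPython): because each decimal-digit range is contiguous and ascending, subtracting a digit's value from its code point gives the code point of that range's zero, and a string mixes scripts when it yields more than one such zero. The helper names digit_zero and has_mixed_digits are hypothetical.

```python
import unicodedata


def digit_zero(ch):
    """Return the code point of the zero of ch's decimal-digit range.

    Returns None when ch is not a decimal digit. This relies on the
    Unicode guarantee quoted above (contiguous ranges of 10 characters).
    """
    value = unicodedata.decimal(ch, None)
    if value is None:
        return None
    return ord(ch) - value


def has_mixed_digits(text):
    """Return True if text contains decimal digits from more than one range."""
    zeros = {digit_zero(ch) for ch in text}
    zeros.discard(None)  # ignore signs, decimal points, whitespace, etc.
    return len(zeros) > 1


arabic_indic = "\u0663.\u0661\u0664"   # Arabic-Indic digits 3, 1, 4
mixed = "\u0663\u09e7"                 # Arabic-Indic 3 followed by Bengali 1

print(has_mixed_digits(arabic_indic))  # False
print(has_mixed_digits(mixed))         # True
```

Note that under this sketch a string combining ASCII digits with digits from another script also counts as mixed; whether that is desirable is part of what needs to be decided.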
>> I still believe that proper API should require explicit choice of
>> language or locale before allowing digits other than 0-9 just as
>> int() would not accept hexadecimal digits without explicit choice of
>> base >= 16. But this would be a subject of a feature request.
> Since when do we require a locale or language to be specified when
> using Unicode ?
This is a valid question. I may be in the minority, but I find it convenient to use int(), float(), etc. for data validation. If my program gets a CSV file with Arabic-Indic digits, I want to fire the guy who prepared it before it causes real issues. :-) I may be too strict, but I don't think anyone would want to see columns with a mix of Bengali and Devanagari numerals.
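For what it is worth, strict validation of this kind can already be done on the caller's side. The following is a minimal sketch, assuming that "strict" means an optional sign followed by ASCII digits 0-9 only; strict_int is a hypothetical helper, not an existing or proposed API.

```python
import re

# [0-9] matches ASCII digits only, unlike \d, which also matches
# other Unicode decimal digits when applied to str patterns.
_ASCII_INT = re.compile(r"[+-]?[0-9]+")


def strict_int(text):
    """Parse text as an integer, accepting ASCII digits only."""
    stripped = text.strip()
    if not _ASCII_INT.fullmatch(stripped):
        raise ValueError(f"not an ASCII decimal integer: {text!r}")
    return int(stripped)


print(strict_int(" 42 "))           # 42
print(int("\u0663\u0661\u0664"))    # 314: int() accepts Arabic-Indic digits
try:
    strict_int("\u0663\u0661\u0664")
except ValueError as exc:
    print("rejected:", exc)
```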
On the other hand, there is a certain convenience in promiscuous parsers, but this is not an expectation that I have from int() and friends. int('0xFF') is rejected unless I specify the base explicitly, even though 0xFF is a perfectly valid notation.
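To make the int('0xFF') comparison concrete, here is a short demonstration of the existing behavior. The try_int wrapper is a hypothetical helper used only to show the failure without a traceback.

```python
def try_int(text, base=10):
    """Return int(text, base), or None if int() raises ValueError."""
    try:
        return int(text, base)
    except ValueError:
        return None


print(try_int("0xFF"))       # None: the default base 10 rejects the prefix
print(try_int("0xFF", 16))   # 255: explicit base 16 accepts the 0x prefix
print(try_int("0xFF", 0))    # 255: base 0 infers the base from the prefix
print(try_int("FF", 16))     # 255
```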
There are pros and cons in any approach. Let's figure out what is better before we fix the documentation.
> The codecs, Unicode methods and other Unicode support features
> happily work with all kinds of languages, mixed or not, without any
> such specification.
In my view, int() and friends are only marginally related to Unicode, and the design of the Unicode methods is not directly relevant to their behavior. If we were designing str.todigits(), by all means, I would argue that it must be consistent with str.isdigit(). For numeric data, however, I think we should follow the logic that rejected int('0xFF').
This is my opinion. We can consider allowing int('0xFF') as well. Let's discuss.
components: Documentation, Interpreter Core
nosy: belopolsky, eric.smith, ezio.melotti, haypo, lemburg, mark.dickinson, skrah
title: Review and document string format accepted in numeric data type constructors
type: feature request
versions: Python 3.3
Python tracker <report at bugs.python.org>