[Python-ideas] [Python-Dev] Unicode minus sign in numeric conversions
MRAB
python at mrabarnett.plus.com
Tue Jun 11 04:21:45 CEST 2013
On 11/06/2013 02:55, Stephen J. Turnbull wrote:
> Steven D'Aprano writes:
>
> > It would be a pretty awful font that made ٠ look like .
>
> Or aging eyes.
>
> > But even if it did, what is the concern here? If somebody enters a
> > mixed script number, presumably they have some reason for it.
>
> Unicode Technical Report #36 explains the concerns. Mostly that the
> reason may be nefarious. I specifically draw your attention to
> section 2.7:
>
> 2.7 Numeric Spoofs
>
> Turning away from the focus on domain names for a moment, there is
> another area where visual spoofs can be used. Many scripts have sets
> of decimal digits that are different in shape from the typical
> European digits. For example, Bengali has {০ ১ ২ ৩ ৪ ৫ ৬ ৭ ৮ ৯}, while
> Oriya has {୦ ୧ ୨ ୩ ୪ ୫ ୬ ୭ ୮ ୯}. Individual digits may have the same
> shapes as digits from other scripts, even digits of different
> values. For example, the Bengali string "৪୨" is visually confusable
> with the European digits "89", but actually has the numeric value 42!
> * If software interprets the numeric value of a string of digits without
> * detecting that the digits are from different or inappropriate scripts,
> * such spoofs can be used.
>
> Emphasis (*) added. Noting that the number 42 is the answer to Life,
> the Universe, and Everything (including this thread), I conclude we're
> done!<wink/>
>
In that case, float() and int() should accept digits from any one
script, but reject a string that mixes digits from different scripts.
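
As a rough sketch of what "not mixed scripts" could mean in practice
(strict_int is a hypothetical name, not an existing or proposed builtin):
the stdlib has no script property, so this assumes that "same script" can
be approximated by "same run of ten decimal digits", identified by the
code point of the run's zero.

```python
import unicodedata

def strict_int(text):
    """Parse an integer, rejecting digits taken from more than one digit run.

    Hypothetical helper: approximates "same script" by "same block of ten
    decimal digits", because unicodedata exposes no script property.
    """
    s = text.strip()
    digits = s[1:] if s[:1] in "+-" else s
    zeros = set()
    for ch in digits:
        value = unicodedata.decimal(ch, None)
        if value is None:
            raise ValueError(f"not a decimal digit: {ch!r}")
        # Decimal digits are encoded as contiguous runs 0..9, so subtracting
        # the digit's value gives the code point of that run's zero.
        zeros.add(ord(ch) - value)
    if len(zeros) > 1:
        raise ValueError(f"digits from more than one script: {text!r}")
    # Empty or sign-only input falls through and int() raises ValueError.
    return int(s)

# The builtin accepts the spoof quoted from UTR #36 (Bengali 4, Oriya 2):
print(int("\u09ea\u0b68"))  # 42
```

With this, strict_int("\u0664\u0662") (all Arabic-Indic) still parses,
while strict_int("\u09ea\u0b68") raises ValueError.
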
> > Postel's Law, or the Robustness Principle, supports the current
> > behaviour: "Be conservative in what you send, be liberal in what
> > you accept". str(number) is conservative, and emits only ASCII
> > digits. int(string) and float(string) are liberal, and accept any
> > valid digit as a digit.
>
> The Postel Principle may apply to Python as a whole; I believe it
> does. But not every input with a plausible interpretation needs to be
> acceptable to *builtins*. For example, as with "universal newlines"
> we could have "universal decimal points", accepting any of . , ' as
> dividing the integer part from the fractional part. This would be
> unambiguous, since Python numbers do not admit grouping characters.
> Your version of the Postel Principle suggests that this is a strong
> candidate for addition to the float() builtin. WDYT?
>
> The builtins are in any case poorly suited for input conversion, since
> they should not be localized.
>
I think that it would be best to trial it on PyPI first!
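
Such a trial package might start from something like this minimal
sketch. lenient_float is a hypothetical name, and the rule "at most one
of . , ' in the whole string" is my assumption about what "universal
decimal points" would mean, not anything specified above.

```python
def lenient_float(text):
    """Parse a float, accepting '.', ',' or "'" as the decimal separator.

    Hypothetical helper: allows at most one separator character in total,
    relying on the point that Python numbers admit no grouping characters.
    """
    s = text.strip()
    seps = [ch for ch in s if ch in ".,'"]
    if len(seps) > 1:
        raise ValueError(f"more than one decimal separator: {text!r}")
    if seps:
        # Normalise the single separator to the one float() understands.
        s = s.replace(seps[0], ".")
    return float(s)
```

Here lenient_float("3,14") and lenient_float("3'14") both give the same
result as float("3.14"), while "1,000.5" is rejected rather than guessed
at.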