[Python-ideas] [Python-Dev] Unicode minus sign in numeric conversions

MRAB python at mrabarnett.plus.com
Tue Jun 11 04:21:45 CEST 2013


On 11/06/2013 02:55, Stephen J. Turnbull wrote:
> Steven D'Aprano writes:
>
>   > It would be a pretty awful font that made ٠ look like .
>
> Or aging eyes.
>
>   > But even if it did, what is the concern here?  If somebody enters a
>   > mixed script number, presumably they have some reason for it.
>
> Unicode Technical Report #36 explains the concerns.  Mostly that the
> reason may be nefarious.  I specifically draw your attention to
> section 2.7:
>
>      2.7 Numeric Spoofs
>
>      Turning away from the focus on domain names for a moment, there is
>      another area where visual spoofs can be used. Many scripts have sets
>      of decimal digits that are different in shape from the typical
>      European digits. For example, Bengali has {০ ১ ২ ৩ ৪ ৫ ৬ ৭ ৮ ৯}, while
>      Oriya has {୦ ୧ ୨ ୩ ୪ ୫ ୬ ୭ ୮ ୯}. Individual digits may have the same
>      shapes as digits from other scripts, even digits of different
>      values. For example, the Bengali string "৪୨" is visually confusable
>      with the European digits "89", but actually has the numeric value 42!
>    * If software interprets the numeric value of a string of digits without
>    * detecting that the digits are from different or inappropriate scripts,
>    * such spoofs can be used.
>
> Emphasis (*) added.  Noting that the number 42 is the answer to Life,
> the Universe, and Everything (including this thread), I conclude we're
> done!<wink/>
>
In that case, float and int should accept different scripts, but not
mixed scripts.
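
Something along these lines, perhaps. This is only a rough sketch: strict_int is a made-up name, not an existing or proposed builtin, and it assumes that every set of Unicode decimal digits occupies ten consecutive code points, so that subtracting a digit's value from its code point identifies which set it belongs to.

```python
import unicodedata

def strict_int(text):
    """Hypothetical helper: like int(), but reject mixed-script digits."""
    zeros = set()
    for ch in text:
        if ch.isdecimal():
            # Code point of the zero digit of this character's digit set
            # (assumes each set is a contiguous run of ten code points).
            zeros.add(ord(ch) - unicodedata.decimal(ch))
    if len(zeros) > 1:
        raise ValueError("mixed-script digits in %r" % text)
    return int(text)

# The spoof quoted from UTR #36: BENGALI DIGIT FOUR + ORIYA DIGIT TWO.
spoof = "\u09ea\u0b68"
# The same number written entirely in Bengali digits.
bengali = "\u09ea\u09e8"
```

With that, int(spoof) is still happily accepted as 42, strict_int(bengali) also gives 42, and strict_int(spoof) raises ValueError.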

>   > Postel's Law, or the Robustness Principle, supports the current
>   > behaviour: "Be conservative in what you send, be liberal in what
>   > you accept". str(number) is conservative, and emits only ASCII
>   > digits. int(string) and float(string) are liberal, and accept any
>   > valid digit as a digit.
>
> The Postel Principle may apply to Python as a whole; I believe it
> does.  But not every input with a plausible interpretation needs to be
> acceptable to *builtins*.  For example, as with "universal newlines"
> we could have "universal decimal points", accepting any of . , ' as
> dividing the integer part from the fractional part.  This would be
> unambiguous, since Python numbers do not admit grouping characters.
> Your version of the Postel Principle suggests that this is a strong
> candidate for addition to the float() builtin.  WDYT?
>
> The builtins are in any case poorly suited for input conversion, since
> they should not be localized.
>
I think that it would be best to trial it on PyPI first!
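
For the sake of argument, a trial version might be no more than the following. The name parse_float and the rule "at most one separator" are my own assumptions for illustration, not anything agreed in this thread.

```python
def parse_float(text):
    """Hypothetical: accept one of . , ' as the decimal separator."""
    separators = [ch for ch in text if ch in ".,'"]
    if len(separators) > 1:
        # No grouping characters are admitted, so a second separator
        # makes the input ambiguous and it is rejected.
        raise ValueError("more than one decimal separator in %r" % text)
    for sep in ",'":
        text = text.replace(sep, ".")
    return float(text)
```

So parse_float("3,14"), parse_float("3'14") and parse_float("3.14") would all give 3.14, whereas parse_float("1,234.5") would be rejected rather than guessed at.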


