[Python-ideas] [Python-Dev] Unicode minus sign in numeric conversions

Steven D'Aprano steve at pearwood.info
Thu Jun 13 05:15:31 CEST 2013


On 11/06/13 11:55, Stephen J. Turnbull wrote:
> Steven D'Aprano writes:
>
>   > It would be a pretty awful font that made ٠ look like .
>
> Or aging eyes.
>
>   > But even if it did, what is the concern here?  If somebody enters a
>   > mixed script number, presumably they have some reason for it.
>
> Unicode Technical Report #36 explains the concerns.  Mostly that the
> reason may be nefarious.  I specifically draw your attention to
> section 2.7:

Here's the URL:

http://www.unicode.org/reports/tr36/


>      2.7 Numeric Spoofs
>
>      Turning away from the focus on domain names for a moment, there is
>      another area where visual spoofs can be used. Many scripts have sets
>      of decimal digits that are different in shape from the typical
>      European digits. For example, Bengali has {০ ১ ২ ৩ ৪ ৫ ৬ ৭ ৮ ৯}, while
>      Oriya has {୦ ୧ ୨ ୩ ୪ ୫ ୬ ୭ ୮ ୯}. Individual digits may have the same
>      shapes as digits from other scripts, even digits of different
>      values. For example, the Bengali string "৪୨" is visually confusable
>      with the European digits "89", but actually has the numeric value 42!
>    * If software interprets the numeric value of a string of digits without
>    * detecting that the digits are from different or inappropriate scripts,
>    * such spoofs can be used.
>
> Emphasis (*) added.  Noting that the number 42 is the answer to Life,
> the Universe, and Everything (including this thread), I conclude we're
> done!<wink/>


There is a vast gulf between "they look similar, and we're drawing this to your attention" and "here's an actual exploit". I'd be more impressed if they demonstrated a concrete exploit. Spoofing digits in a URL is a concrete exploit -- if you expect a URL like http://foo89.com, then someone might be able to fool you into clicking http://foo৪୨.com instead. That's a real risk, but not unique to Unicode. paypa1.com vs paypal.com, anyone?

But coming up with a relevant exploit involving int() is harder. Earlier, Alexander Belopolsky wrote about potential vandalism of Wikipedia when screen-scraping data. Presumably he had something in mind like this:

# Actual data
"Average number of eggs eaten in a month = 89"

# Vandalised data:
"Average number of eggs eaten in a month = ৪୨"

And lo and behold, the vandal has succeeded in hiding the fact of his vandalism, provided the reader happens to be relatively unobservant and has font support for Bengali digits. And then the unsuspecting Python programmer scrapes the data, calls int(), and gets the value 42 instead of 89. The vandal's dastardly plan succeeds.

I suggest that this is rather more likely:

# Vandalised data:
"Average number of eggs eaten in a month = 42"
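For reference, the int() behaviour under discussion can be checked directly. This is a minimal sketch, not something from the thread: the helper name digit_blocks is hypothetical, and it assumes that every decimal digit (category Nd) sits in a contiguous run of ten code points starting at that script's zero, which holds for the digits used here. The spoofed string is written with escapes so it does not depend on font support.

```python
import unicodedata

def digit_blocks(text):
    """Return the set of digit blocks used in text.

    Each block is identified by the code point of its zero digit,
    found by subtracting the digit's value from its code point.
    (Hypothetical helper, for illustration only.)
    """
    return {ord(ch) - unicodedata.decimal(ch) for ch in text if ch.isdecimal()}

# The string from TR36 section 2.7: BENGALI DIGIT FOUR + ORIYA DIGIT TWO
spoofed = "\u09ea\u0b68"

print(unicodedata.name(spoofed[0]))  # BENGALI DIGIT FOUR
print(unicodedata.name(spoofed[1]))  # ORIYA DIGIT TWO
print(int(spoofed))                  # 42, not 89

# More than one block means the digits come from mixed scripts.
print(len(digit_blocks(spoofed)))    # 2
print(len(digit_blocks("89")))       # 1
```

A scraper that cares about this could reject any number where digit_blocks() returns more than one block before calling int(), though as argued above, that does nothing against a vandal who simply types 42.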


There are practical exploits where the bad guy can take advantage of the visual similarity of certain digits to other digits, but they don't have anything to do with int(). The Unicode Consortium has done the right thing by mentioning this, but we can get a rough idea of the practical risk involved: there are about ten pages of discussion of various URL spoofing attacks, and six lines on numeric spoofs.



-- 
Steven

