[Python-ideas] [Python-Dev] Unicode minus sign in numeric conversions

Alexander Belopolsky alexander.belopolsky at gmail.com
Sun Jun 9 23:34:19 CEST 2013


On Sun, Jun 9, 2013 at 5:07 PM, Andrew Barnert <abarnert at yahoo.com> wrote:

> On Jun 9, 2013, at 12:35, Alexander Belopolsky <
> alexander.belopolsky at gmail.com> wrote:
>
> ..
> If you do research using numerical data published on the web, you will be
> well advised not to assume that anything that looks like a number to your
> eye can be fed to python's float().
>
>
> That's good general advice, but what's the specific advice in this case?
> You want data from a Wikipedia page, you've looked at it and verified that
> what looks like -123.45 actually is that float, even though the first
> character is a Unicode minus, so… you should write your own parser, or at
> least explicitly call x.replace('\N{MINUS SIGN}', '-') before you can feed
> x to float (or a numpy array constructor, or whatever)?
>


My specific advice would be to use a parser that rejects anything
other than well-formed numbers according to the spec for this
particular data source.  That parser should definitely reject non-ASCII
digits and possibly even reject ASCII '-', because that may be an
indication of vandalism.
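
A minimal sketch of such a parser follows. The grammar here is an
assumption made up for illustration (an optional U+2212 MINUS SIGN, ASCII
digits, an optional fractional part); a real one would follow whatever
the data source actually specifies. The names MINUS, _NUMBER and
parse_strict are hypothetical.

```python
import re

MINUS = "\N{MINUS SIGN}"  # U+2212, the sign this hypothetical source uses

# Assumed grammar: optional U+2212, ASCII digits, optional ".digits".
# [0-9] is spelled out so that non-ASCII digits never match.
_NUMBER = re.compile(MINUS + r"?[0-9]+(?:\.[0-9]+)?")


def parse_strict(text):
    """Return a float only if text matches the assumed grammar exactly."""
    if _NUMBER.fullmatch(text) is None:
        raise ValueError("not a well-formed number: %r" % (text,))
    # Only after validation is the string handed to float().
    return float(text.replace(MINUS, "-"))
```

With this grammar, parse_strict('\N{MINUS SIGN}123.45') returns -123.45,
while '-123.45', ' 12' and strings of non-ASCII digits all raise
ValueError.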

Note that Python's float() is the wrong choice for this task regardless of
what we decide to do with '\N{MINUS SIGN}', but if we make float() more
promiscuous, it becomes more likely that it will be used naively with
data scraped from web pages.
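
To illustrate how much float() already accepts, here is a small check. It
is only a sketch: the helper name float_accepts is hypothetical, and the
results in the comments describe CPython 3 as I understand it.

```python
def float_accepts(text):
    """Return True if float() parses text, False if it raises ValueError."""
    try:
        float(text)
    except ValueError:
        return False
    return True


print(float_accepts(" 1e3\n"))               # True: surrounding whitespace is ignored
print(float_accepts("nan"))                  # True: not a number in most data specs
print(float_accepts("\u0661\u0662\u0663"))   # True: Arabic-Indic digits, parsed as 123.0
print(float_accepts("\N{MINUS SIGN}123.45")) # False: U+2212 is rejected
```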