[Python-ideas] [Python-Dev] Unicode minus sign in numeric conversions

Andrew Barnert abarnert at yahoo.com
Mon Jun 10 00:15:09 CEST 2013


On Jun 9, 2013, at 14:34, Alexander Belopolsky <alexander.belopolsky at gmail.com> wrote:

> 
> On Sun, Jun 9, 2013 at 5:07 PM, Andrew Barnert <abarnert at yahoo.com> wrote:
>> On Jun 9, 2013, at 12:35, Alexander Belopolsky <alexander.belopolsky at gmail.com> wrote:
>>> ..
>>> If you do research using numerical data published on the web, you will be well advised not to assume that anything that looks like a number to your eye can be fed to python's float().
>> 
>> That's good general advice, but what's the specific advice in this case? You want data from a Wikipedia page, you've looked at it and verified that what looks like -123.45 actually is that float, even though the first character is a Unicode minus, so… you should write your own parser, or at least explicitly call x.replace('\N{MINUS SIGN}', '-') before you can feed x to float (or a numpy array constructor, or whatever)?
> 
> 
> My specific advice would be to use a parser that would reject anything other than well-formatted numbers according to the specs for this particular data source.

Seriously? That's going to be a couple of orders of magnitude slower, and much, much more complicated (and therefore buggier) than just calling float. Even if you need validation for your use case, it's a lot simpler, and faster, to validate and then call float than to parse manually.

And the obvious definition of success for this code is that it returns the same thing that validate-and-float would.
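For concreteness, a validate-then-convert helper along these lines (the pattern and the function name are illustrative, not from the thread, and the regex is deliberately simpler than a real data-source spec would be):

```python
import re

# Accept an optional ASCII hyphen-minus or U+2212 MINUS SIGN, followed by
# a simple decimal number. Illustrative only -- a real validator would
# follow the data source's actual spec.
_NUMBER = re.compile(r'[-\u2212]?\d+(?:\.\d+)?$')

def parse_number(text):
    """Validate, normalize the Unicode minus, then delegate to float()."""
    text = text.strip()
    if not _NUMBER.match(text):
        raise ValueError('not a well-formatted number: %r' % text)
    return float(text.replace('\N{MINUS SIGN}', '-'))
```

The point is that all the numeric heavy lifting still happens in float(); the hand-written part is only the (cheap) validation and the one-character normalization.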

> That parser should definitely reject non-ASCII digits and possibly even reject ASCII '-' because that may be an indication of vandalism.

Except that Wikipedia doesn't transition all at once, and never will. There are pages with each minus sign, and even pages with both minus signs. Readers, except for a few zealots, don't care about the difference. People writing scrapers, except scrapers used as tools for helping the transition, don't care either.

Do you really think that every time a Wikipedia page deviates from current (often recently-changed) standards, that's evidence of vandalism, and therefore all information on that page should be ignored?

And even sites that aren't continuously edited will have similar cases—e.g., all pages created before the flag day have one minus sign, those created after have the other—and possibly a few fall a couple of days on the wrong side of the line because they were already in processing when the changeover happened.

> Note that python float() is a wrong choice for this task regardless of what we decide to do with '\N{MINUS SIGN}',

Why? Maybe you want Decimal instead of float, but then the same arguments apply there. Otherwise, in what way is float wrong for parsing floating point string representations into numbers?
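For what it's worth, as of current Python 3 Decimal behaves exactly like float here: both reject '\N{MINUS SIGN}', and the same one-line normalization works for either. A quick illustrative check:

```python
from decimal import Decimal, InvalidOperation

s = '\N{MINUS SIGN}123.45'   # u'−123.45', Unicode minus sign

# Both constructors currently reject the Unicode minus.
for convert in (float, Decimal):
    try:
        convert(s)
    except (ValueError, InvalidOperation):
        print(convert.__name__, 'rejects the Unicode minus')

# Normalizing first works for both:
print(Decimal(s.replace('\N{MINUS SIGN}', '-')))   # -123.45
```

So whichever numeric type fits the use case, the argument is the same: the workaround is identical, and so would be the benefit of float()/Decimal() accepting the Unicode minus directly.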

> but if we make float() more promiscuous, it will become more likely that it will be used naively with data scrubbed from web pages.

Which makes it more likely that people will write those programs, which work, instead of failing to write anything.

It's like arguing that BeautifulSoup is bad because it allows you to write HTML scraping code without understanding and dealing with the total HTML structure. It's not bad—besides allowing novices to write scraping code at all, it also allows experienced developers to write scraping code with less effort and fewer bugs.