[Python-ideas] [Python-Dev] Unicode minus sign in numeric conversions

Wed Jun 12 07:30:55 CEST 2013

From: Stephen J. Turnbull <stephen at xemacs.org>

Sent: Tuesday, June 11, 2013 9:04 PM

> Andrew Barnert writes:

You're replying out of context. I realize the thread is long and twisty, so let me summarize so you (and others) don't have to delve through all of it:

MRAB suggested that maybe int and friends shouldn't do transliteration at all; it would be better to have a new "translate_number" function ("somewhere", not necessarily in builtins, but presumably in the stdlib). After a few issues were raised, he suggested that maybe this should be moved to "a form of encoding/decoding, though locale-specific forms. You'd decode on input and encode on output."

In my email that you replied to, I was agreeing with that idea, but pointing out some examples that seem underspecified, and may be hard to specify.

>>  If 二万三十, 2万3十, and 20030 all decode to 20030 from Japanese
> 
> Only the third does in a non-locale-specific way.  The characters for
> "man" and "juu" have numeric values, but not decimal ones.

I brought up Japanese as one of my original examples specifically because it has common numeric forms that aren't decimal. That's part of the reason MRAB suggested that this is locale-specific functionality, not builtin functionality, which I agreed with.
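
For anyone who wants to see the numeric-vs-decimal distinction concretely, here is a minimal sketch using the stdlib unicodedata module. The describe() helper is just a name I made up for illustration; the values come from whatever Unicode database your Python ships with.

```python
import unicodedata

def describe(ch):
    """Return (decimal value or None, numeric value or None) for one character."""
    return (unicodedata.decimal(ch, None), unicodedata.numeric(ch, None))

# ASCII '3' has both properties; the kanji only have a numeric value,
# so they are not usable as digits in a positional decimal string.
for ch in "3三十万":
    print(ch, describe(ch))
```

This is why int() can accept a string of decimal digits from any script but has no obvious way to handle 十 or 万 without knowing the rules of the writing system.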

The point of reusing this example is that there are three _different_ such forms in one locale. Locale-decoding all of them to '20030' is easy, but locale-encoding '20030' is then a problem. Is there a relevant standard which says which of the three forms is canonical? If not, how do we decide what the encode function does? The same is true for the other examples I raised (how do you encode scientific format to Oriya, etc.).
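
To make the asymmetry concrete, here is a rough sketch of the decode direction. Everything in it (decode_ja_number, the tables, the grouping rules) is my own assumption for illustration, not an existing API or a standard, and it ignores plenty of real-world forms.

```python
import unicodedata

KANJI_DIGITS = {"〇": 0, "一": 1, "二": 2, "三": 3, "四": 4,
                "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
LARGE_UNITS = {"万": 10**4, "億": 10**8}

def decode_ja_number(text):
    """Hypothetical decoder for a few Japanese integer forms (sketch only)."""
    total = 0       # value of groups already closed by 万 or 億
    section = 0     # value accumulated inside the current group
    pending = None  # digits read but not yet multiplied by a unit
    for ch in text:
        if ch in SMALL_UNITS:
            # A bare unit such as 十 means 1 * 10.
            section += (1 if pending is None else pending) * SMALL_UNITS[ch]
            pending = None
        elif ch in LARGE_UNITS:
            section += pending or 0
            total += (section or 1) * LARGE_UNITS[ch]
            section, pending = 0, None
        else:
            # Kanji digit, or any Unicode decimal digit; ValueError otherwise.
            digit = KANJI_DIGITS[ch] if ch in KANJI_DIGITS else unicodedata.decimal(ch)
            pending = (pending or 0) * 10 + digit
    return total + section + (pending or 0)

forms = ["二万三十", "2万3十", "20030"]
print({form: decode_ja_number(form) for form in forms})
```

Decoding is many-to-one, so nothing in this table-driven approach says which of the three strings an encode function should produce for 20030. That choice has to come from somewhere outside the code.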

>> In some cases, the answer is probably just "don't do that".
> 
> I think (for builtins) that's the answer for all non-ASCII cases.
> <0.5 wink/>

In case it's not obvious: The point of adding locale-specific encode/decode functions is that builtins like int and float can then just deal with the traditional ASCII cases.
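
As a sketch of that division of labor for the case in the subject line: the builtin rejects U+2212 MINUS SIGN, and a separate helper does the normalization before calling it. The names int_with_unicode_minus and builtin_accepts are hypothetical, not anything in the stdlib.

```python
UNICODE_MINUS = "\u2212"  # MINUS SIGN, distinct from ASCII HYPHEN-MINUS

def int_with_unicode_minus(text):
    """Hypothetical helper: map U+2212 to '-' and defer to the builtin."""
    return int(text.replace(UNICODE_MINUS, "-"))

def builtin_accepts(text):
    """Report whether plain int() parses the string."""
    try:
        int(text)
    except ValueError:
        return False
    return True

print(builtin_accepts("-5"))                        # ASCII sign
print(builtin_accepts(UNICODE_MINUS + "5"))         # Unicode minus sign
print(int_with_unicode_minus(UNICODE_MINUS + "5"))  # normalized first
```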

My question is whether those locale-specific functions are well-specified. If they are, I'm +1 on adding them and keeping the builtins minimal. If they aren't, I'm -1 on trying to invent a brand-new specification for numeric representations as part of the Python stdlib.