[Tutor] three numbers for one

Steven D'Aprano steve at pearwood.info
Tue Jun 11 01:38:58 CEST 2013


On 10/06/13 22:55, Oscar Benjamin wrote:

> Yes, but I thought it was using a different unambiguous and easier
> (for me) to understand definition of decimal digits.

It's no easier. You have a list (in your head) of characters that are decimal digits. In your head, you have ten of them, because you are an English speaker and probably have never learned any language that uses other digits, or used DOS codepages or Windows charsets with alternate versions of digits (such as the East Asian full width and narrow width forms). You probably *have* used charsets like Latin-1 containing ¹²³ but probably not often enough to think about them as potentially digits. But either way, in your head you have a list of decimal digits.

Python also has a list of decimal digits, except it is longer.

(Strictly speaking, it probably doesn't keep an explicit list "these chars are digits" in memory, but possibly looks them up in a Unicode property database as needed. But then, who knows how memories and facts are stored in the human brain? Strictly speaking, there's probably no list in your head either.)
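
If you want to see what Python consults instead of a hard-coded list of ten characters, the unicodedata module exposes the relevant property. A quick illustration (the three characters below are just examples I've picked):

import unicodedata

for ch in ('7', '٧', '７'):    # ASCII, Arabic-Indic and fullwidth "seven"
    print(ch, unicodedata.name(ch), unicodedata.decimal(ch))

# 7 DIGIT SEVEN 7
# ٧ ARABIC-INDIC DIGIT SEVEN 7
# ７ FULLWIDTH DIGIT SEVEN 7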


>  I guess that I'm
> just coming to realise exactly what Python 3's unicode support really
> means and in many cases it means that the interpreter is doing things
> that I don't want or need.

With respect, it's not that you don't want or need them, but that you don't *know* that you actually do want and need them. (I assume you are releasing software for others to use. If all your software is private, for your own use and nobody else, then you may not care.) If your software accepts numeric strings from the user -- perhaps it reads a file, perhaps it does something like this:

number = int(input("Please enter a number: "))

-- you want it to do the right thing when the user enters a number. Thanks to the Internet, your program is available to people all over the world, and in probably half the world those digits are not necessarily ASCII 0-9. Somebody downloads your app in Japan, points it at a data file containing fullwidth or halfwidth digits, and in Python 3 it just works. (Provided, of course, that you don't sabotage its ability to do so with inappropriate ASCII-only data validation.)
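
To make that concrete (the fullwidth string here is just an example of the kind of input a Japanese user might produce):

print(int('12345'))       # ASCII digits     -> 12345
print(int('１２３４５'))  # fullwidth digits -> 12345, with no extra work on your part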


> For example I very often pipe streams of ascii numeric text from one
> program to another.

No you don't. Never. Not once in the history of computers has anyone ever piped streams of text from one program to another. They just *think* they have.

They pipe *bytes* from one program to another.
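
In Python 3 terms the distinction is explicit: the raw end of a pipe is a byte stream, and text only exists once your program decodes those bytes. A minimal sketch, assuming the other program sends UTF-8:

import sys

raw = sys.stdin.buffer.read()    # what actually travels down the pipe: bytes
text = raw.decode('utf-8')       # "text" only exists after you decode
print(type(raw), type(text))     # <class 'bytes'> <class 'str'>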



> In some cases the cost of converting to/from
> decimal is actually significant and Python 3 will add to this both
> with a more complex conversion

Let your mind be at rest on this account. Python 3.3's int() is nearly twice as fast as Python 2.7's for short strings:

[steve at ando ~]$ python2.7 -m timeit "int('12345')"
1000000 loops, best of 3: 0.924 usec per loop
[steve at ando ~]$ python3.3 -m timeit "int('12345')"
1000000 loops, best of 3: 0.485 usec per loop


and about 30% faster for long strings:

[steve at ando ~]$ python2.7 -m timeit "int('1234567890'*5)"
100000 loops, best of 3: 2.06 usec per loop
[steve at ando ~]$ python3.3 -m timeit "int('1234567890'*5)"
1000000 loops, best of 3: 1.45 usec per loop


It's a little slower when converting the other way:

[steve at ando ~]$ python2.7 -m timeit "str(12345)"
1000000 loops, best of 3: 0.333 usec per loop
[steve at ando ~]$ python3.3 -m timeit "str(12345)"
1000000 loops, best of 3: 0.5 usec per loop

but for big numbers, the difference is negligible:

[steve at ando ~]$ python2.7 -m timeit -s "n=1234567890**5" "str(n)"
1000000 loops, best of 3: 1.12 usec per loop
[steve at ando ~]$ python3.3 -m timeit -s "n=1234567890**5" "str(n)"
1000000 loops, best of 3: 1.16 usec per loop

and in any case, the time taken to convert to a string is trivial.


> and with its encoding/decoding part of
> the io stack. I'm wondering whether I should really just be using
> binary mode for this kind of thing in Python 3 since this at least
> removes an unnecessary part of the stack.

I'm thinking that you're engaging in premature optimization. Have you profiled your code to confirm that the bottlenecks are where you think they are?
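
If you do decide binary mode is worth it, note that Python 3's int() will happily accept ASCII digits in a bytes object, so a binary pipeline can skip the decode step altogether. A sketch, assuming one number per line on stdin:

import sys

total = 0
for line in sys.stdin.buffer:    # binary mode: no encoding/decoding layer
    total += int(line)           # int() accepts ASCII digits in bytes as-is
print(total)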


> In a previous thread where I moaned about the behaviour of the int()
> function Eryksun suggested that it would be better if int() wasn't used
> for parsing strings at all. Since then I've thought about that and I
> agree. There should be separate functions for each kind of string to
> number conversion with one just for ascii decimal only.

I think that is a terrible, terrible idea. It moves responsibility for something absolutely trivial ("convert a string to a number") from the language to the programmer, *who will get it wrong*.

# The right way:
number = int(string)


# The wrong, horrible, terrible way (and buggy too):
try:
     number = ascii_int(string)
except ValueError:
     try:
         number = fullwidth_int(string)
     except ValueError:
         try:
             number = halfwidth_int(string)
         except ValueError:
             try:
                 number = thai_int(string)
             except ...
             # and so on, for a dozen or so other scripts...
             # oh gods, think of the indentation!!!

             except ValueError:
                  # Maybe it's a mixed script number?
                  # Fall back to char by char conversion.
                  n = 0
                  for c in string:
                      if c in ascii_digits:
                          n = n*10 + ord(c) - ord('0')
                      elif c in fullwidth_digits:
                          n = n*10 + ord(c) - ord('\N{FULLWIDTH DIGIT ZERO}')
                      elif ... # and so forth
                      else:
                          raise ValueError


Of course, there are less stupid ways to do this. But you don't have to, because it already works.
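
To see how much of that ladder int() already covers for you (the strings below are just a few examples of scripts that have decimal digits):

# Each of these prints 123 in Python 3:
print(int('123'))       # ASCII
print(int('１２３'))    # fullwidth
print(int('١٢٣'))       # Arabic-Indic
print(int('๑๒๓'))       # Thai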



>>> An alternative method depending on where your strings are actually
>>> coming from would be to use byte-strings or the ascii codec. I may
>>> consider doing this in future; in my own applications if I pass a
>>> non-ascii digit to int() then I definitely have data corruption.
>>
>> It's not up to built-ins like int() to protect you from data corruption.
>> Would you consider it reasonable for me to say "in my own applications, if I
>> pass a number bigger than 100, I definitely have data corruption, therefore
>> int() should not support numbers bigger than 100"?
>
> I expect the int() function to reject invalid input.

It does. What makes you think it doesn't?
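
A few quick demonstrations of what it rejects:

for s in ('123abc', '12.5', 'Ⅻ'):    # letters, a float literal, a Roman numeral
    try:
        int(s)
    except ValueError as err:
        print(err)                    # every one of these raises ValueError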


> I thought that its definition of invalid matched up with my own.

If your definition is something other than "a string containing non-digits, apart from a leading plus or minus sign and optional surrounding whitespace", then it is your definition that is wrong.



-- 
Steven

