[Tutor] three numbers for one

Oscar Benjamin oscar.j.benjamin at gmail.com
Tue Jun 11 16:21:57 CEST 2013


On 11 June 2013 00:38, Steven D'Aprano <steve at pearwood.info> wrote:
> On 10/06/13 22:55, Oscar Benjamin wrote:
>
> With respect, it's not that you don't want or need them, but that you don't
> *know* that you actually do want and need them. (I assume you are releasing
> software for others to use. If all your software is private, for your own
> use and nobody else, then you may not care.)

Not all of it, but the bulk of my software is intended to be used only
by me, which obviously affects my attitude towards it in a number of ways.

> If your software accepts
> numeric strings from the user -- perhaps it reads a file, perhaps it does
> something like this:
>
> number = int(input("Please enter a number: "))
>
> -- you want it to do the right thing when the user enters a number. Thanks
> to the Internet, your program is available to people all over the world.
> Well, in probably half the world, those digits are not necessarily the same
> as ASCII 0-9. Somebody downloads your app in Japan, points it at a data file
> containing fullwidth or halfwidth digits, and in Python 3 it just works.
> (Provided, of course, that you don't sabotage its ability to do so with
> inappropriate decimal only data validation.)

What exactly are these? I tried looking for the HALFWIDTH DIGIT ZERO
that you mentioned but I can't find it:

>>> '\N{HALFWIDTH DIGIT ZERO}'
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes
in position 0-23: unknown Unicode character name

I had thought that Japanese people just used Arabic numerals. I just
looked at Japanese numerals on Wikipedia and found that there is also
an alternative system, but it is not strictly a positional base-10
scheme (somewhat like Roman numerals):
http://en.wikipedia.org/wiki/Japanese_numerals
http://en.wikipedia.org/wiki/List_of_CJK_Unified_Ideographs,_part_1_of_4

The int() function rejects these characters:

O:\>py -3.3
Python 3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600
32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> int('\u4e00')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '\u4e00'

(Also, the Japanese numerals page shows a character 京 (kei) with a
bigger numeric value than the 兆 (chō) character that Eryksun referred
to earlier.)
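
Looking a bit harder, I can find FULLWIDTH DIGIT ZERO (U+FF10) and, as
far as I can tell, the "halfwidth" digits are simply the ordinary
ASCII 0-9, which would explain why there is no HALFWIDTH DIGIT ZERO.
Python 3's int() does accept the fullwidth forms:

>>> import unicodedata
>>> unicodedata.name('\uff10')
'FULLWIDTH DIGIT ZERO'
>>> int('\uff11\uff12\uff13')
123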

>> For example I very often pipe streams of ascii numeric text from one
>> program to another.
[snip]
>> In some cases the cost of converting to/from
>> decimal is actually significant and Python 3 will add to this both
>> with a more complex conversion
>
> Let your mind be at rest on this account. Python 3.3 int() is nearly twice
> as fast as Python 2.7 for short strings:
>
> [steve at ando ~]$ python2.7 -m timeit "int('12345')"
> 1000000 loops, best of 3: 0.924 usec per loop
> [steve at ando ~]$ python3.3 -m timeit "int('12345')"
> 1000000 loops, best of 3: 0.485 usec per loop

This is not really the appropriate test for what I was talking about.

[snip]
>
> and in any case, the time taken to convert to a string is trivial.
>
>> and with its encoding/decoding part of
>> the io stack. I'm wondering whether I should really just be using
>> binary mode for this kind of thing in Python 3 since this at least
>> removes an unnecessary part of the stack.
>
> I'm thinking that you're engaging in premature optimization. Have you
> profiled your code to confirm that the bottlenecks are where you think they
> are?

No, I haven't, since I'm not properly using Python 3 yet. However, I
have in the past profiled slow scripts running with Python 2 and found
that in some cases binary/decimal conversion for input/output seemed
to be a significant part of the time cost. The standard fix that I use
(if it's worth it) is to read and write in binary using
numpy.ndarray.tofile and numpy.fromfile. These are raw read/write
operations to/from a block of memory using the OS file descriptor and
are a lot faster. For small integers this can actually increase the
total number of bytes transferred yet still be significantly faster. I
assume this is because it cuts out binary/decimal conversion and
bypasses the Python io stack.
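
As a rough sketch of the kind of thing I mean (the filename and array
contents here are just for illustration):

import numpy as np

a = np.arange(1000000, dtype=np.int64)

# Raw write: the bytes of the array go straight to the file with no
# decimal formatting involved.
a.tofile('data.bin')

# Raw read: the bytes come straight back into a new array.
b = np.fromfile('data.bin', dtype=np.int64)
assert (a == b).all()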

To give a more appropriate measure of what I mean (on Windows XP):

enojb at ENM-OB:/o$ cat gen.py
#!/usr/bin/env python
from __future__ import print_function
import sys

# For a fair comparison:
try: from itertools import imap as map
except ImportError: pass
try: range = xrange
except NameError: pass

numlines = int(sys.argv[1])

# Line n holds the first nine multiples of n, space-separated.
for n in range(1, numlines + 1):
    print(' '.join(map(str, range(n, 10*n, n))))
enojb at ENM-OB:/o$ time py -2.7 gen.py 300000 > dump

real    0m6.860s
user    0m0.015s
sys     0m0.015s
enojb at ENM-OB:/o$ time py -2.7 gen.py 300000 > dump

real    0m6.891s
user    0m0.015s
sys     0m0.000s
enojb at ENM-OB:/o$ time py -3.2 gen.py 300000 > dump

real    0m8.016s
user    0m0.015s
sys     0m0.031s
enojb at ENM-OB:/o$ time py -3.2 gen.py 300000 > dump

real    0m7.953s
user    0m0.015s
sys     0m0.000s
enojb at ENM-OB:/o$ time py -3.3 gen.py 300000 > dump

real    0m9.109s
user    0m0.015s
sys     0m0.015s
enojb at ENM-OB:/o$ time py -3.3 gen.py 300000 > dump

real    0m9.063s
user    0m0.015s
sys     0m0.015s

So Python 3.3 is about 30% slower than Python 2.7 in this benchmark.
That's not a show-stopper, but it's something to think about in a
long-running script. I can recover the Python 2.7 performance by
manually encoding as ASCII and writing directly to sys.stdout.buffer:

enojb at ENM-OB:/o$ cat genb.py
#!/usr/bin/env python
import sys

numlines = int(sys.argv[1])

# Build each line as bytes and write it straight to the underlying
# binary stream (b'\r\n' matches the Windows line endings that gen.py
# produces through the text-mode stdout).
for n in range(1, numlines + 1):
    line = ' '.join(map(str, range(n, 10*n, n))).encode() + b'\r\n'
    sys.stdout.buffer.write(line)
enojb at ENM-OB:/o$ time py -3.3 genb.py 300000 > dumpb

real    0m6.829s
user    0m0.031s
sys     0m0.000s
enojb at ENM-OB:/o$ time py -3.3 genb.py 300000 > dumpb

real    0m6.890s
user    0m0.031s
sys     0m0.000s
enojb at ENM-OB:/o$ diff -qs dump dumpb
Files dump and dumpb are identical


>> In a previous thread where I moaned about the behaviour of the int()
>> function Eryksun suggested that it would be better if int() wasn't used
>> for parsing strings at all. Since then I've thought about that and I
>> agree. There should be separate functions for each kind of string to
>> number conversion with one just for ascii decimal only.
>
> I think that is a terrible, terrible idea. It moves responsibility for
> something absolutely trivial ("convert a string to a number") from the
> language to the programmer, *who will get it wrong*.

That's the point though. Converting a piece of text to a corresponding
number is *not* trivial and there is no unique way that it should
work. So I think that there should be appropriately named functions
that perform the different types of conversion: int.fromdecimal,
int.fromhex, int.fromasciidigits (or something like that). I don't
think that int() should be used to convert strings at all, and if it
is then it should only be to invert str(a) where a is an integer.
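
Something like this, for the last of those (just a rough sketch to
show what I mean; a real int.fromasciidigits would presumably live in
the core):

import re

# Accept an optional sign followed by ASCII digits only; \d in a str
# pattern would also match e.g. the fullwidth digits, so [0-9] is used.
_ASCII_INT = re.compile(r'[+-]?[0-9]+\Z')

def int_fromasciidigits(s):
    if not _ASCII_INT.match(s):
        raise ValueError('not an ASCII decimal integer: %r' % s)
    return int(s)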


Oscar

