Re: Memory size overflows

I didn't find any way to improve the actual overflow check, although if you entirely replace the "fast path" check with checks involving unsigned masking, you get some performance improvement. For a wide variety of input patterns, I get about an 18% speedup versus the core long multiply code, when modified as shown below: #define UL_LO_HI_BIT (((unsigned long)1) << (sizeof(unsigned long) * 4U)) #define UL_LO_MASK ((UL_LO_HI_BIT) - 1) #define UL_HI_MASK (~(UL_LO_MASK)) #define UL_HI_LO_BIT (((unsigned long)1) << ((sizeof(unsigned long) * 4U)-1)) #define UL_OVERFLOW_IMPOSSIBLE_MASK ((UL_HI_LO_BIT) - 1) #define UL_OVERFLOW_POSSIBLE_MASK (~(UL_OVERFLOW_IMPOSSIBLE_MASK)) long core_int_mul(long a, long b) { long longprod = a * b; unsigned long ma = a & UL_HI_MASK; if (ma == ((a < 0) ? UL_HI_MASK : 0)) { unsigned long mb = b & UL_OVERFLOW_POSSIBLE_MASK; if (mb == ((b < 0) ? UL_OVERFLOW_POSSIBLE_MASK : 0)) { return longprod; } } { double doubleprod = (double)a * (double)b; double doubled_longprod = (double)longprod; double diff = doubled_longprod - doubleprod; double absdiff = (diff >= 0.0) ? diff : -diff; double absprod = (doubleprod >= 0.0) ? doubleprod : -doubleprod; /* absdiff/absprod <= 1/32 iff 32 * absdiff <= absprod -- 5 good bits is "close enough" */ if (32.0 * absdiff <= absprod) { return longprod; } else { SIGNAL_AN_ERROR; } } } This version suffers no apparent degradation versus the existing implementation when fed sets of multiplicands evenly distributed over range(-sys.maxint, sys.maxint), and almost always shows an improvement. Shall I submit a patch? -Jerry

[Gerald S. Williams]
Which platform? Which compiler? What was your test driver? Was this timing the mult code in isolation, or timing Python-level multiplies? Claims of small speedups are notoriously platform- and test-dependent. If it's a mixed bag across platforms, the risk of introducing a new bug would favor leaving things alone. In the absence of a clear correctness proof, a Python simulation program demonstrating correctness exhaustively in small bases would also be helpful.
... Shall I submit a patch?
Sure, but also submit your timing harness so that people can measure the effects cross-platform and cross-compiler.

I wrote:
Shall I submit a patch?
Tim Peters wrote:
Sure, but also submit your timing harness so that people can measure the effects cross-platform and cross-compiler.
Of course. I first wanted to see if there was any interest. Integrated into Python, the speedup was much less: only 0.1% overall for this million-multiply script: import time start = time.time() a = -1000 while a <= 1000: b = 1000 while b <= 1000: b += 1 a += 1 end = time.time() base = end - start start = time.time() a = -1000 while a <= 1000: b = -1000 while b <= 1000: b += 1 c = a * b a += 1 end = time.time() print end - start - base The difference in speedup is an interesting data point on Python overhead in general (it took just over 6 seconds to do 1,000,000 multiplies on my system).
I doubt you're still interested, but I used Cygwin on a P3/833 Thinkpad running Win2K (GCC compiler). My original results were on just the multiply code in isolation--I wanted to see if there was interest before continuing. Since I have them, I attached my initial testbed (new.c is the final version), and here is an exhaustive test for the "cannot_overflow" check used in that version: from __future__ import generators def cannot_overflow(a,b,m): M_MASK = (1 << m) - 1 HI_LO_BIT = 1 << ((m/2) - 1) OVL_IMPOSSIBLE_MASK = HI_LO_BIT - 1 OVL_POSSIBLE_MASK = M_MASK ^ OVL_IMPOSSIBLE_MASK if a < 0: check = OVL_POSSIBLE_MASK else: check = 0 if (a & OVL_POSSIBLE_MASK) == check: if b < 0: check = OVL_POSSIBLE_MASK else: check = 0 if (b & OVL_POSSIBLE_MASK) == check: return True return False def overflowed(a,b,m): VALID_MASK = (2**(m - 1)) - 1 INVALID_MASK = (2**(2*m) - 1) ^ VALID_MASK prod = a * b highbits = prod & INVALID_MASK return (highbits != 0) and (highbits != INVALID_MASK) def checkall(m): for i in range(-(2**(m-1)),2**(m-1)): for j in range(-(2**(m-1)),2**(m-1)): if cannot_overflow(i,j,m) and overflowed(i,j,m): yield (i,j) if __name__ == "__main__": for m in range(2,11,2): print "M ==", m for i in checkall(m): print i print "END OF LIST" You can compare the "cannot_overflow" code to the code that would be added to int_mul. For the sake of minimizing impact, I limited the changes to defining UL_OVERFLOW_POSSIBLE_MASK as the equivalent of ~(2**(m/2-1) - 1) and this code, added immediately after converting the arguments into longs: { unsigned long ma = a & UL_OVERFLOW_POSSIBLE_MASK; if (ma == ((a < 0) ? UL_OVERFLOW_POSSIBLE_MASK : 0)) { unsigned long mb = b & UL_OVERFLOW_POSSIBLE_MASK; if (mb == ((b < 0) ? UL_OVERFLOW_POSSIBLE_MASK : 0)) { return PyInt_FromLong(a*b); } } } This seems pretty safe, but the payout seems too small to warrant taking this any further. Unless someone asks me to do so, I'm planning to drop it. -Jerry

[Gerald S. Williams]
Which platform? Which compiler? What was your test driver? Was this timing the mult code in isolation, or timing Python-level multiplies? Claims of small speedups are notoriously platform- and test-dependent. If it's a mixed bag across platforms, the risk of introducing a new bug would favor leaving things alone. In the absence of a clear correctness proof, a Python simulation program demonstrating correctness exhaustively in small bases would also be helpful.
... Shall I submit a patch?
Sure, but also submit your timing harness so that people can measure the effects cross-platform and cross-compiler.

I wrote:
Shall I submit a patch?
Tim Peters wrote:
Sure, but also submit your timing harness so that people can measure the effects cross-platform and cross-compiler.
Of course. I first wanted to see if there was any interest. Integrated into Python, the speedup was much less: only 0.1% overall for this million-multiply script: import time start = time.time() a = -1000 while a <= 1000: b = 1000 while b <= 1000: b += 1 a += 1 end = time.time() base = end - start start = time.time() a = -1000 while a <= 1000: b = -1000 while b <= 1000: b += 1 c = a * b a += 1 end = time.time() print end - start - base The difference in speedup is an interesting data point on Python overhead in general (it took just over 6 seconds to do 1,000,000 multiplies on my system).
I doubt you're still interested, but I used Cygwin on a P3/833 Thinkpad running Win2K (GCC compiler). My original results were on just the multiply code in isolation--I wanted to see if there was interest before continuing. Since I have them, I attached my initial testbed (new.c is the final version), and here is an exhaustive test for the "cannot_overflow" check used in that version: from __future__ import generators def cannot_overflow(a,b,m): M_MASK = (1 << m) - 1 HI_LO_BIT = 1 << ((m/2) - 1) OVL_IMPOSSIBLE_MASK = HI_LO_BIT - 1 OVL_POSSIBLE_MASK = M_MASK ^ OVL_IMPOSSIBLE_MASK if a < 0: check = OVL_POSSIBLE_MASK else: check = 0 if (a & OVL_POSSIBLE_MASK) == check: if b < 0: check = OVL_POSSIBLE_MASK else: check = 0 if (b & OVL_POSSIBLE_MASK) == check: return True return False def overflowed(a,b,m): VALID_MASK = (2**(m - 1)) - 1 INVALID_MASK = (2**(2*m) - 1) ^ VALID_MASK prod = a * b highbits = prod & INVALID_MASK return (highbits != 0) and (highbits != INVALID_MASK) def checkall(m): for i in range(-(2**(m-1)),2**(m-1)): for j in range(-(2**(m-1)),2**(m-1)): if cannot_overflow(i,j,m) and overflowed(i,j,m): yield (i,j) if __name__ == "__main__": for m in range(2,11,2): print "M ==", m for i in checkall(m): print i print "END OF LIST" You can compare the "cannot_overflow" code to the code that would be added to int_mul. For the sake of minimizing impact, I limited the changes to defining UL_OVERFLOW_POSSIBLE_MASK as the equivalent of ~(2**(m/2-1) - 1) and this code, added immediately after converting the arguments into longs: { unsigned long ma = a & UL_OVERFLOW_POSSIBLE_MASK; if (ma == ((a < 0) ? UL_OVERFLOW_POSSIBLE_MASK : 0)) { unsigned long mb = b & UL_OVERFLOW_POSSIBLE_MASK; if (mb == ((b < 0) ? UL_OVERFLOW_POSSIBLE_MASK : 0)) { return PyInt_FromLong(a*b); } } } This seems pretty safe, but the payout seems too small to warrant taking this any further. Unless someone asks me to do so, I'm planning to drop it. -Jerry
participants (2)
-
Gerald S. Williams
-
Tim Peters