Re: [Python-Dev] Memory size overflows

Guido van Rossum wrote:
Why not use the same trick for multiplication? For src1,src2 > 0, dest should always be >= MAX(src1,src2). SAFE_MULTIPLY could be implemented something like this:

```c
#define HALF_BITS (sizeof(size_t) * 4U)
#define LO_MASK   ((((size_t)1) << (HALF_BITS)) - 1)
#define HI_MASK   (~(LO_MASK))
#define MAX(a,b)  (((a) >= (b)) ? (a) : (b))

#define SAFE_MULTIPLY(dest,src1,src2,on_error) \
{ \
    size_t _x = src1; \
    size_t _y = src2; \
    size_t _dest = _x * _y; \
 \
    if (_x && _y && ((_x|_y) & HI_MASK)) \
    { \
        if (_dest < MAX(_x,_y)) \
        { \
            on_error; \
        } \
    } \
 \
    dest = _dest; \
}
```

-Jerry Williams

On Wed, Oct 16, 2002 at 11:12:05AM -0400, Gerald S. Williams wrote:
Is this correct for all operands? Consider 4-bit types, with valid values 0..15: 3*9 == 27, but 27%16 == 11, and 11 > max(3, 9). There seem to be 15 distinct operand pairs for which this fails in the 4-bit case.

```python
def check(m):
    for x in range(m):
        for y in range(x, m):
            z = x*y
            w = z % m
            if z != w and w > x and w > y:
                yield x, y, z, w
```

```python
>>> print list(check(16))
[(3, 9, 27, 11), (3, 10, 30, 14), (4, 6, 24, 8), (4, 7, 28, 12), (4, 11, 44, 12), (5, 5, 25, 9), (5, 6, 30, 14), (5, 9, 45, 13), (6, 7, 42, 10), (6, 10, 60, 12), (6, 13, 78, 14), (7, 9, 63, 15), (7, 11, 77, 13), (10, 11, 110, 14), (11, 13, 143, 15)]
>>> print len(list(check(256)))
9915
```

Jeff

Jeff Epler wrote:
You're right. It's a case of "fingers quicker than the brain". There may be something worth investigating in that type of relationship, but that isn't the answer. I came up with it as I was responding with a somewhat more workable answer. What I was originally going to post before my mental lapse was this. It's one solution to what Guido asked for: it avoids the division, but I don't think it's worth it (it doesn't make much difference on my platform):

```c
#define FULL_BITS (sizeof(size_t) * 8U)
#define TOP_BIT   (((size_t)1) << ((FULL_BITS)-1))
#define HALF_BITS (sizeof(size_t) * 4U)
#define LO_MASK   ((((size_t)1) << (HALF_BITS)) - 1)
#define HI_MASK   (~(LO_MASK))

#define SAFE_MULTIPLY(dest,src1,src2,on_error) \
{ \
    size_t _x = src1; \
    size_t _y = src2; \
 \
    dest = _x * _y; \
 \
    if ((_x|_y) & HI_MASK) \
    { \
        size_t h; \
        size_t l; \
        size_t mask; \
        size_t low_bit; \
        size_t low_bit_mask; \
 \
        if (_x >= _y) \
        { \
            h = _x; \
            l = _y; \
        } \
        else \
        { \
            h = _y; \
            l = _x; \
        } \
 \
        for (low_bit=0,mask=TOP_BIT; mask & HI_MASK; ++low_bit,(mask>>=1)) \
        { \
            if (h & mask) \
            { \
                break; \
            } \
        } \
 \
        low_bit_mask = (size_t)1 << low_bit; \
 \
        do \
        { \
            if (l & mask) \
            { \
                on_error; \
            } \
            mask >>= 1; \
        } \
        while (!(mask & low_bit_mask)); \
    } \
}
```

Well, at least it demonstrates how to avoid multiply-invoking its parameters. :-)

-Jerry

Actually, my original solution doesn't stand up to comprehensive testing, either. Essentially all it was doing was checking whether the sum of the high-order bit positions was less than the number of bits. I created a Python equivalent for smaller-width numbers. At a width of 6 bits, the following combinations aren't reported as overflows but should be: (6, 13) (6, 14) (6, 15) (7, 11) (7, 12) (7, 13) (7, 14) (7, 15). Perhaps there is some property of the false negatives that can be exploited. I don't know if it helps, but there are certainly a number of ways to narrow the field:

- if (src1 & HI_MASK) && (src2 & HI_MASK), overflow = True
- if !(src1 & HI_MASK) && !(src2 & HI_MASK), overflow = False
- if !src1 || !src2, overflow = False
- if dest < MAX(src1,src2), overflow = True
- if log2(src1) + log2(src2) >= width, overflow = True (the integer form of log2 can be computed quite cheaply by exploiting known properties of src1 and src2 at this point)

But you'd still need to filter out false negatives from that field. Perhaps there's a way, but divides would have to be really expensive for this to be worth it.

-Jerry
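The narrowing rules above can be sanity-checked exhaustively at a small width. The following is a sketch of my own (the names `quick_filter`, `WIDTH`, and the 8-bit width are not from the thread): the cheap rules return a verdict where they can and `None` where a full check would still be needed.

```python
WIDTH = 8
MOD = 1 << WIDTH
HI_MASK = ((1 << (WIDTH // 2)) - 1) << (WIDTH // 2)  # top-half bits

def quick_filter(x, y):
    """Apply the cheap rules; return True/False when they decide, else None."""
    if not x or not y:
        return False                      # a zero operand never overflows
    if (x & HI_MASK) and (y & HI_MASK):
        return True                       # both top halves occupied: product >= 2**WIDTH
    if not (x & HI_MASK) and not (y & HI_MASK):
        return False                      # both fit in the bottom half: product < 2**WIDTH
    dest = (x * y) % MOD
    if dest < max(x, y):
        return True                       # a wrapped result is a giveaway
    return None                           # undecided: needs a full check

# Count pairs the rules leave undecided, and verify they are never wrong.
undecided = sum(1 for x in range(MOD) for y in range(MOD)
                if quick_filter(x, y) is None)
wrong = sum(1 for x in range(MOD) for y in range(MOD)
            if quick_filter(x, y) not in (None, x * y >= MOD))
```

At 8 bits the rules are sound (`wrong` is zero) but incomplete (`undecided` is nonzero), which matches the point above: the cheap tests only narrow the field, they don't replace the divide or the log2 tiebreak.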

I felt obliged to post a working copy of SAFE_MULTIPLY. On my platform it's slower than simply dividing by one of the multiplicands and comparing to the other, but it does get the job done without division. This is what I started out to write before my brain short-circuited. Here's a simplified Python representation of the algorithm:

```python
import math

def check(x, y, m):
    xrem, xlog = math.frexp(x)
    yrem, ylog = math.frexp(y)
    if xlog + ylog == m + 1:
        return (xrem * yrem >= 0.5)
    else:
        return xlog + ylog > m + 1
```

And here it is in C:

```c
#define FULL_BITS (sizeof(size_t) * 8U)
#define TOP_BIT   (((size_t)1) << ((FULL_BITS)-1))
#define HALF_BITS (sizeof(size_t) * 4U)

#define SAFE_MULTIPLY(dest,src1,src2,on_overflow) \
{ \
    size_t _x = src1; \
    size_t _y = src2; \
    size_t _dest = _x * _y; \
    size_t h; \
    int hlog; \
    size_t l; \
    int llog; \
    size_t mask; \
 \
    dest = _dest; \
 \
    if (_x >= _y) \
    { \
        h = _x; \
        l = _y; \
    } \
    else \
    { \
        h = _y; \
        l = _x; \
    } \
 \
    if (l) \
    { \
        for ((mask=TOP_BIT),(hlog=FULL_BITS-1); !(mask & h); mask >>= 1) \
        { \
            --hlog; \
        } \
        if (hlog >= HALF_BITS) \
        { \
            for (llog = hlog; !(mask & l); mask >>= 1) \
            { \
                --llog; \
            } \
            if (((hlog + llog) > (FULL_BITS - 1)) || \
                (((hlog + llog) == (FULL_BITS - 1)) && \
                 (((h >> 1) * l + ((h & 1) ? (l >> 1) : 0)) >= TOP_BIT))) \
            { \
                on_overflow; \
            } \
        } \
    } \
}
```

-Jerry
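The frexp-based test above can be verified exhaustively at a small word width. This harness is my addition, not from the thread; 8 bits keeps the doubles exact (products fit well inside a double's mantissa) and the scan fast.

```python
import math

def check(x, y, m):
    # Overflow test from the message: compare the sum of the binary
    # exponents against the word width, with a mantissa tiebreak when
    # the sum lands exactly on the boundary.
    xrem, xlog = math.frexp(x)
    yrem, ylog = math.frexp(y)
    if xlog + ylog == m + 1:
        return xrem * yrem >= 0.5
    return xlog + ylog > m + 1

M = 8
for x in range(1, 1 << M):
    for y in range(1, 1 << M):
        # Compare against exact integer arithmetic for every pair.
        assert check(x, y, M) == (x * y >= (1 << M)), (x, y)
```

The boundary case is the interesting one: with xlog + ylog == m + 1 the true product is xrem * yrem * 2**(m+1) with xrem * yrem in [0.25, 1), so it overflows exactly when the mantissa product reaches 0.5.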

Here's a version of my previous algorithm with the checks added back in to improve overall performance. This one runs about as fast as the divide-and-compare algorithm for my test cases. :-) OK, I'm done tweaking now...

```c
#define FULL_BITS (sizeof(size_t) * 8U)
#define TOP_BIT   (((size_t)1) << ((FULL_BITS)-1))
#define HALF_BITS (sizeof(size_t) * 4U)
#define MID_BIT   (((size_t)1) << ((HALF_BITS)-1))
#define LO_MASK   ((((size_t)1) << (HALF_BITS)) - 1)
#define HI_MASK   (~(LO_MASK))

#define SAFE_MULTIPLY(dest,src1,src2,on_overflow) \
{ \
    size_t _x = src1; \
    size_t _y = src2; \
    size_t _dest = _x * _y; \
 \
    dest = _dest; \
 \
    if ((_x | _y) & HI_MASK) \
    { \
        size_t h; \
        int hlog; \
        size_t l; \
        int llog; \
        size_t mask; \
 \
        if (_x >= _y) \
        { \
            h = _x; \
            l = _y; \
        } \
        else \
        { \
            h = _y; \
            l = _x; \
        } \
 \
        if ((h & HI_MASK) && (l)) \
        { \
            if (l & HI_MASK) \
            { \
                hlog = llog = FULL_BITS; \
            } \
            else \
            { \
                for ((mask=TOP_BIT),(hlog=FULL_BITS-1); !(mask&h); mask>>=1) \
                { \
                    --hlog; \
                } \
                for ((mask=MID_BIT),(llog=HALF_BITS-1); !(mask&l); mask>>=1) \
                { \
                    --llog; \
                } \
            } \
            if ((hlog + llog > FULL_BITS - 1) || \
                ((hlog + llog == FULL_BITS - 1) && !(_dest & TOP_BIT))) \
            { \
                on_overflow; \
            } \
        } \
    } \
}
```

-Jerry

Hello, On Thu, 17 Oct 2002, Gerald S. Williams wrote:
We might be running into code bloat, though. If we go for such big macros, I believe we should design them so that they run fast in the common case and call a helper function otherwise. Armin

Armin Rigo wrote:
Easily done, but first you have to ask yourself if it's really ever more efficient than this:

```c
#define SAFE_MULTIPLY(dest,src1,src2,on_overflow) \
{ \
    size_t _x = src1; \
    size_t _y = src2; \
    size_t _dest = _x * _y; \
 \
    dest = _dest; \
 \
    if ((_x | _y) & HI_MASK) \
    { \
        /* _x == 0 can't overflow; the test also guards the division */ \
        if (_x && ((_dest / _x) != _y)) \
        { \
            on_overflow; \
        } \
    } \
}
```

If so, here's a macro/helper version:

```c
#define FULL_BITS (sizeof(size_t) * 8U)
#define TOP_BIT   (((size_t)1) << ((FULL_BITS)-1))
#define HALF_BITS (sizeof(size_t) * 4U)
#define MID_BIT   (((size_t)1) << ((HALF_BITS)-1))
#define LO_MASK   ((((size_t)1) << (HALF_BITS)) - 1)
#define HI_MASK   (~(LO_MASK))

#define SAFE_MULTIPLY(dest,src1,src2,on_overflow) \
{ \
    size_t _x = src1; \
    size_t _y = src2; \
    size_t _dest = _x * _y; \
 \
    dest = _dest; \
 \
    if ((_x | _y) & HI_MASK) \
    { \
        if (_safe_multiply_check_for_overflow(_x,_y,_dest)) \
        { \
            on_overflow; \
        } \
    } \
}

int _safe_multiply_check_for_overflow(size_t x, size_t y, size_t dest)
{
    size_t h;
    size_t l;

    if (x >= y)
    {
        h = x;
        l = y;
    }
    else
    {
        h = y;
        l = x;
    }

    if ((h & HI_MASK) && (l))
    {
        int hlog;
        int llog;

        if (l & HI_MASK)
        {
            hlog = llog = FULL_BITS;
        }
        else
        {
            size_t mask;

            for ((mask=TOP_BIT),(hlog=FULL_BITS-1); !(mask&h); mask>>=1)
            {
                --hlog;
            }
            for ((mask=MID_BIT),(llog=HALF_BITS-1); !(mask&l); mask>>=1)
            {
                --llog;
            }
        }
        if ((hlog + llog > FULL_BITS - 1) ||
            ((hlog + llog == FULL_BITS - 1) && !(dest & TOP_BIT)))
        {
            return 1;
        }
    }
    return 0;
}
```

Of course there are also the alternatives of using floating-point numbers (assuming the mantissa has as many bits as size_t), or a double-width intermediate representation, if available. There are also some other tricks that can be used to optimize this solution. The time-consuming part is finding out if the sum of the log2's of the multiplicands (i.e., highest bit positions) is less than, greater than, or equal to <bits in size_t> - 1. But I said I was done tweaking for now. :-)

-Jerry
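The divide-and-compare idea can be modeled and exhaustively checked at a small width. This is my own sketch (the 8-bit width and the names are assumptions, not from the thread); note the zero-operand guard, which the raw macro needs to avoid a division by zero when src1 is zero but src2 has high bits set.

```python
WIDTH = 8
MOD = 1 << WIDTH
HALF = WIDTH // 2
HI_MASK = ((1 << HALF) - 1) << HALF

def safe_multiply(x, y):
    """Return (dest, overflowed) for a WIDTH-bit unsigned multiply,
    using divide-and-compare only when a high-half bit is present."""
    dest = (x * y) % MOD          # what the truncating C multiply leaves
    if (x | y) & HI_MASK:
        if x and dest // x != y:  # x == 0 can't overflow; also guards the divide
            return dest, True
    return dest, False

# Exhaustive check against exact arithmetic.
for x in range(MOD):
    for y in range(MOD):
        _, over = safe_multiply(x, y)
        assert over == (x * y >= MOD), (x, y)
```

The check is exact: if no overflow occurred, dest equals x*y and dest // x recovers y; if it did, dest is smaller than x*y, so the quotient can't be y.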

I wrote:
But I said I was done tweaking for now. :-)
Well, I was done tweaking, but then I decided to try out one more performance tweak. This is in desperate need of commenting, but it's finally faster than checking via division on my machine:

```c
#define FULL_BITS (sizeof(size_t) * 8U)
#define TOP_BIT   (((size_t)1) << ((FULL_BITS)-1))
#define HALF_BITS (sizeof(size_t) * 4U)
#define MID_BIT   (((size_t)1) << ((HALF_BITS)-1))
#define LO_MASK   ((((size_t)1) << (HALF_BITS)) - 1)
#define HI_MASK   (~(LO_MASK))
#define Q_BITS    (sizeof(size_t) * 2U)
#define TQ_BITS   (sizeof(size_t) * 6U)
#define TQ_BIT    (((size_t)1) << ((TQ_BITS)-1))
#define LTQ_MASK  ((((size_t)1) << (TQ_BITS)) - 1)
#define TQ_MASK   (~(LTQ_MASK))

#define SAFE_MULTIPLY(dest,src1,src2,on_overflow) \
{ \
    size_t _x = src1; \
    size_t _y = src2; \
    size_t _dest = _x * _y; \
 \
    dest = _dest; \
 \
    if ((_x | _y) & HI_MASK) \
    { \
        if (_safe_multiply_check_for_overflow(_x,_y,_dest)) \
        { \
            on_overflow; \
        } \
    } \
}

int _safe_multiply_check_for_overflow(size_t h, size_t l, size_t dest)
{
    if (l > h)
    {
        size_t temp = l;
        l = h;
        h = temp;
    }

    if ((h & HI_MASK) && (l))
    {
        size_t lgt;
        size_t leq;

        if (l & HI_MASK)
        {
            lgt = (size_t)(-1);
            leq = 0;
        }
        else
        {
            size_t hbit;
            size_t mask;

            if (h & TQ_MASK)
            {
                for ((mask=TOP_BIT),(hbit=0); !(mask&h); mask>>=1)
                {
                    ++hbit;
                }
            }
            else
            {
                for ((mask=TQ_BIT),(hbit=Q_BITS); !(mask&h); mask>>=1)
                {
                    ++hbit;
                }
            }
            leq = (size_t)1 << hbit;  /* cast so the shift is done in size_t */
            mask = leq - 1;
            lgt = ~mask ^ leq;
        }
        if ((l & lgt) || ((l & leq) && !(dest & TOP_BIT)))
        {
            return 1;
        }
    }
    return 0;
}
```

-Jerry

[Gerald S. Williams]
If you've actually got the energy to time these things <wink>, how about adapting the int-mul overflow checking code in Python's intobject.c's int_mul() function too? The CVS (ditto 2.2.2) version

+ does rely on floating point
+ doesn't use division
+ does work for signed integral types
+ doesn't assume that the number of mantissa bits in a double is enough to hold a full-precision integral product
+ does assume that the largest integral product is within the dynamic range of a double

It could be simplified a little if it were specialized to unsigned mult.

Tim Peters wrote:
I'll take a look.
It could be simplified a little if it were specialized to unsigned mult.
You're simply referring to breaking down the operation into "remove sign", multiply, and "generate sign" stages, right? I'll definitely keep it in mind. -Jerry Williams

[Tim]
[Gerald S. Williams]
I'll take a look.
Thanks!
It could be simplified a little if it were specialized to unsigned mult.
Goodness no -- that kind of cleverness is usually buggy, and usually by failing to account for the fact that the moral equivalent of -sys.maxint-1 is a fine product, but sys.maxint+1 isn't. I was talking specifically about the based-on-doubles int_mul(). In particular, the slow-path line

```c
const double absprod = doubleprod >= 0.0 ? doubleprod : -doubleprod;
```

isn't needed if we know the product is >= 0.
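The flavor of the doubles-based detection can be sketched in Python. This is an illustrative model of mine, not the actual int_mul() source: a 16-bit word is used so that every product fits exactly in a double, letting exact equality stand in for the closeness test the real code needs (32-bit products can exceed a double's 53-bit mantissa).

```python
def c_long_mul(a, b, bits=16):
    """Wrap the exact product to a signed two's-complement result,
    modeling what a truncating C integer multiply leaves behind."""
    m = 1 << bits
    r = (a * b) % m
    return r - m if r >= m // 2 else r

def overflowed(a, b, bits=16):
    """Doubles-based detection: if the wrapped integer product matches
    the product computed in floating point, no overflow occurred.
    Exact at 16 bits, since products fit in the double's mantissa."""
    doubleprod = float(a) * float(b)
    longprod = c_long_mul(a, b, bits)
    return float(longprod) != doubleprod
```

This avoids both division and sign-splitting branches, which is the appeal of the int_mul() approach: one multiply in each domain and one comparison.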

Tim Peters wrote:
Agreed. Also, signed values needn't be 2's-complement numbers, and the behavior of signed overflow isn't guaranteed (it could saturate, for instance). The use of doubles makes sense from a portability standpoint. Or you could break out the signs and use unsigned longs with preprocessor (and configure) checks, being careful about performance impact. :-) For example, you might need to do something equivalent to this:

```c
#if HAVE_LIMITS_H
#define MAX_POSITIVE_LONG_VALUE (LONG_MAX)
#if -(LONG_MAX) == LONG_MIN
#define MAX_NEGATIVE_LONG_VALUE (LONG_MAX)
#elif (-(LONG_MAX) - 1) == LONG_MIN
#define MAX_NEGATIVE_LONG_VALUE ((LONG_MAX) + 1)
#else
#undef MAX_POSITIVE_LONG_VALUE
...
```

-Jerry

[Gerald S. Williams]
Agreed. Also, signed values needn't be 2's complement numbers,
Python assumes 2's comp in several places already.
and the behavior of signed overflow isn't guaranteed (it could saturate, for instance).
But it doesn't assume anything about what happens on overflow, for either doubles or for signed integral types.
The use of doubles makes sense from a portability standpoint.
I thought you were going to time this <wink>?
Breaking out the signs means tests and branches. That's something the int_mul() code in Python tries to avoid. Unfortunately, there's a ton of tests and branches before we ever get to the multiplication, trying to guess whether this is really a "sequence * integer" case. But once we get to actually multiplying two ints, the fast path has only one test+branch (did it overflow?).

Tim Peters wrote:
Python assumes 2's comp in several places already.
Yeah, I just noticed that.
I thought you were going to time this <wink>?
I am. If you assume that, for the vast majority of multiplications, both multiplicands are in the range -sqrt(sys.maxint):sqrt(sys.maxint), the following gives a 15% performance improvement for me:

```c
unsigned long sa = (a < 0L);
unsigned long sb = (b < 0L);
unsigned long ua = sa ? -a : a;
unsigned long ub = sb ? -b : b;

if ((ua | ub) & OVERFLOW_POSSIBLE_MASK)
{
    <CURRENT IMPLEMENTATION, AS IS>
}
```

If both numbers are evenly distributed from -sys.maxint:sys.maxint, this yields about a 3% slowdown, though. These numbers are just for the core multiply path, since I pulled it out in order to compare algorithms better. I'm looking into improving the rest, but it's difficult.

-Jerry
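The fast-path filter above can be modeled in Python. The value of OVERFLOW_POSSIBLE_MASK is my assumption (the snippet doesn't define it): for a 32-bit long it keeps every magnitude bit at position 15 or higher, since two magnitudes below 2**15 give a product below 2**30, safely inside the signed range.

```python
BITS = 32
# Mask of magnitude bits that make overflow possible: anything at or
# above bit floor(BITS/2) - 1.  In Python, ~((1 << 15) - 1) works on
# unbounded non-negative ints just like the C mask would.
OVERFLOW_POSSIBLE_MASK = ~((1 << (BITS // 2 - 1)) - 1)

def needs_full_check(a, b):
    """True when the cheap mask test cannot rule out overflow,
    i.e. when the slow overflow-checking path must still run."""
    ua = -a if a < 0 else a
    ub = -b if b < 0 else b
    return bool((ua | ub) & OVERFLOW_POSSIBLE_MASK)
```

When this returns False, |a| and |b| are both below 2**15, so |a*b| < 2**30 and the product cannot leave the 32-bit signed range; the trade-off measured above is just this extra test on the paths that still fall through.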

participants (4)

- Armin Rigo
- Gerald S. Williams
- Jeff Epler
- Tim Peters