Re: [Patches] make 'b','h','i' raise overflow exception
On Mon, May 08, 2000 at 10:00:30AM -0400, Guido van Rossum wrote:
Changes the 'b', 'h', and 'i' formatters in PyArg_ParseTuple to raise an Overflow exception if they overflow (previously they just silently overflowed).
Trent,
There's one issue with this: I believe the 'b' format is mostly used with unsigned character arguments in practice. However on systems with default signed characters, CHAR_MAX is 127 and values 128-255 are rejected. I'll change the overflow test to:
else if (ival > CHAR_MAX && ival >= 256) {
if that's okay with you.
Okay, I guess. Two things:

1. In a way this defeats the main purpose of the checks. Now a silent overflow could happen for a signed byte value over CHAR_MAX. The only way to automatically do the bounds checking is if the exact type is known, i.e. different formatters for signed and unsigned integral values. I don't know if this is desired (is it?). The obvious choice of 'u' prefixes to specify unsigned is obviously not an option.

Another option might be to document 'b' as for unsigned chars and 'h', 'i', 'l' as signed integral values and then set the bounds checks ([0, UCHAR_MAX] for 'b') appropriately. Can we clamp these formatters so? I.e. we would be limiting the user to unsigned or signed depending on the formatter. (Which again, means that it would be nice to have different formatters for signed and unsigned.) I think that the bounds checking is false security unless these restrictions are made.

2. The above aside, I would be more inclined to change the line in question to:

else if (ival > UCHAR_MAX) {

as this is more explicit about what is being done.
Another issue however is that there are probably cases where an 'i' format is used (which can't overflow on 32-bit architectures) but where the int value is then copied into a short field without an additional check... I'm not sure how to fix this except by a complete inspection of all code... Not clear if it's worth it.
Yes, a complete code inspection seems to be the only way. That is some of what I am doing. Again, I have two questions:

1. There are a fairly large number of downcasting cases in the Python code (not necessarily tied to PyArg_ParseTuple results). I was wondering if you think a generalized check on each such downcast would be advisable. This would take the form of some macro that would do a bounds check before doing the cast. For example (a common one is the cast of strlen's size_t return value to int, because Python strings use int for their length; this is a downcast on 64-bit systems):

size_t len = strlen(s);
obj = PyString_FromStringAndSize(s, len);

would become

size_t len = strlen(s);
obj = PyString_FromStringAndSize(s, CAST_TO_INT(len));

CAST_TO_INT would ensure that 'len' did not overflow and would raise an exception otherwise.

Pros:
- should never have to worry about overflows again
- easy to find (given MSC warnings) and easy to code in (straightforward)

Cons:
- more code, more time to execute
- looks ugly
- have to check PyErr_Occurred every time a cast is done

I would like other people's opinion on this kind of change. There are three possible answers:

+1 this is a bad change idea because...<reason>
-1 this is a good idea, go for it
+0 (most likely) This is probably a good idea for some case where the overflow *could* happen, however the strlen example that you gave is *not* such a situation. As Tim Peters said: a 2GB limit on string lengths is a good assumption/limitation.

2. Microsoft's compiler gives good warnings for casts where information loss is possible. However, I cannot find a way to get similar warnings from gcc. Does anyone know if that is possible? I.e.

int i = 123456;
short s = i; // should warn about possible loss of information

should give a compiler warning.

Thanks,
Trent

--
Trent Mick
trentm@activestate.com
[Trent]
Changes the 'b', 'h', and 'i' formatters in PyArg_ParseTuple to raise an Overflow exception if they overflow (previously they just silently overflowed).
[Guido]
There's one issue with this: I believe the 'b' format is mostly used with unsigned character arguments in practice. However on systems with default signed characters, CHAR_MAX is 127 and values 128-255 are rejected. I'll change the overflow test to:
else if (ival > CHAR_MAX && ival >= 256) {
if that's okay with you.
[Trent]
Okay, I guess. Two things:
1. In a way this defeats the main purpose of the checks. Now a silent overflow could happen for a signed byte value over CHAR_MAX. The only way to automatically do the bounds checking is if the exact type is known, i.e. different formatters for signed and unsigned integral values. I don't know if this is desired (is it?). The obvious choice of 'u' prefixes to specify unsigned is obviously not an option.
The struct module uses upper case for unsigned. I think this is overkill here, and would add a lot of code (if applied systematically) that would rarely be used.
Another option might be to document 'b' as for unsigned chars and 'h', 'i', 'l' as signed integral values and then set the bounds checks ([0, UCHAR_MAX] for 'b') appropriately. Can we clamp these formatters so? I.e. we would be limiting the user to unsigned or signed depending on the formatter. (Which again, means that it would be nice to have different formatters for signed and unsigned.) I think that the bounds checking is false security unless these restrictions are made.
I like this: 'b' is unsigned, the others are signed.
2. The above aside, I would be more inclined to change the line in question to:
else if (ival > UCHAR_MAX) {
as this is more explicit about what is being done.
Agreed.
Another issue however is that there are probably cases where an 'i' format is used (which can't overflow on 32-bit architectures) but where the int value is then copied into a short field without an additional check... I'm not sure how to fix this except by a complete inspection of all code... Not clear if it's worth it.
Yes, a complete code inspection seems to be the only way. That is some of what I am doing. Again, I have two questions:
1. There are a fairly large number of downcasting cases in the Python code (not necessarily tied to PyArg_ParseTuple results). I was wondering if you think a generalized check on each such downcast would be advisable. This would take the form of some macro that would do a bounds check before doing the cast. For example (a common one is the cast of strlen's size_t return value to int, because Python strings use int for their length, this is a downcast on 64-bit systems):
size_t len = strlen(s);
obj = PyString_FromStringAndSize(s, len);
would become
size_t len = strlen(s);
obj = PyString_FromStringAndSize(s, CAST_TO_INT(len));
CAST_TO_INT would ensure that 'len' did not overflow and would raise an exception otherwise.
Pros:
- should never have to worry about overflows again
- easy to find (given MSC warnings) and easy to code in (straightforward)
Cons:
- more code, more time to execute
- looks ugly
- have to check PyErr_Occurred every time a cast is done
How would the CAST_TO_INT macro signal an error? C doesn't have exceptions. If we have to add checks, I'd prefer to write

size_t len = strlen(s);
if (INT_OVERFLOW(len))
    return NULL; /* Or whatever is appropriate in this context */
obj = PyString_FromStringAndSize(s, len);
I would like other people's opinion on this kind of change. There are three possible answers:
+1 this is a bad change idea because...<reason>
-1 this is a good idea, go for it
+0 (most likely) This is probably a good idea for some case where the overflow *could* happen, however the strlen example that you gave is *not* such a situation. As Tim Peters said: a 2GB limit on string lengths is a good assumption/limitation.
-0
2. Microsoft's compiler gives good warnings for casts where information loss is possible. However, I cannot find a way to get similar warnings from gcc. Does anyone know if that is possible? I.e.
int i = 123456;
short s = i; // should warn about possible loss of information
should give a compiler warning.
Beats me :-(

--Guido van Rossum (home page: http://www.python.org/~guido/)
[Trent]
1. There are a fairly large number of downcasting cases in the Python code (not necessarily tied to PyArg_ParseTuple results). I was wondering if you think a generalized check on each such downcast would be advisable. This would take the form of some macro that would do a bounds check before doing the cast. For example (a common one is the cast of strlen's size_t return value to int, because Python strings use int for their length, this is a downcast on 64-bit systems):
size_t len = strlen(s);
obj = PyString_FromStringAndSize(s, len);
would become
size_t len = strlen(s);
obj = PyString_FromStringAndSize(s, CAST_TO_INT(len));
CAST_TO_INT would ensure that 'len' did not overflow and would raise an exception otherwise.
[Guido]
How would the CAST_TO_INT macro signal an error? C doesn't have exceptions. If we have to add checks, I'd prefer to write
size_t len = strlen(s);
if (INT_OVERFLOW(len))
    return NULL; /* Or whatever is appropriate in this context */
obj = PyString_FromStringAndSize(s, len);
Of course we have to add checks -- strlen doesn't return an int! It hasn't since about a year after Python was first written (ANSI C changed the rules, and Python is long overdue in catching up -- if you want people to stop passing multiple args to append, set a good example in our use of C <0.5 wink>).

[Trent]
I would like other people's opinion on this kind of change. There are three possible answers:
Please don't change the rating scheme we've been using: -1 is a veto, +1 is a hurrah, -0 and +0 are obvious <ahem>.
+1 this is a bad change idea because...<reason>
-1 this is a good idea, go for it
That one, except spelled +1.
+0 (most likely) This is probably a good idea for some case where the overflow *could* happen, however the strlen example that you gave is *not* such a situation. As Tim Peters said: a 2GB limit on string lengths is a good assumption/limitation.
No, it's a defensible limitation, but it's *never* a valid assumption. The check isn't needed anywhere we can prove a priori that it could never fail (in which case we're not assuming anything), but it's always needed when we can't so prove (in which case skipping the check would be a bad assumption). In the absence of any context, your strlen example above definitely needs the check.

An alternative would be to promote the size member from int to size_t; that's no actual change on the 32-bit machines Guido generally assumes without realizing it, and removes an arbitrary (albeit defensible) limitation on some 64-bit machines at the cost of (just possibly, due to alignment vagaries) boosting var objects' header size on the latter.

correctness-doesn't-happen-by-accident-ly y'rs - tim
An alternative would be to promote the size member from int to size_t; that's no actual change on the 32-bit machines Guido generally assumes without realizing it, and removes an arbitrary (albeit defensible) limitation on some 64-bit machines at the cost of (just possibly, due to alignment vagaries) boosting var objects' header size on the latter.
Then the signatures of many, many functions would have to be changed to take or return size_t, too -- almost anything in the Python/C API that *conceptually* is a size_t is declared as int; the ob_size field is only the tip of the iceberg. We'd also have to change the size of Python ints (currently long) to an integral type that can hold a size_t; on Windows (and I believe *only* on Windows) this is a long long, or however they spell it (except size_t is typically unsigned). This all is a major reworking -- not good for 1.6, even though I agree it needs to be done eventually.

--Guido van Rossum (home page: http://www.python.org/~guido/)
[Trent]
Another option might be to document 'b' as for unsigned chars and 'h', 'i', 'l' as signed integral values and then set the bounds checks ([0, UCHAR_MAX] for 'b') appropriately. Can we clamp these formatters so? I.e. we would be limiting the user to unsigned or signed depending on the formatter. (Which again, means that it would be nice to have different formatters for signed and unsigned.) I think that the bounds checking is false security unless these restrictions are made.

[Guido]
I like this: 'b' is unsigned, the others are signed.
Okay, I will submit a patch for this then. The 'b' formatter will limit values to [0, UCHAR_MAX].
[Trent]
1. There are a fairly large number of downcasting cases in the Python code (not necessarily tied to PyArg_ParseTuple results). I was wondering if you think a generalized check on each such downcast would be advisable. This would take the form of some macro that would do a bounds check before doing the cast. For example (a common one is the cast of strlen's size_t return value to int, because Python strings use int for their length, this is a downcast on 64-bit systems):
size_t len = strlen(s);
obj = PyString_FromStringAndSize(s, len);
would become
size_t len = strlen(s);
obj = PyString_FromStringAndSize(s, CAST_TO_INT(len));
CAST_TO_INT would ensure that 'len' did not overflow and would raise an exception otherwise.
[Guido]
How would the CAST_TO_INT macro signal an error? C doesn't have exceptions. If we have to add checks, I'd prefer to write
size_t len = strlen(s);
if (INT_OVERFLOW(len))
    return NULL; /* Or whatever is appropriate in this context */
obj = PyString_FromStringAndSize(s, len);
[Tim]
Of course we have to add checks -- strlen doesn't return an int! It hasn't since about a year after Python was first written (ANSI C changed the rules, and Python is long overdue in catching up -- if you want people to stop passing multiple args to append, set a good example in our use of C <0.5 wink>).
The check isn't needed anywhere we can prove a priori that it could never fail (in which case we're not assuming anything), but it's always needed when we can't so prove (in which case skipping the check would be a bad assumption). In the absence of any context, your strlen example above definitely needs the check.
Okay, I just wanted a go-ahead that this kind of thing was desired. I will try to find the points where these overflows *can* happen and then I'll add checks in a manner closer to Guido's syntax above.
[Trent]
I would like other people's opinion on this kind of change. There are three possible answers:
Please don't change the rating scheme we've been using: -1 is a veto, +1 is a hurrah, -0 and +0 are obvious <ahem>.
+1 this is a bad change idea because...<reason>
-1 this is a good idea, go for it
Whoa, sorry Tim. I mixed up the +/- there. I did not intend to change the voting system.

[Tim]
An alternative would be to promote the size member from int to size_t; that's no actual change on the 32-bit machines Guido generally assumes without realizing it, and removes an arbitrary (albeit defensible) limitation on some 64-bit machines at the cost of (just possibly, due to alignment vagaries) boosting var objects' header size on the latter.
I agree with Guido that this is too big an immediate change. I'll just try to find and catch the possible overflows.

Thanks,
Trent

--
Trent Mick
trentm@activestate.com
participants (3):
- Guido van Rossum
- Tim Peters
- Trent Mick