issues with int/long on 64bit platforms - eg stringobject (PR#306)

The email below is a serious bug report. A quick analysis shows that UserString.count() calls the count() method on a string object, which calls PyArg_ParseTuple() with the format string "O|ii". The 'i' format code truncates integers. It probably should raise an overflow exception instead. But that would still cause the test to fail -- just in a different way (more explicit). Then the string methods should be fixed to use long ints instead -- and then something else would probably break...

--Guido van Rossum (home page: http://www.python.org/~guido/)

------- Forwarded Message

Date: Mon, 24 Apr 2000 19:26:27 -0400
From: mark.favas@per.dem.csiro.au
To: python-bugs-list@python.org
cc: bugs-py@python.org
Subject: [Python-bugs-list] 1.6a2 issues with int/long on 64bit platforms - eg stringobject (PR#306)

Full_Name: Mark Favas
Version: 1.6a2 CVS of 25 April
OS: DEC Alpha, Tru64 Unix 4.0F
Submission from: wa107.dialup.csiro.au (130.116.4.107)

There seem to be issues (and perhaps lurking cans of worms) on 64-bit platforms where sizeof(long) != sizeof(int). For example, the CVS version of 1.6a2 of 25 April fails the UserString regression test. The tests fail as follows (verbose set to 1):

    abcabcabc.count(('abc',)) no
    'abcabcabc' <method UserString.count of UserString instance at 140217e40> 3 <> 2
    abcabcabc.count(('abc', 1)) no
    'abcabcabc' <method UserString.count of UserString instance at 140216580> 2 <> 1
    abcdefghiabc.find(('abc', 1)) no
    'abcdefghiabc' <method UserString.find of UserString instance at 140216730> 9 <> -1
    abcdefghiabc.rfind(('abc',)) no
    'abcdefghiabc' <method UserString.rfind of UserString instance at 140216580> 9 <> 0
    abcabcabc.rindex(('abc',)) no
    'abcabcabc' <method UserString.rindex of UserString instance at 140216dc0> 6 <> 3
    abcabcabc.rindex(('abc', 1)) no
    'abcabcabc' <method UserString.rindex of UserString instance at 140216730> 6 <> 3

These tests fail because the calls from the UserString methods to the underlying string methods set the default value of the end-of-string parameter to sys.maxint, which is defined as LONG_MAX (9223372036854775807), whereas the string methods in stringobject.c use ints and expect them to be no larger than INT_MAX (2147483647). Thus the end-of-string parameter becomes -1 in the default case. The size of an int on my platform is 4 and the size of a long is 8, so the "natural size of a Python integer" should be 8, by my understanding.

The obvious fix is to change stringobject.c to use longs rather than ints, but the problem might be more widespread than that. INT_MAX is used in unicodeobject.c, pypcre.c, _sre.c, stropmodule.c, and ceval.c as well as stringobject.c. Some of these look as though LONG_MAX should have been used (the variables compared to INT_MAX are longs), but I am not confident enough to submit patches for them...

Mark

------- End of Forwarded Message
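To make the failure mechanism concrete, here is a small standalone C program (my illustration, not part of the thread) showing what the 'i' format code effectively does when handed sys.maxint on an LP64 platform such as the Alpha above:

    #include <stdio.h>
    #include <limits.h>

    int main(void)
    {
        /* On LP64 platforms, long is 64 bits and int is 32 bits, so
           sys.maxint == LONG_MAX does not fit in an int.  Storing it
           through an int keeps only the low 32 bits (the usual result
           on two's-complement hardware; strictly, the out-of-range
           conversion is implementation-defined). */
        long maxint = LONG_MAX;       /* 9223372036854775807 */
        int truncated = (int)maxint;  /* low 32 bits are all ones */

        printf("as long: %ld\n", maxint);
        printf("as int:  %d\n", truncated);  /* prints -1 on LP64 */
        return 0;
    }

An end-of-string parameter of -1 is exactly what makes the count/find/rindex results in the verbose output above come out wrong.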

Guido van Rossum wrote:
All uses in stringobject.c and unicodeobject.c use INT_MAX together with integers, so there's no problem on that side of the fence ;-) Since strings and Unicode objects (as well as most if not all other builtin sequence types) use integers to describe the length of the object, the correct default value should thus be something like sys.maxlen, which then gets set to INT_MAX. I'd suggest adding sys.maxlen and then modifying UserString.py, re.py and sre_parse.py accordingly.
--
Marc-Andre Lemburg
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
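For concreteness, MAL's suggestion would amount to a couple of lines in the sys module's initialization, next to where sys.maxint is already set up. This is only a sketch of the proposal (sys.maxlen is hypothetical; it was never actually added):

    /* Sketch, inside the sys module's init code in Modules/sysmodule.c;
       'v' and 'sysdict' are the locals that code already uses.  The
       "maxint" lines show the existing pattern; the "maxlen" lines are
       the proposed addition. */
    PyDict_SetItemString(sysdict, "maxint",
                         v = PyInt_FromLong(PyInt_GetMax()));
    Py_XDECREF(v);
    PyDict_SetItemString(sysdict, "maxlen",
                         v = PyInt_FromLong((long)INT_MAX));
    Py_XDECREF(v);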

Hm, I'm not so sure. It would be much better if passing sys.maxint would just WORK... Since that's what people have been doing so far.

--Guido van Rossum (home page: http://www.python.org/~guido/)

Guido wrote:
Hm, I'm not so sure. It would be much better if passing sys.maxint would just WORK... Since that's what people have been doing so far.
Possible solutions (I give 4 of them):

1. The 'i' format code could raise an overflow exception, and the PyArg_ParseTuple() call in string_count() could catch it and truncate to INT_MAX (reasoning that any overflow of the end position of a string can be bound to INT_MAX, because that is the limit for any string in Python). A sketch of this catch-and-clamp follows this message.

   Pros:
   - This "would just WORK" for usage of sys.maxint.

   Cons:
   - This overflow exception catching should then reasonably be propagated to other similar functions (like string.endswith(), etc.).
   - We have to assume that the exception raised in the PyArg_ParseTuple(args, "O|ii:count", &subobj, &i, &last) call is for the second integer (i.e. 'last'). This is subtle and ugly.

   Pro or con:
   - Do we want to start raising overflow exceptions for other conversion formats (i.e. 'b' and 'h' and 'l' -- the latter *can* overflow on Win64, where sizeof(long) < sizeof(void*))? I think this is a good idea in principle but may break code (even if it *does* identify bugs in that code).

2. Just change the definitions of the UserString methods to pass a variable-length argument list instead of default-value parameters. For example, change UserString.count() from:

    def count(self, sub, start=0, end=sys.maxint):
        return self.data.count(sub, start, end)

   to:

    def count(self, *args):
        return self.data.count(*args)

   The result is that the default value for 'end' is now set by string_count() rather than by the UserString implementation.

   Pros:
   - Easy change.
   - Fixes the immediate bug.
   - This is a safer way to copy the string behaviour in UserString anyway (is it not?).

   Cons:
   - Does not fix the general problem of the (common?) usage of sys.maxint to mean INT_MAX rather than the actual LONG_MAX (this matters on 64-bit Unices).
   - The UserString code is no longer really self-documenting.

3. As MAL suggested: add something like sys.maxlen (set to INT_MAX), which makes explicit the logical difference from sys.maxint (set to LONG_MAX):

   - sys.maxint == "the largest value a Python integer can hold"
   - sys.maxlen == "the largest value for the length of an object in Python (e.g. length of a string, length of an array)"

   Pros:
   - More explicit, in that it separates two distinct meanings of sys.maxint (which now makes a difference on 64-bit Unices).
   - The code changes should be fairly straightforward.

   Cons:
   - Places in the code that still use sys.maxint where they should use sys.maxlen will unknowingly overflow ints and bring about this bug.
   - Something else for coders to know about.

4. Add something like sys.maxlen, but set it to SIZET_MAX (c.f. the ANSI size_t type). It is probably not a biggie, but Python currently makes the assumption that strings never exceed INT_MAX in length. While this assumption is not likely to be proven false, it technically could be on 64-bit systems. As well, when you start compiling on Win64 (where sizeof(int) == sizeof(long) < sizeof(size_t)) you are going to be annoyed by hundreds of warnings about implicit casts from size_t (64 bits) to int (32 bits) for every strlen, str*, fwrite, and sizeof call that you make.

   Pros:
   - IMHO logically more correct.
   - Might clean up some subtle bugs.
   - Cleans up annoying and disconcerting warnings.
   - Will probably mean less pain down the road as 64-bit systems (esp. Win64) become more prevalent.

   Cons:
   - Lots of coding changes.
   - As Guido said: "and then something else would probably break". (Though on current 32-bit systems there should be no effective change; only 64-bit systems should be affected and, I would hope, the effect would be a clean-up.)

I apologize for not being succinct. Note that I am volunteering here. Opinions and guidance please.

Trent
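To make option 1 concrete, here is a rough sketch (my illustration, not a posted patch) of the catch-and-clamp inside string_count(), assuming a hypothetical 'i' code that raises OverflowError instead of truncating. Note that it bakes in exactly the fragile assumption listed under the Cons:

    PyObject *subobj;
    int i = 0, last = INT_MAX;

    if (!PyArg_ParseTuple(args, "O|ii:count", &subobj, &i, &last)) {
        if (!PyErr_ExceptionMatches(PyExc_OverflowError))
            return NULL;   /* a genuine argument error */
        /* Fragile: assume it was 'last' that overflowed (arguments
           are converted left to right) and clamp it to the largest
           possible string length. */
        PyErr_Clear();
        last = INT_MAX;
    }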

Trent Mick wrote:
Here is another one... I don't really like it, because I think that silent truncations are a bad idea, but to make things "just work" it would help:

* Change PyArg_ParseTuple() to truncate out-of-range Python integers passed to an "i"-marked variable: map range(INT_MAX+1, LONG_MAX+1) to INT_MAX, and ditto for negative values. sys.maxint would then turn out as INT_MAX in all those cases where "i" is used as parser marker.

With this truncation, passing sys.maxint as the default argument for length parameters would "just work" :-). The more radical alternative would be changing the Python object length fields to long -- I don't think this is practical though (and probably not really needed unless you intend to work with 3GB strings ;).

--
Marc-Andre Lemburg
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

On Sat, Apr 29, 2000 at 02:50:07PM +0200, M.-A. Lemburg wrote:
> The more radical alternative would be changing the Python object length fields to long --

If we *do* make this change, however, say "size_t" please, rather than long, because on Win64 sizeof(long) < sizeof(size_t) == sizeof(void*).

> I don't think this is practical though (and probably not really needed unless you intend to work with 3GB strings ;).

I know that 3GB+ strings are not likely to come along, but if the length fields were size_t, it would clean up the implicit downcasts from size_t to int that you currently get on calls to strlen and the like on 64-bit systems.

Trent

[Guido]
For people unfamiliar w/ the details, let's be explicit: the "i" code implicitly converts a Python int (== a C long) to a C int (which has no visible Python counterpart). Overflow is not detected, so this is broken on the face of it.
It probably should raise an overflow exception instead.
Definitely.
Yup. Seems inevitable.

[MAL]
I understand this, but hate it. I'd rather get rid of the user-visible distinction between the two int types already there, not add yet a third artificial int limit.

[Guido]
Hm, I'm not so sure. It would be much better if passing sys.maxint would just WORK... Since that's what people have been doing so far.
[Trent Mick]
There's stronger reason than that: string_count's "start" and "end" arguments are documented as "interpreted as in slice notation", and slice notation with out-of-range indices is well defined in all cases:

    The semantics for a simple slicing are as follows. The primary must
    evaluate to a sequence object. The lower and upper bound expressions,
    if present, must evaluate to plain integers; defaults are zero and the
    sequence's length, respectively. If either bound is negative, the
    sequence's length is added to it. The slicing now selects all items
    with index k such that i <= k < j where i and j are the specified
    lower and upper bounds. This may be an empty sequence. It is not an
    error if i or j lie outside the range of valid indexes (such items
    don't exist so they aren't selected).

(From the Ref Man's section "Slicings")

That is, what string_count should do is perfectly clear already (or will be, when you read that two more times <wink>). Note that you need to distinguish between positive and negative overflow, though!
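Rendered as arithmetic (a standalone illustration, not CPython source), the quoted rules normalize a bound k against a sequence of length n like this:

    #include <stdio.h>
    #include <limits.h>

    /* Slice-bound normalization per the Ref Man text quoted above:
       negative bounds get the length added, then everything is
       clamped into [0, n]; out-of-range bounds are never an error. */
    static long
    normalize_bound(long k, long n)
    {
        if (k < 0)
            k += n;     /* count from the end */
        if (k < 0)
            k = 0;      /* clamp below */
        if (k > n)
            k = n;      /* clamp above */
        return k;
    }

    int main(void)
    {
        long n = 9;  /* len("abcabcabc") */
        printf("%ld\n", normalize_bound(LONG_MAX, n));  /* 9 */
        printf("%ld\n", normalize_bound(-100L, n));     /* 0 */
        return 0;
    }

Which is why passing sys.maxint as an 'end' argument should simply mean "end of the string".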
Absolutely, but they *all* follow from what "sequence slicing" is *always* supposed to do in case of out-of-bounds indices.
Luckily <wink>, it's not that simple: exactly the same treatment needs to be given to both the optional "start" and "end" arguments, and in every function that accepts optional slice indices. So you write one utility function to deal with all that, called in case PyArg_ParseTuple raises an overflow error.
The code this would break is already broken <0.1 wink>.
This doesn't solve anything -- users can (& do) pass sys.maxint explicitly. That's part of what Guido means by "since that's what people have been doing so far".
Anyone using sys.maxint to mean INT_MAX is fatally confused; passing sys.maxint as a slice index is not an instance of that confusion, it's just relying on the documented behavior of out-of-bounds slice indices.
I hate this (see above).
It's not an assumption, it's an explicit part of the design: PyObject_VAR_HEAD declares ob_size to be an int. This leads to strain for sure, partly because the *natural* limit on sizes is derived from malloc (which indeed takes neither int nor long, but size_t), and partly because Python exposes no corresponding integer type. I vaguely recall that this was deliberate, with the *intent* being to save space in object headers on the upcoming 128-bit KSR machines <wink>.
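For reference, the declaration Tim is pointing at looked like this in the Include/object.h of the era (paraphrased, not quoted verbatim):

    #define PyObject_VAR_HEAD \
        PyObject_HEAD \
        int ob_size;          /* Number of items in variable part */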
While this assumption is not likely to be proven false, it technically could be on 64-bit systems.
Well, Guido once said he would take away Python's recursion overflow checks just as soon as machines came with infinite memory <wink> -- 2Gb is a reasonable limit for string length, and especially if it's a tradeoff against increasing the header size for all string objects (it's almost certainly more important to cater to oodles of small strings on smaller machines than to one or two gigantic strings on huge machines).
Every place the code implicitly downcasts from size_t to int is plainly broken today, so we *should* get warnings. Python has been sloppy about this! In large part it's because Python was written before ANSI C, and size_t simply wasn't supported at the time. But as with all software, people rarely go back to clean up; it's overdue (just be thankful you're not working on the Perl source <0.9 wink>).
I support this as a long-term solution, perhaps for P3K. Note that ob_refcnt should also be declared size_t (no overflow-checking is done on refcounts today; the theory is that a refcount can't possibly get bigger than the total # of pointers in the system, and so if you declare ob_refcnt to be large enough to hold that, refcount overflow is impossible; but, in theory, this argument has been broken on every machine where sizeof(int) < sizeof(void*)).
I apologize for not being succinct.
Humbug -- it was a wonderfully concise posting, Trent! The issues are simply messy.
Note that I am volunteering here. Opinions and guidance please.
Alas, the first four letters in "guidance" spell out four-fifths of the only one able to give you that. opinions-are-fun-but-don't-count<wink>-ly y'rs - tim

I've just posted a simple patch to the patches list which implements the idea I posted earlier: silent truncation still takes place, but in a somewhat more natural way ;-) ...

    /* Silently truncate to INT_MAX/INT_MIN so that passing sys.maxint
       to 'i' parser markers works on 64-bit platforms just like it
       does on 32-bit platforms.  Overflow errors are not raised. */
    else if (ival > INT_MAX)
        ival = INT_MAX;
    else if (ival < INT_MIN)
        ival = INT_MIN;
    *p = ival;

--
Marc-Andre Lemburg
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

I posted a couple of patches a couple of days ago to correct the string methods implementing slice-like optional parameters (count, find, index, rfind, rindex) so that they properly clamp slice index values to the valid range (any PyInt or PyLong value is acceptable now). In fact, the slice_index() function that was being used in ceval.c was reused (renamed to _PyEval_SliceIndex). The other patch changes PyArg_ParseTuple's 'b', 'h', and 'i' formatters to raise an OverflowError if they overflow.

Trent

p.s. I thought I would whine here for some more attention. Who needs that Unicode stuff anyway. ;-)
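For readers who want the shape of that fix, here is a minimal sketch in the spirit of the renamed helper (illustrative only, written against the 1.6-era C API; the real _PyEval_SliceIndex differs in detail):

    #include "Python.h"
    #include <limits.h>

    /* Accept any int or long and clamp it into the C int range,
       preserving the sign so that out-of-range slice indices keep
       their documented out-of-bounds meaning. */
    static int
    clamp_slice_index(PyObject *v, int *pi)
    {
        long x;

        if (PyInt_Check(v))
            x = PyInt_AsLong(v);
        else if (PyLong_Check(v)) {
            x = PyLong_AsLong(v);
            if (x == -1 && PyErr_Occurred()) {
                PyObject *zero;
                int sign;

                if (!PyErr_ExceptionMatches(PyExc_OverflowError))
                    return 0;          /* some other conversion error */
                PyErr_Clear();
                /* Too big for a C long: decide by the sign alone. */
                zero = PyLong_FromLong(0L);
                if (zero == NULL)
                    return 0;
                sign = PyObject_Compare(v, zero);
                Py_DECREF(zero);
                x = (sign < 0) ? (long)INT_MIN : (long)INT_MAX;
            }
        }
        else {
            PyErr_SetString(PyExc_TypeError,
                            "slice index must be an integer");
            return 0;
        }

        /* Clamp values that fit in a long but not in an int -- the
           sizeof(long) > sizeof(int) case this thread is about. */
        if (x > INT_MAX)
            x = INT_MAX;
        else if (x < INT_MIN)
            x = INT_MIN;
        *pi = (int)x;
        return 1;                      /* success */
    }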
participants (4)
-
Guido van Rossum
-
M.-A. Lemburg
-
Tim Peters
-
Trent Mick