
Guido van Rossum wrote:
The email below is a serious bug report. A quick analysis shows that UserString.count() calls the count() method on a string object, which calls PyArg_ParseTuple() with the format string "O|ii". The 'i' format code truncates integers. It probably should raise an overflow exception instead. But that would still cause the test to fail -- just in a different way (more explicit). Then the string methods should be fixed to use long ints instead -- and then something else would probably break...
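For concreteness, the failure looks something like this on a 64-bit platform where sizeof(long) == 8 but sizeof(int) == 4, so sys.maxint == 2**63 - 1 no longer fits in a C int (a minimal sketch; the exact wrong answer depends on how str.count() interprets the truncated, now-negative end index):

    import sys
    from UserString import UserString

    u = UserString('abcabcabc')
    # UserString.count() defaults end to sys.maxint and passes it down to
    # the string count() method, whose "O|ii" format truncates 2**63 - 1
    # to its low 32 bits (-1 as a signed int), so 'end' no longer means
    # "end of string".
    print u.count('abc')    # expected 3; the truncated end position
                            # yields a wrong count instead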
MAL wrote:
All uses in stringobject.c and unicodeobject.c use INT_MAX together with integers, so there's no problem on that side of the fence ;-)
Since strings and Unicode objects (as well as most, if not all, other builtin sequence types) use integers to describe the length of the object, the correct default value should thus be something like sys.maxlen, which then gets set to INT_MAX.
I'd suggest adding sys.maxlen and then modifying UserString.py, re.py and sre_parse.py accordingly.
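A minimal sketch of what the UserString.py change might look like (sys.maxlen is hypothetical here; the interpreter would set it to INT_MAX):

    import sys

    class UserString:
        def __init__(self, string):
            self.data = string
        # The default end position is now the largest possible object
        # length rather than the largest Python integer, so it always
        # fits in a C int and the 'i' format code never truncates it.
        def count(self, sub, start=0, end=sys.maxlen):
            return self.data.count(sub, start, end)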
Guido wrote:
Hm, I'm not so sure. It would be much better if passing sys.maxint would just WORK... Since that's what people have been doing so far.
Possible solutions (I give 4 of them):

1. The 'i' format code could raise an overflow exception, and the
PyArg_ParseTuple() call in string_count() could catch it and truncate to
INT_MAX (reasoning that any overflow of the end position of a string can be
bound to INT_MAX, because that is the limit for any string in Python).

Pros:
- This "would just WORK" for usage of sys.maxint.

Cons:
- This overflow exception catching should then reasonably be propagated to
  other similar functions (like string.endswith(), etc).
- We have to assume that the exception raised in the
  PyArg_ParseTuple(args, "O|ii:count", &subobj, &i, &last) call is for the
  second integer (i.e. 'last'). This is subtle and ugly.

Pro or Con:
- Do we want to start raising overflow exceptions for other conversion
  formats (i.e. 'b' and 'h' and 'l'; the latter *can* overflow on Win64,
  where sizeof(long) < sizeof(void*))? I think this is a good idea in
  principle but may break code (even if it *does* identify bugs in that
  code).

2. Just change the definitions of the UserString methods to pass a variable
length argument list instead of default value parameters. For example,
change UserString.count() from:

    def count(self, sub, start=0, end=sys.maxint):
        return self.data.count(sub, start, end)

to:

    def count(self, *args):
        return self.data.count(*args)

The result is that the default value for 'end' is now set by string_count()
rather than by the UserString implementation:
    >>> from UserString import UserString
    >>> s = 'abcabcabc'
    >>> u = UserString('abcabcabc')
    >>> s.count('abc')
    3
    >>> u.count('abc')
    3
Pros:
- Easy change.
- Fixes the immediate bug.
- This is a safer way to copy the string behaviour in UserString anyway
  (is it not?).

Cons:
- Does not fix the general problem of the (common?) usage of sys.maxint to
  mean INT_MAX rather than the actual LONG_MAX (this matters on 64-bit
  Unices).
- The UserString code is no longer really self-documenting.

3. As MAL suggested: add something like sys.maxlen (set to INT_MAX), which
makes explicit the logical difference from sys.maxint (set to LONG_MAX);
see the short illustration at the end of this message:
- sys.maxint == "the largest value a Python integer can hold"
- sys.maxlen == "the largest value for the length of an object in Python
  (e.g. length of a string, length of an array)"

Pros:
- More explicit, in that it separates two distinct meanings for sys.maxint
  (which now makes a difference on 64-bit Unices).
- The code changes should be fairly straightforward.

Cons:
- Places in the code that still use sys.maxint where they should use
  sys.maxlen will unknowingly be overflowing ints and bringing about this
  very bug.
- Something else for coders to know about.

4. Add something like sys.maxlen, but set it to SIZET_MAX (cf. the ANSI
size_t type). It is probably not a biggie, but Python currently makes the
assumption that strings never exceed INT_MAX in length. While this
assumption is not likely to be proven false, it technically could be on
64-bit systems. As well, when you start compiling on Win64 (where
sizeof(int) == sizeof(long) < sizeof(size_t)) you are going to be annoyed
by hundreds of warnings about implicit casts from size_t (64 bits) to int
(32 bits) for every strlen, str*, fwrite, and sizeof call that you make.

Pros:
- IMHO logically more correct.
- Might clean up some subtle bugs.
- Cleans up annoying and disconcerting warnings.
- Will probably mean less pain down the road as 64-bit systems (esp. Win64)
  become more prevalent.

Cons:
- Lots of coding changes.
- As Guido said: "and then something else would probably break". (Though on
  current 32-bit systems there should be no effective change; only 64-bit
  systems should be affected and, I would hope, the effect would be a
  cleanup.)

I apologize for not being succinct. Note that I am volunteering here.
Opinions and guidance please.

Trent
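P.S. To make the distinction in option 3 concrete, on a 64-bit Unix (sizeof(long) == 8, sizeof(int) == 4) the two values would compare like this (sys.maxlen being hypothetical, of course):

    >>> import sys
    >>> sys.maxint    # LONG_MAX: largest value a Python plain integer holds
    9223372036854775807
    >>> sys.maxlen    # INT_MAX: largest length of any Python object
    2147483647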