[Python-Dev] issues with int/long on 64bit platforms - eg stringobject (PR#306)

Mon, 1 May 2000 02:31:05 -0400

[Guido]
> The email below is a serious bug report.  A quick analysis
> shows that UserString.count() calls the count() method on a string
> object, which calls PyArg_ParseTuple() with the format string "O|ii".
> The 'i' format code truncates integers.

For people unfamiliar w/ the details, let's be explicit:  the "i" code
implicitly converts a Python int (== a C long) to a C int (which has no
visible Python counterpart).  Overflow is not detected, so this is broken on
the face of it.

> It probably should raise an overflow exception instead.

Definitely.

> But that would still cause the test to fail -- just in a different
> way (more explicit).  Then the string methods should be fixed to
> use long ints instead -- and then something else would probably break...

Yup.  Seems inevitable.

[MAL]
> Since strings and Unicode objects use integers to describe the
> length of the object (as well as most if not all other
> builtin sequence types), the correct default value should
> thus be something like sys.maxlen which then gets set to
> INT_MAX.
>
> I'd suggest adding sys.maxlen and the modifying UserString.py,
> re.py and sre_parse.py accordingly.

I understand this, but hate it.  I'd rather get rid of the user-visible
distinction between the two int types already there, not add yet a third
artificial int limit.

[Guido]
> Hm, I'm not so sure.  It would be much better if passing sys.maxint
> would just WORK...  Since that's what people have been doing so far.

[Trent Mick]
> Possible solutions (I give 4 of them):
>
> 1. The 'i' format code could raise an overflow exception and the
> PyArg_ParseTuple() call in string_count() could catch it and truncate to
> INT_MAX (reasoning that any overflow of the end position of a
> string can be bound to INT_MAX because that is the limit for any string
> in Python).

There's stronger reason than that:  string_count's "start" and "end"
arguments are documented as "interpreted as in slice notation", and slice
notation with out-of-range indices is well defined in all cases:

    The semantics for a simple slicing are as follows. The primary
    must evaluate to a sequence object. The lower and upper bound
    expressions, if present, must evaluate to plain integers; defaults
    are zero and the sequence's length, respectively. If either bound
    is negative, the sequence's length is added to it. The slicing now
    selects all items with index k such that i <= k < j where i and j
    are the specified lower and upper bounds. This may be an empty
    sequence. It is not an error if i or j lie outside the range of
    valid indexes (such items don't exist so they aren't selected).

(From the Ref Man's section "Slicings")  That is, what string_count should
do is perfectly clear already (or will be, when you read that two more times
<wink>).  Note that you need to distinguish between positive and negative
overflow, though!

> Pros:
> - This "would just WORK" for usage of sys.maxint.
>
> Cons:
> -  This overflow exception catching should then reasonably be
> propagated to other similar functions (like string.endswith(), etc).

Absolutely, but they *all* follow from what "sequence slicing* is *always*
supposed to do in case of out-of-bounds indices.

> - We have to assume that the exception raised in the
> PyArg_ParseTuple(args, "O|ii:count", &subobj, &i, &last) call is for
> the second integer (i.e. 'last'). This is subtle and ugly.

Luckily <wink>, it's not that simple:  exactly the same treatment needs to
be given to both the optional "start" and "end" arguments, and in every
function that accepts optional slice indices.  So you write one utility
function to deal with all that, called in case PyArg_ParseTuple raises an
overflow error.

> Pro or Con:
> - Do we want to start raising overflow exceptions for other conversion
> formats (i.e. 'b' and 'h' and 'l', the latter *can* overflow on
> Win64 where sizeof(long) < size(void*))? I think this is a good idea
> in principle but may break code (even if it *does* identify bugs in that
> code).

The code this would break is already broken <0.1 wink>.

> 2. Just change the definitions of the UserString methods to pass
> a variable length argument list instead of default value parameters.
> For example change UserString.count() from:
>
>     def count(self, sub, start=0, end=sys.maxint):
>         return self.data.count(sub, start, end)
>
> to:
>
>     def count(self, *args)):
>         return self.data.count(*args)
>
> The result is that the default value for 'end' is now set by
> string_count() rather than by the UserString implementation:
> ...

This doesn't solve anything -- users can (& do) pass sys.maxint explicitly.
That's part of what Guido means by "since that's what people have been doing
so far".

> ...
> Cons:
> - Does not fix the general problem of the (common?) usage of sys.maxint to
> mean INT_MAX rather than the actual LONG_MAX (this matters on 64-bit
> Unices).

Anyone using sys.maxint to mean INT_MAX is fatally confused; passing
sys.maxint as a slice index is not an instance of that confusion, it's just
relying on the documented behavior of out-of-bounds slice indices.

> 3. As MAL suggested: add something like sys.maxlen (set to INT_MAX) with
> breaks the logical difference with sys.maxint (set to LONG_MAX):
> ...

I hate this (see above).

> ...
> 4. Add something like sys.maxlen, but set it to SIZET_MAX (c.f.
> ANSI size_t type). It is probably not a biggie, but Python currently
> makes the assumption that string never exceed INT_MAX in length.

It's not an assumption, it's an explicit part of the design:
PyObject_VAR_HEAD declares ob_size to be an int.  This leads to strain for
sure, partly because the *natural* limit on sizes is derived from malloc
(which indeed takes neither int nor long, but size_t), and partly because
Python exposes no corresponding integer type.  I vaguely recall that this
was deliberate, with the *intent* being to save space in object headers on
the upcoming 128-bit KSR machines <wink>.

> While this assumption is not likely to be proven false it technically
> could be on 64-bit systems.

Well, Guido once said he would take away Python's recursion overflow checks
just as soon as machines came with infinite memory <wink> -- 2Gb is a
reasonable limit for string length, and especially if it's a tradeoff
against increasing the header size for all string objects (it's almost
certainly more important to cater to oodles of small strings on smaller
machines than to one or two gigantic strings on huge machines).

> As well, when you start compiling on Win64 (where sizeof(int) ==
> sizeof(long) < sizeof(size_t)) then you are going to be annoyed
> by hundreds of warnings about implicit casts from size_t (64-bits) to
> int (32-bits) for every strlen, str*, fwrite, and sizeof call that
> you make.

Every place the code implicitly downcasts from size_t to int is plainly
broken today, so we *should* get warnings.  Python has been sloppy about
this!  In large part it's because Python was written before ANSI C, and
size_t simply wasn't supported at the time.  But as with all software,
people rarely go back to clean up; it's overdue (just be thankful you're not
working on the Perl source <0.9 wink>).

> Pros:
> - IMHO logically more correct.
> - Might clean up some subtle bugs.
> - Cleans up annoying and disconcerting warnings.
> - Will probably mean less pain down the road as 64-bit systems
> (esp. Win64) become more prevalent.
>
> Cons:
> - Lot of coding changes.
> - As Guido said: "and then something else would probably break".
> (Though, on currently 32-bits system, there should be no effective
> change).  Only 64-bit systems should be affected and, I would hope,
> the effect would be a clean up.

I support this as a long-term solution, perhaps for P3K.  Note that
ob_refcnt should also be declared size_t (no overflow-checking is done on
refcounts today; the theory is that a refcount can't possibly get bigger
than the total # of pointers in the system, and so if you declare ob_refcnt
to be large enough to hold that, refcount overflow is impossible; but, in
theory, this argument has been broken on every machine where sizeof(int) <
sizeof(void*)).

> I apologize for not being succinct.

Humbug -- it was a wonderfully concise posting, Trent!  The issues are
simply messy.

> Note that I am volunteering here.  Opinions and guidance please.

Alas, the first four letters in "guidance" spell out four-fifths of the only
one able to give you that.

opinions-are-fun-but-don't-count<wink>-ly y'rs  - tim