64-bit sequence and buffer protocol
I'm posting to this list to again open discussion on a problem in current Python: a C int is used in both the Python sequence protocol and the Python buffer protocol. A C int is typically only 4 bytes long, while many applications (mmap, for example) would like to access sequences much larger than can be addressed with 32 bits.

There are two aspects to this problem:

1) Some 64-bit systems still define a C int as 4 bytes long, so even in-memory sequence objects cannot be addressed through the sequence protocol.

2) Even 32-bit systems have occasion to sequence a more abstract object (perhaps not all in memory) that requires more than 32 bits to address.

These are the solutions I've seen:

1) Convert all C ints to Py_LONG_LONG in the sequence and buffer protocols.

2) Add new C APIs that mirror the current ones but use Py_LONG_LONG instead of int.

3) Change Python to use the mapping protocol first (even for slicing) when both the mapping and sequence protocols are defined.

4) Tell writers of such large objects not to use the sequence and/or buffer protocols, and instead to use the mapping protocol and a different "bytes" object (which they would currently have to implement themselves, ignoring the buffer protocol C API).

What is the opinion of people on this list about how to fix the problem? I believe Martin was looking at the problem and had told Perry Greenfield he was "fixing it." Apparently Perry and he talked at the recent PyCon, and Martin said the problem is harder than he had initially thought. It would be good to document what some of these problems are so that the community can assist in fixing them.

-Travis O.
Travis Oliphant wrote:
What is the opinion of people on this list about how to fix the problem? I believe Martin was looking at the problem and had told Perry Greenfield he was "fixing it." Apparently Perry and he talked at the recent PyCon, and Martin said the problem is harder than he had initially thought. It would be good to document what some of these problems are so that the community can assist in fixing them.
I have put a patch on http://sourceforge.net/tracker/index.php?func=detail&aid=1166195&group_id=5470&atid=305470 which solves this problem (eventually); this is the pre-PyCon version; I'll update it to the post-PyCon version later this month. I'll also write a PEP with the proposed changes.

1) Convert all C-ints to Py_LONG_LONG in the sequence and buffer protocols.

This would be bad, since it would cause an overhead on 32-bit systems. Instead, I propose to change all C ints holding indexes and sizes to Py_ssize_t.
2) Add new C-API's that mirror the current ones which use Py_LONG_LONG instead of the current int.
I'll propose a type flag with which each type can indicate whether it expects indexes and sizes as int or as Py_ssize_t. However, there are more issues. In particular, PyArg_ParseTuple needs to change to expect a different index type for selected "i" arguments; it also needs to change to possibly store a different type into the length of an "s#" argument. This still doesn't support types that exceed 2**31 elements on a 32-bit system (or 2**63 elements on a 64-bit system); authors of such types would have to follow the advice in item 4:
4) Tell writers of such large objects not to use the sequence and/or buffer protocols, and instead to use the mapping protocol and a different "bytes" object (which they would currently have to implement themselves, ignoring the buffer protocol C-API).
Regards, Martin
participants (2)
-
"Martin v. Löwis"
-
Travis Oliphant