Deprecate the buffer object?

I happened to be looking at the buffer API today and I came across this posting from Guido: http://mail.python.org/pipermail/python-dev/2000-October/009974.html Over the years there has been a lot of discussion about the buffer API and the buffer object. The general consensus seems to be that the buffer API is not ideal but nonetheless useful. The buffer object, OTOH, is considered fundamentally broken and should be removed. Does anyone object to deprecating the 'buffer' builtin? Eventually we could remove the buffer object completely. Neil

On Tuesday 28 October 2003 11:09 pm, Neil Schemenauer wrote:
Is that about RW buffers specifically? Because I _have_ used R/O buffers in production code -- when I had a huge string already in memory, and needed various largish substrings of it at different but overlapping times, without paying the overhead to copy them as slicing would have done. Having 'buffer' as a built-in was quite minor though -- considering the number of times I have used it, importing some module to get at it would have been perfectly acceptable, perhaps preferable. If the buffer interface stays but the function completely disappears, I guess it won't be too hard for me to recreate it in a tiny extension module, but it's not quite clear to me why I should need to. R/W buffers I've never used in production, though. I do recall once (at the very beginning of my Python usage) using an array's buffer_info method as a Q&D way to do some interfacing to C, but that was before ctypes, which I think is what I'd use now. Alex
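For readers on current Python, the use case Alex describes (several large, overlapping windows onto one big in-memory string, with no copying) can be sketched roughly as follows. This is only an illustration: it assumes Python 3 and uses memoryview as a stand-in for the Python 2 buffer() builtin discussed in this thread, and the sizes and variable names are made up.

```python
# Python 3 sketch; memoryview stands in for the Python 2 buffer() builtin.
big = bytes(range(256)) * 4096          # about 1 MiB of immutable data

view = memoryview(big)
first = view[1000:500_000]              # a window onto `big`, no bytes copied
second = view[400_000:900_000]          # overlaps `first`, still no copy

# Both windows refer back to the same underlying object.
assert first.obj is big and second.obj is big

# A copy is made only when the bytes are materialised explicitly.
chunk = bytes(second[:16])
```

Ordinary slicing of `big` itself would copy roughly half a megabyte for each window; the views only record an offset and a length.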

On Tue, Oct 28, 2003 at 11:23:18PM +0100, Alex Martelli wrote:
Is that about RW buffers specifically?
No.
That's a useful thing to be able to do and the buffer object does it in a safe way. I guess that's part of the reason why the buffer object has managed to survive as long as it has. Neil

Raymond Hettinger wrote:
I trust you will preserve the functionality though? I have used the buffer() function to achieve great leaps in performance in applications which send data from a string buffer to a socket. Slicing kills performance in this scenario once buffer sizes get beyond a few 100 kB. Below is an example from an asyncore.dispatcher subclass. This code sends chunks with maximum size, without ever slicing the buffer.
    def handle_write(self):
        if self.buffer_offset:
            sent = self.send(buffer(self.buffer, self.buffer_offset))
        else:
            sent = self.send(self.buffer)
        self.buffer_offset += sent
        if self.buffer_offset == len(self.buffer):
            del self.buffer
Troels
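The same send loop can be approximated in Python 3, where buffer() no longer exists, by offering a memoryview of the unsent tail. The sketch below is an assumption-laden illustration rather than Troels's code: ChunkSender and fake_send are hypothetical names, and fake_send merely imitates a socket that accepts at most 4096 bytes per call.

```python
class ChunkSender:
    """Hypothetical stand-in for the asyncore.dispatcher subclass."""

    def __init__(self, send, data):
        self.send = send            # callable: bytes-like -> number of bytes accepted
        self.buffer = data
        self.buffer_offset = 0

    def handle_write(self):
        # Offer everything not yet sent; slicing a memoryview copies nothing.
        remaining = memoryview(self.buffer)[self.buffer_offset:]
        self.buffer_offset += self.send(remaining)
        return self.buffer_offset == len(self.buffer)   # True once all data is sent


received = bytearray()

def fake_send(chunk):
    """Pretend socket that accepts at most 4096 bytes per call."""
    accepted = chunk[:4096]
    received.extend(accepted)
    return len(accepted)


payload = b"x" * 10_000
sender = ChunkSender(fake_send, payload)
calls = 1
while not sender.handle_write():
    calls += 1
```

With a real socket, `send` would be the socket's own send method, which accepts any bytes-like object.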

Looks like I was a little quick sending out that message. I found more recent postings from Tim and Guido: http://mail.python.org/pipermail/python-dev/2002-July/026408.html http://mail.python.org/pipermail/python-dev/2002-July/026413.html Slippery little beast, that buffer object. :-) I'm going to go ahead and add deprecation warnings. Neil

Neil Schemenauer <nas-python@python.ca> writes:
I used it once in combination with ctypes as buffer(a-ctypes-object) to get at the raw memory which ctypes objects expose via the buffer API. But it was pretty obscure, and I would happily have used an external module. Like this:
Basically, the only serious use case is getting the bytes out of objects which support the buffer API but which *don't* offer a "get the bytes out" interface. I've just realised that I could, however, also do this via the array module:
There's an extra copy in there. Disaster :-) Nope, I don't think there's a good use case after all... Paul -- This signature intentionally left blank

Neil Schemenauer <nas-python@python.ca>:
The buffer object, OTOH, is considered fundamentally broken and should be removed.
There's no doubt that the current implementation of it is unacceptably dangerous, but I haven't yet seen an argument that convinces me that it couldn't be fixed if desired. I don't think the *idea* of a buffer object is fundamentally flawed, and it seems potentially useful (although I must admit that I haven't found a need for it myself yet).
Greg Ewing, Computer Science Dept, +--------------------------------------+
University of Canterbury,          | A citizen of NewZealandCorp, a       |
Christchurch, New Zealand          | wholly-owned subsidiary of USA Inc.  |
greg@cosc.canterbury.ac.nz         +--------------------------------------+

On Wed, Oct 29, 2003 at 02:41:54PM +1300, Greg Ewing wrote:
Okay. Perhaps I am missing something but would fixing it be as simple as adding another field to the tp_as_buffer struct?
    /* references returned by the buffer functions are valid while
     * the object remains alive */
    #define PyBuffer_FLAG_SAFE 1
Then in stringobject.c (and elsewhere as appropriate):
    static PyBufferProcs buffer_as_buffer = {
        (getreadbufferproc)buffer_getreadbuf,
        (getwritebufferproc)buffer_getwritebuf,
        (getsegcountproc)buffer_getsegcount,
        (getcharbufferproc)buffer_getcharbuf,
        PyBuffer_FLAG_SAFE,
    };
Then change bufferobject so that it can only be created from objects that set PyBuffer_FLAG_SAFE. Neil

I don't know if this is enough, but if it is, I'd recommend adding the flag bit to tp_flags rather than extending the buffer structure (since you'd need to allocate an extra bit for tp_flags anyway to indicate the longer buffer struct). --Guido van Rossum (home page: http://www.python.org/~guido/)

Neil Schemenauer
As the essence of the solution, I think that sounds good! I think that the following should also be done:
* Update the docs for the buffer functions to indicate that these are *short term* pointers, that are not guaranteed once *any* Python code is called.
* Add new public buffer functions with "LongTerm" in the name (and docs that buffer is valid as long as the object). These check the flag as you propose.
* Buffer object uses new LongTerm buffer functions.
It points out that the buffer object itself is less at fault than the interface. I'm trying to short-circuit bugs in external extension modules that use the buffer functions without realizing the subtle assumptions made. Mark.

Neil Schemenauer:
That's completely different from what I had in mind, which was:
(1) Keep a reference to the base object in the buffer object, and
(2) Use the buffer API to fetch a fresh pointer from the base object each time it's needed.
Is there some reason that still wouldn't be safe enough? Greg Ewing
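Greg's two points can be illustrated with a small Python 3 sketch. FreshView is a hypothetical class, not anything in CPython, and memoryview is used here in place of the C-level buffer API: the wrapper stores only the base object and an offset, and asks the base for its memory again on every access instead of caching a pointer.

```python
class FreshView:
    """Hypothetical buffer-like wrapper that stores no raw pointer."""

    def __init__(self, base, offset=0):
        self.base = base            # (1) keep a reference to the base object
        self.offset = offset

    def tobytes(self):
        # (2) ask the base object for its memory each time it is needed,
        # and release the view again before returning.
        with memoryview(self.base) as view:
            return bytes(view[self.offset:])


data = bytearray(b"hello world")
tail = FreshView(data, offset=6)
before = tail.tobytes()

data.extend(b"!!!")                 # resizing may move the underlying memory
after = tail.tobytes()              # still valid: nothing stale was kept
```

Because no view is held between calls, the bytearray stays free to resize, and each call sees whatever the base object currently contains.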

On Thu, Oct 30, 2003 at 03:30:18PM +1300, Greg Ewing wrote:
That's completely different from what I had in mind, which was:
(1) Keep a reference to the base object in the buffer object, and
It already does this.
I don't see any problem with that. It's probably a better solution since it doesn't require a new flag and it lets you create buffers that reference objects like arrays. Neil

On Thu, Oct 30, 2003 at 07:21:01AM -0800, Neil Schemenauer wrote:
I don't see any problem with that.
Okay, small problem. The hash function for the buffer object is brain damaged, in more ways than one actually:
    >>> import array
    >>> a = array.array('c')
    >>> b = buffer(a)
    >>> hash(b)
    Program received signal SIGSEGV, Segmentation fault.
    [Switching to Thread 16384 (LWP 5311)]
    buffer_hash (self=0x40262d00) at Objects/bufferobject.c:241
    241             x = *p << 7;
    (gdb) l
    236                     return -1;
    237             }
    238
    239             len = self->b_size;
    240             p = (unsigned char *) self->b_ptr;
    241             x = *p << 7;
    242             while (--len >= 0)
    243                     x = (1000003*x) ^ *p++;
    244             x ^= self->b_size;
    245             if (x == -1)
    (gdb) p len
    $1 = 0
    (gdb) p *p
    Cannot access memory at address 0x0
The buffer object has 'b_readonly' and 'b_hash' fields. If readonly is true then the object is considered hashable and once computed the hash is stored in the 'hash' field. The problem is that the buffer API doesn't provide a way to determine 'readonly'. The absence of getwritebuf() is not the same thing as being read only. The buffer() builtin always sets the 'readonly' flag! I don't think the buffer hash method can depend on the data being pointed to. There is nothing in the buffer interface that tells you if the data is immutable. The hash method could return the id of the buffer object but I'm not sure how useful that would be. Neil
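The distinction Neil draws between "read only" and "immutable" can be shown with a short Python 3 sketch. It assumes Python 3.8 or later (for memoryview.toreadonly()) and is an analogy to the old buffer object, not a reproduction of it: the view refuses writes, yet the data it exposes still changes underneath it, which is why a hash computed from that data could go stale.

```python
backing = bytearray(b"spam")
readonly_view = memoryview(backing).toreadonly()

# The view itself refuses writes...
try:
    readonly_view[0] = ord("S")
except TypeError:
    write_refused = True
else:
    write_refused = False

# ...but the owner of the memory can still change the data.
backing[0] = ord("S")
seen_through_view = bytes(readonly_view)
```

After the last line, `seen_through_view` reflects the modified contents even though every write through the view was rejected.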

Neil Schemenauer <nas-python@python.ca>:
How about just having it call the hash method of the base object? If the base object is hashable, this will do something reasonable, and if not, it will fail in the expected way. Greg Ewing

Greg:
That would work, be less intrusive, and allow all existing code to work unchanged. My only concern is that it does not go anywhere towards fixing the buffer interface itself. To my mind, the buffer object is fairly useless and I never use it - so I really don't care. However, I do have real world uses for the buffer interface. The most compelling is for async IO in the Windows world - I need to pass a buffer Windows will fill in the background, and the buffer interface provides the solution - except for the flaws that also drip down to the buffer object, and leaves us with this problem. Thus, my preference is to fix the buffer object by fixing the interface as much as possible. Here is a sketch of a solution, incorporating both Neil and Greg's ideas:
* Type object gets a new flag - TP_HAS_BUFFER_INFO, corresponding to a new 'getbufferinfoproc' slot in the PyBufferProcs structure (note - a function pointer, not static flags as Neil suggested)
* New function 'getbufferinfoproc' returns a bitmask - Py_BUFFER_FIXED is one (and currently the only) flag that can be returned.
* New buffer functions PyObject_AsFixedCharBuffer, etc. These check the new flag (and a type lacking TP_HAS_BUFFER_INFO is assumed to *not* be fixed)
* Buffer object keeps a reference to the existing object (as it does now). Its getbufferinfoproc delegates to the underlying object.
* Buffer object *never* keeps a pointer to the buffer - only to the object. Functions like tp_hash always re-fetch the buffer on demand. The buffer returned by the buffer object is then guaranteed to be as reliable as the underlying object. (This may be a semantic issue with hash(), but conceptually seems fine. Potential solution here - add Py_BUFFER_READONLY as a buffer flag, then hash() semantics could do the right thing)
After all that, I can't help noticing Greg's solution would be far less work <wink>, Mark.

Hang on, didn't we already go through the process of designing a new buffer interface not long ago? What was decided about the results of that? Greg Ewing

"Mark Hammond" <mhammond@skippinet.com.au> writes:
I think that is a different issue entirely. While it may be interesting and important, can we at least try to keep them separate? Cheers, mwh -- This is the fixed point problem again; since all some implementors do is implement the compiler and libraries for compiler writing, the language becomes good at writing compilers and not much else! -- Brian Rogoff, comp.lang.functional

"Mark Hammond" <mhammond@skippinet.com.au> writes:
Well, there are two things people complain about
a) the buffer INTERFACE
b) the buffer OBJECT
are the issues plaguing both the same? I wasn't under the impression they were. It's entirely possible I'm wrong, though. Cheers, mwh -- [1] If you're lost in the woods, just bury some fibre in the ground carrying data. Fairly soon a JCB will be along to cut it for you - follow the JCB back to civilsation/hitch a lift. -- Simon Burr, cam.misc

On Fri, Oct 31, 2003 at 09:21:06AM +1100, Mark Hammond wrote:
What does this flag mean? To my mind, there are several different types of memory buffers and the buffer interface does not distinguish between all of them. Is the size and position of the buffer fixed? Is the buffer immutable (it may be readonly by the buffer object but writable via some other mechanism)? The first question can be avoided by using Greg's idea of always refreshing the size and position. The second question cannot be answered using the current interface. I suppose if the buffer is immutable then it is implied that its size and position are fixed.
You can't use the base object's hash if the buffer has an explicit size or offset. Neil
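One way to read Neil's objection is sketched below in Python 3. WindowView is a hypothetical class invented for illustration, and the equality rule (compare by exposed contents) is an assumption about how such an object would behave: if the hash is delegated to the base object while an offset is in play, two views that expose identical bytes no longer share a hash.

```python
class WindowView:
    """Hypothetical view over part of an immutable bytes object."""

    def __init__(self, base, offset=0):
        self.base = base
        self.offset = offset

    def contents(self):
        return self.base[self.offset:]

    def __eq__(self, other):
        return isinstance(other, WindowView) and self.contents() == other.contents()

    def __hash__(self):
        # The delegation idea taken literally: the offset is ignored.
        return hash(self.base)


whole = WindowView(b"abc")
window = WindowView(b"xabc", offset=1)

same_contents = whole == window     # both expose b"abc"
# hash(whole) is hash(b"abc") while hash(window) is hash(b"xabc"),
# so the two equal views would almost certainly hash differently.
```

Objects that compare equal are expected to hash equal, so plain delegation only seems safe when the view covers the whole base object.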

On Thu, Oct 30, 2003 at 03:30:18PM +1300, Greg Ewing wrote:
I've just uploaded a (rough) patch that implements your idea. http://www.python.org/sf/832058 Neil

participants (11)
- Alex Martelli
- Greg Ewing
- Guido van Rossum
- Jp Calderone
- Mark Hammond
- Michael Hudson
- Neil Schemenauer
- Paul Moore
- Raymond Hettinger
- Thomas Heller
- Troels Walsted Hansen