idea for data-type (data-format) PEP

Thanks for all the comments that have been given on the data-type (data-format) PEP. I'd like opinions on an idea for revising the PEP I have.
What if we look at this from the angle of trying to communicate data-formats between different libraries (not change the way anybody internally deals with data-formats).
For example, ctypes has one way to internally deal with data-formats (using type objects).
NumPy/Numeric has a way to internally deal with data-formats (using PyArray_Descr * structure -- in Numeric it's just a C-structure but in NumPy it's fleshed out further and also a Python object called the data-type).
Numarray has a way to internally deal with data-formats (using type objects).
The array module has a way to internally deal with data-formats (using a PyArray_Descr * structure -- and character codes to select one).
The struct module deals with data-formats using character codes.
The PIL deals with data-formats using image modes.
PyVTK deals with data-formats using its own internal objects.
MPI deals with data-formats using its own MPI_DataType structures.
This list goes on and on.
What I claim is needed in Python (to make it better glue) is a standard way to communicate data-format information between these extensions. Then, you don't have to build in support for all the different ways data-formats are represented by different libraries. Each library only has to be able to translate its own representation to the standard way that Python uses to represent data-formats.
How is this goal going to be achieved? That is the real purpose of the data-type object I previously proposed.
Nick showed that there are two (non-orthogonal) ways to think about this goal.
1) We could define a special string-syntax (or list syntax) that covers every special case. The array interface specification goes this direction and it requires no new Python types. This could also be seen as an extension of the "struct" module to allow for nested structures, etc.
2) We could define a Python object that specifically carries data-format information.
There is also a third way (or really 2b) that has been mentioned: take one of the extensions and use what it does to communicate data-format between objects and require all other extensions to conform to that standard.
The problem with 2b is that what works inside an extension module may not be the best option when it comes to communicating across multiple extension modules. Certainly none of the extension modules have argued that case effectively.
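To make options 1 and 2 concrete, here is a rough sketch. The spellings below are purely illustrative (NumPy's dtype is shown as an existing example of an option-2 object, not as the proposal itself):

    # Option 1: a string/list syntax, array-interface style -- no new type.
    fmt = [('r', '|u1'), ('g', '|u1'), ('b', '|u1')]   # an RGB pixel record

    # Option 2: a first-class Python object carrying the same information.
    import numpy
    dt = numpy.dtype([('r', 'u1'), ('g', 'u1'), ('b', 'u1')])
    print(dt.itemsize)   # 3 -- bytes per element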
Does that explain the goal of what I'm trying to do better?

Travis E. Oliphant wrote:
Thanks for all the comments that have been given on the data-type (data-format) PEP. I'd like opinions on an idea for revising the PEP I have.
- We could define a special string-syntax (or list syntax) that covers
every special case. The array interface specification goes this direction and it requires no new Python types. This could also be seen as an extension of the "struct" module to allow for nested structures, etc.
- We could define a Python object that specifically carries data-format
information.
Does that explain the goal of what I'm trying to do better?
In other words, what I'm saying is I really want a PEP that does this. Could we have a discussion about what the best way to communicate data-format information across multiple extension modules would look like? I'm not saying my (pre-)PEP is best. The point of putting it out there in its infant state is to get the discussion rolling, not to claim I've got all the answers.
It seems like there are enough people who have dealt with this issue that we ought to be able to put something very useful together that would make Python much better glue.
-Travis

Travis E. Oliphant wrote:
Travis E. Oliphant wrote:
Thanks for all the comments that have been given on the data-type (data-format) PEP. I'd like opinions on an idea for revising the PEP I have.
- We could define a special string-syntax (or list syntax) that covers
every special case. The array interface specification goes this direction and it requires no new Python types. This could also be seen as an extension of the "struct" module to allow for nested structures, etc.
- We could define a Python object that specifically carries data-format
information.
Does that explain the goal of what I'm trying to do better?
In other words, what I'm saying is I really want a PEP that does this. Could we have a discussion about what the best way to communicate data-format information across multiple extension modules would look like? I'm not saying my (pre-)PEP is best. The point of putting it out there in its infant state is to get the discussion rolling, not to claim I've got all the answers.
IIUC, so far the 'data-object' carries information about the structure of the data it describes.
Couldn't it go a step further and also have some functionality, like converting the data into a Python object and back?
This is what the ctypes SETFUNC and GETFUNC functions do, and what is also implemented in the struct module...
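For instance, in the struct module the format string already plays both roles: it describes the layout and drives the conversion between raw bytes and Python objects:

    import struct
    fmt = '<hhl'                       # two int16 values and one int32
    raw = struct.pack(fmt, 1, 2, 3)    # Python objects -> raw bytes
    print(struct.unpack(fmt, raw))     # raw bytes -> Python objects: (1, 2, 3)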
Thomas

Thomas Heller wrote:
IIUC, so far the 'data-object' carries information about the structure of the data it describes.
Couldn't it go a step further and also have some functionality, like converting the data into a Python object and back?
Yes, I had considered having it do that. That's why the setfunc and getfunc functions were written the way they were.
-teo

Travis E. Oliphant wrote:
What if we look at this from the angle of trying to communicate data-formats between different libraries (not change the way anybody internally deals with data-formats).
ISTM that this is not the right approach. If the purpose of the datatype object is just to communicate the layout in the extended buffer interface, then it should be specified in that PEP, rather than being stand-alone, and it should not pretend to serve any other purpose. Or, if it does have uses independent of the buffer extension: what are those uses?
- We could define a special string-syntax (or list syntax) that covers
every special case. The array interface specification goes this direction and it requires no new Python types. This could also be seen as an extension of the "struct" module to allow for nested structures, etc.
- We could define a Python object that specifically carries data-format
information.
To distinguish between these, convenience of usage (and of construction) should be taken into account. At least for the preferred alternative, but better for the runners-up too, there should be a demonstration of how existing modules would have to be changed to support it (e.g. for the struct and array modules as producers; not sure what good consumer code would be).
Suppose I wanted to change all RGB values to a gray value (i.e. R=G=B), what would the C code look like that does that? (it seems now that the primary purpose of this machinery is image manipulation)
The problem with 2b is that what works inside an extension module may not be the best option when it comes to communicating across multiple extension modules. Certainly none of the extension modules have argued that case effectively.
I think there are two ways in which one option could be "better" than the other: it might be more expressive, and it might be easier to use. For the second aspect (ease of use), there are two sub-aspects: it might be easier to produce, or it might be easier to consume.
Regards, Martin

Martin v. Löwis wrote:
Travis E. Oliphant wrote:
What if we look at this from the angle of trying to communicate data-formats between different libraries (not change the way anybody internally deals with data-formats).
ISTM that this is not the right approach. If the purpose of the datatype object is just to communicate the layout in the extended buffer interface, then it should be specified in that PEP, rather than being stand-alone, and it should not pretend to serve any other purpose.
I'm actually quite fine with that. If that is the consensus, then I will just go that direction. ISTM though that since we are going to the trouble inside the extended buffer protocol, we might as well be as complete as we know how to be.
Or, if it does have uses independent of the buffer extension: what are those uses?
So that NumPy and ctypes and audio libraries and video libraries and database libraries and image-file format libraries can communicate about data-formats using the same expressions (in Python).
Maybe we decide that ctypes-based expressions are a very good way to communicate about those things in Python for all other packages. If that is the case, then I argue that we ought to change the array module, and the struct module to conform (of course keeping the old ways for backward compatibility) and set the standard for other packages to follow.
What problem do you have in defining a standard way to communicate about binary data-formats (not just images)? I still can't figure out why you are so resistant to the idea. MPI had to do it.
- We could define a special string-syntax (or list syntax) that covers
every special case. The array interface specification goes this direction and it requires no new Python types. This could also be seen as an extension of the "struct" module to allow for nested structures, etc.
- We could define a Python object that specifically carries data-format
information.
To distinguish between these, convenience of usage (and of construction) should be taken into account. At least for the preferred alternative, but better for the runners-up too, there should be a demonstration of how existing modules would have to be changed to support it (e.g. for the struct and array modules as producers; not sure what good consumer code would be).
Absolutely --- if something is to be made useful across packages and from Python. This is where the discussion should take place. The struct and array modules would both be consumers as well, so that in the struct module you could specify your structure in terms of the standard data-representation, and in the array module you could specify your array in terms of the standard representation instead of using "character codes".
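As a sketch only (the DataType class below is a stand-in I am inventing for illustration, not the proposed API), consuming a shared description might look like:

    import struct

    class DataType:                        # stand-in for the standard object
        def __init__(self, fields):       # fields: list of (name, code) pairs
            self.fields = fields
        def as_struct_format(self):       # translate into struct's language
            return '<' + ''.join(code for _name, code in self.fields)

    dt = DataType([('x', 'd'), ('y', 'd')])     # two float64 fields
    s = struct.Struct(dt.as_struct_format())    # struct "consumes" the object
    print(s.size)                               # 16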
Suppose I wanted to change all RGB values to a gray value (i.e. R=G=B), what would the C code look like that does that? (it seems now that the primary purpose of this machinery is image manipulation)
For me it is definitely not image manipulation that is the only purpose (or even the primary purpose). It's just an easy one to explain (most people understand images). But, I think this question is actually irrelevant (IMHO). To me, how you change all RGB values to gray would depend on the library you are using, not on how data-formats are expressed.
Maybe we are still misunderstanding each other.
If you really want to know, in NumPy it might look like this:
Python code:
img['r'] = img['g']
img['b'] = img['g']
C-code:
use the Python C-API to do essentially the same thing as above, or, to do img['r'] = img['g']:
dtype = img->descr;
r_field = PyDict_GetItemString(dtype,'r');
g_field = PyDict_GetItemString(dtype,'g');
r_field_dtype = PyTuple_GET_ITEM(r_field, 0);
r_field_offset = PyTuple_GET_ITEM(r_field, 1);
g_field_dtype = PyTuple_GET_ITEM(g_field, 0);
g_field_offset = PyTuple_GET_ITEM(g_field, 1);
obj = PyArray_GetField(img, g_field, g_field_offset);
Py_INCREF(r_field);
PyArray_SetField(img, r_field, r_field_offset, obj);
But, I still don't see how that is relevant to the question of how to represent the data-format to share that information across two extensions.
The problem with 2b is that what works inside an extension module may not be the best option when it comes to communicating across multiple extension modules. Certainly none of the extension modules have argued that case effectively.
I think there are two ways in which one option could be "better" than the other: it might be more expressive, and it might be easier to use. For the second aspect (ease of use), there are two sub-aspects: it might be easier to produce, or it might be easier to consume.
I like this as a means to judge a data-format representation. Let me summarize to see if I understand:
1) Expressive: does it express every data-format you might want or need?
2) Ease of use:
   a) Production: how easy is it to create the representation?
   b) Consumption: how easy is it to interpret the representation?
-Travis

Travis E. Oliphant wrote:
Or, if it does have uses independent of the buffer extension: what are those uses?
So that NumPy and ctypes and audio libraries and video libraries and database libraries and image-file format libraries can communicate about data-formats using the same expressions (in Python).
I find that puzzling. In what way can the specification of a data type enable communication? Don't you need some kind of protocol for it (i.e. operations to be invoked)? Also, do you mean that these libraries can communicate with each other? Or with somebody else? If so, with whom?
What problem do you have in defining a standard way to communicate about binary data-formats (not just images)? I still can't figure out why you are so resistant to the idea. MPI had to do it.
I'm afraid of "dead" specifications, things whose only motivation is that they look nice. They are just clutter. There are a few examples of this already in Python, like the character buffer interface or the multi-segment buffers.
As for MPI: It didn't just independently define a data types system. Instead, it did that, *and* specified the usage of the data types in operations such as MPI_SEND. It is very clear what the scope of this data description is, and what the intended usage is.
Without specifying an intended usage, it is impossible to evaluate whether the specification meets its goals.
Absolutely --- if something is to be made useful across packages and from Python. This is where the discussion should take place. The struct and array modules would both be consumers as well, so that in the struct module you could specify your structure in terms of the standard data-representation, and in the array module you could specify your array in terms of the standard representation instead of using "character codes".
Ok, that would be a new usage: I expected that datatype instances always come in pairs with memory allocated and filled according to the description. If you are proposing to modify/extend the API of the struct and array modules, you should say so somewhere (in a PEP).
Suppose I wanted to change all RGB values to a gray value (i.e. R=G=B), what would the C code look like that does that? (it seems now that the primary purpose of this machinery is image manipulation)
For me it is definitely not image manipulation that is the only purpose (or even the primary purpose). It's just an easy one to explain (most people understand images). But, I think this question is actually irrelevant (IMHO). To me, how you change all RGB values to gray would depend on the library you are using, not on how data-formats are expressed.
Maybe we are still misunderstanding each other.
I expect that the primary readers/users of the PEP would be people who have to write libraries: i.e. people implementing NumPy, struct, array, and people who implement algorithms that operate on data. So usability of the specification is a matter of how easy it is to *write* a library that does perform the image manipulation.
If you really want to know, in NumPy it might look like this:
Python code:
img['r'] = img['g']
img['b'] = img['g']
That's not what I'm asking. Instead, what does the NumPy code look like that gets invoked on these read-and-write operations? Does it only use the void* pointing to the start of the data, and the datatype object? If not, what would C code look like that only has the void* and the datatype object?
dtype = img->descr;
In this code, is descr a datatype object? ...
r_field = PyDict_GetItemString(dtype,'r');
... I guess not, because apparently, it is a dictionary, not a datatype object.
But, I still don't see how that is relevant to the question of how to represent the data-format to share that information across two extensions.
Well, if NumPy gets the data from a different module, it can't assume there is a descr object that is a dictionary. Instead, it must perform these operations just by using the datatype object. What else is the purpose of sharing the information, if not to use it to access the data?
Regards, Martin

Martin v. Löwis wrote:
Travis E. Oliphant wrote:
Or, if it does have uses independent of the buffer extension: what are those uses?
So that NumPy and ctypes and audio libraries and video libraries and database libraries and image-file format libraries can communicate about data-formats using the same expressions (in Python).
I find that puzzling. In what way can the specification of a data type enable communication? Don't you need some kind of protocol for it (i.e. operations to be invoked)? Also, do you mean that these libraries can communicate with each other? Or with somebody else? If so, with whom?
What is puzzling? I've just specified the extended buffer protocol as something concrete that data-format objects are shared through. That's on the C-level. I gave several examples of where such sharing would be useful.
Then, I gave examples in Python of how sharing data-formats would also be useful so that modules could support the same means to construct data-formats (instead of struct using strings, array using typecodes, ctypes using its type-objects, and NumPy using dtype objects).
What problem do you have in defining a standard way to communicate about binary data-formats (not just images)? I still can't figure out why you are so resistant to the idea. MPI had to do it.
I'm afraid of "dead" specifications, things whose only motivation is that they look nice. They are just clutter. There are a few examples of this already in Python, like the character buffer interface or the multi-segment buffers.
O.K. I can understand that concern. But, all you do is make struct, array, and ctypes support the same data-format specification (by support I mean have a way to "consume" and "produce" the data-format object to and from the natural representation that they have internally) and you are guaranteed it won't "die." In fact, what would be ideal is for the PIL, NumPy, CVXOpt, PyMedia, PyGame, pyre, pympi, PyVoxel, etc., etc. (there really are many modules that should be able to talk to each other more easily) to all support the same data-format representations. Then, you don't have to learn everybody's re-invention of the same concept whenever you encounter a new library that does something with binary data.
How much time do you actually spend with binary data (sound, video, images, just plain numbers from a scientific experiment) and trying to use multiple Python modules to manipulate it? If you don't spend much time, then I can understand why you don't understand the need.
As for MPI: It didn't just independently define a data types system. Instead, it did that, *and* specified the usage of the data types in operations such as MPI_SEND. It is very clear what the scope of this data description is, and what the intended usage is.
Without specifying an intended usage, it is impossible to evaluate whether the specification meets its goals.
What is not understood about the intended usage in the extended buffer protocol? What is not understood about the intended usage of giving the array and struct modules a uniform way to represent binary data?
Ok, that would be a new usage: I expected that datatype instances always come in pairs with memory allocated and filled according to the description.
To me that is the most important usage, but it's not the *only* one.
If you are proposing to modify/extend the API of the struct and array modules, you should say so somewhere (in a PEP).
Sure, I understand that. But, if there is no data-format object, then there is no PEP to "extend the struct and array modules" to support it. Chicken before the egg, and all that.
I expect that the primary readers/users of the PEP would be people who have to write libraries: i.e. people implementing NumPy, struct, array, and people who implement algorithms that operate on data.
Yes, but not only them. If it's a default way to represent data, then *users* of those libraries that "consume" the representation would also benefit by learning a standard.
So usability of the specification is a matter of how easy it is to *write* a library that does perform the image manipulation.
If you really want to know, in NumPy it might look like this:
Python code:
img['r'] = img['g']
img['b'] = img['g']
That's not what I'm asking. Instead, what does the NumPy code look like that gets invoked on these read-and-write operations? Does it only use the void* pointing to the start of the data, and the datatype object? If not, what would C code look like that only has the void* and the datatype object?
dtype = img->descr;
In this code, is descr a datatype object? ...
Yes. But, I have a mistake later...
r_field = PyDict_GetItemString(dtype,'r');
Actually it should read PyDict_GetItemString(dtype->fields, 'r'). The r_field is a tuple (data-type object, offset). The fields attribute is (currently) a Python dictionary.
... I guess not, because apparently, it is a dictionary, not
a datatype object.
Sorry for the confusion.
But, I still don't see how that is relevant to the question of how to represent the data-format to share that information across two extensions.
Well, if NumPy gets the data from a different module, it can't assume there is a descr object that is a dictionary. Instead, it must perform these operations just by using the datatype object.
Right. I see. Again, I made a mistake in the code.
img->descr is a data-type object in NumPy.
img->descr->fields is a dictionary of fields keyed by 'name' and returning a tuple (data-type object, offset)
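For example, in current NumPy this layout is directly visible:

    import numpy
    dt = numpy.dtype([('r', 'u1'), ('g', 'u1'), ('b', 'u1')])
    print(dt.fields['r'])   # (dtype('uint8'), 0) -- (data-type object, offset)
    print(dt.fields['g'])   # (dtype('uint8'), 1)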
But, the other option (especially for code already written) would be to just convert the data-format specification into its own internal representation. This is the case that I was thinking about when I said it didn't matter how the library operated on the data.
If new code wanted to use the data-format object as *the* internal representation, then it would matter.
What else is the purpose of sharing the information, if not to use it to access the data?
Of course. I'm sorry my example was incorrect. I guess this falls under the category of "ease of use".
If the data-type format can *be* the internal representation, then ease of use is *optimal* because no translation is required. In my ideal world that's the way it would be. But, even if we can't get there immediately, we can at least define a standard for communication.

Travis Oliphant wrote:
r_field = PyDict_GetItemString(dtype,'r');
Actually it should read PyDict_GetItemString(dtype->fields, 'r'). The r_field is a tuple (data-type object, offset). The fields attribute is (currently) a Python dictionary.
Ok. This seems to be missing in the PEP. The section titled "Attributes" seems to talk about Python-level attributes. Apparently, you are suggesting that there is also a C-level API, lower than PyObject_GetAttrString, so that you can write dtype->fields, instead of having to write PyObject_GetAttrString(dtype, "fields").
If it is indeed the intent that this kind of access is available for datatype objects, then the PEP should specify it. Notice that it would be uncommon for a type in Python: most types have getter functions (such as PyComplex_RealAsDouble, rather than specifying direct access through obj->cval.real).
Going now back to your original code (and assuming proper adjustments):
dtype = img->descr;
r_field = PyDict_GetItemString(dtype,'r');
g_field = PyDict_GetItemString(dtype,'g');
r_field_dtype = PyTuple_GET_ITEM(r_field, 0);
r_field_offset = PyTuple_GET_ITEM(r_field, 1);
g_field_dtype = PyTuple_GET_ITEM(g_field, 0);
g_field_offset = PyTuple_GET_ITEM(g_field, 1);
obj = PyArray_GetField(img, g_field, g_field_offset);
Py_INCREF(r_field);
PyArray_SetField(img, r_field, r_field_offset, obj);
In this code, where is PyArray_GetField coming from? What does it do? If I wanted to write this code from scratch, what should I write instead? Since this is all about a flat memory block, I'm surprised I need "true" Python objects for the field values in there.
But, the other option (especially for code already written) would be to just convert the data-format specification into its own internal representation.
Ok, so your assumption is that consumers already have their own machinery, in which case ease-of-use would be the question how difficult it is to convert datatype objects into the internal representation.
Regards, Martin

Martin v. Löwis wrote:
Travis Oliphant wrote:
r_field = PyDict_GetItemString(dtype,'r');
Actually it should read PyDict_GetItemString(dtype->fields, 'r'). The r_field is a tuple (data-type object, offset). The fields attribute is (currently) a Python dictionary.
Ok. This seems to be missing in the PEP.
Yeah, actually quite a bit is missing, because I wanted to float the idea for discussion before "getting the details perfect" (which of course they wouldn't be if it was just my input producing them).
In this code, where is PyArray_GetField coming from?
This is a NumPy Specific C-API. That's why I was confused about why you wanted me to show how I would do it.
But, what you are actually asking is how would another application use the data-type information to do the same thing using the data-type object and a pointer to memory. Is that correct?
This is a reasonable thing to request. And your example is a good one. I will use the PEP to explain it.
Ultimately, the code you are asking for will have to have some kind of dispatch table for different binary code depending on the actual data-types being shared (unless all that is needed is a copy in which case just the size of the element area can be used). In my experience, the dispatch table must be present for at least the "simple" data-types. The data-types built up from there can depend on those.
In NumPy, the data-type objects have function pointers to accomplish all the things NumPy does quickly. So, each data-type object in NumPy points to a function-pointer table and the NumPy code defers to it to actually accomplish the task (much like Python really).
Not all libraries will support working with all data-types. If they don't support it, they just raise an error indicating that it's not possible to share that kind of data.
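A toy Python sketch of the dispatch idea (all names here are invented for illustration):

    import struct

    def _get_int32(buf, off):
        return struct.unpack_from('<i', buf, off)[0]

    def _get_float64(buf, off):
        return struct.unpack_from('<d', buf, off)[0]

    DISPATCH = {'int32': _get_int32, 'float64': _get_float64}

    def get_element(typename, buf, off):
        # raise for data-types this consumer does not support
        if typename not in DISPATCH:
            raise TypeError("cannot share data of type %r" % typename)
        return DISPATCH[typename](buf, off)

    print(get_element('int32', struct.pack('<ii', 7, 8), 4))   # 8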
What does it do? If I wanted to write this code from scratch, what should I write instead? Since this is all about a flat memory block, I'm surprised I need "true" Python objects for the field values in there.
Well, actually, the block could be "strided" as well.
So, you would write something that gets the pointer to the memory and then gets the extended information (dimensionality, shape, and strides, and data-format object). Then, you would get the offset of the field you are interested in from the start of the element (it's stored in the data-format representation).
Then do a memory copy from the right place. (Using the array iterator in NumPy you can actually do it without getting the shape and strides information first, but I'm holding off on that PEP until an N-d array is proposed for Python.) I'll write something like that as an example and put it in the PEP for the extended buffer protocol.
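In pure Python, and ignoring strides for simplicity, such a consumer-side field copy might look like this sketch:

    # Copy field 'g' onto field 'r' given only a raw buffer plus the shared
    # description: the itemsize and each field's (offset, size).
    buf = bytearray(b'\x10\x20\x30' * 4)               # 4 packed RGB pixels
    itemsize = 3
    fields = {'r': (0, 1), 'g': (1, 1), 'b': (2, 1)}   # name -> (offset, size)

    r_off, r_size = fields['r']
    g_off, g_size = fields['g']
    for i in range(len(buf) // itemsize):
        base = i * itemsize
        buf[base + r_off : base + r_off + r_size] = \
            buf[base + g_off : base + g_off + g_size]
    print(buf.hex())   # every pixel now reads 20 20 30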
-Travis

Travis Oliphant wrote:
In NumPy, the data-type objects have function pointers to accomplish all the things NumPy does quickly.
If the datatype object is to be extracted and made a stand-alone feature, that might need to be refactored.
Perhaps there could be a facility for traversing a datatype with a user-supplied dispatch table?
-- Greg

Martin v. Löwis <martin@v.loewis.de> writes:
I'm afraid of "dead" specifications, things whose only motivation is that they look nice. They are just clutter. There are a few examples of this already in Python, like the character buffer interface or the multi-segment buffers.
Multi-segment buffers are only dead because standard library modules do not support them. I often work with text data that is represented as an array of strings. I would love to implement a multi-segment buffer interface on top of that data and be able to do a full text regular expression search without having to concatenate into one big string, but python's re module would not take a multi-segment buffer.

Alexander Belopolsky wrote:
Multi-segment buffers are only dead because standard library modules do not support them.
That, in turn, is because nobody has contributed code to make that work. My guess is that people either don't need it, or find it too difficult to implement.
In any case, it is an important point that such a specification is likely dead if the standard library doesn't support it throughout, from start. So for this PEP, the same criterion likely applies: it's not sufficient to specify an interface, one also has to specify (and then implement) how that affects modules and types of the standard library.
I often work with text data that is represented as an array of strings. I would love to implement a multi-segment buffer interface on top of that data and be able to do a full text regular expression search without having to concatenate into one big string, but python's re module would not take a multi-segment buffer.
If you are curious, try adding such a feature to re some time. I expect that implementing it would be quite involved. I wonder what Fredrik Lundh thinks about providing such a feature.
Regards, Martin

On Nov 4, 2006, at 3:15 AM, Martin v. Löwis wrote:
Alexander Belopolsky wrote:
Multi-segment buffers are only dead because standard library modules do not support them.
That, in turn, is because nobody has contributed code to make that work. My guess is that people either don't need it, or find it too difficult to implement.
Last time I tried to contribute code related to buffer protocol, it was rejected with little discussion
http://sourceforge.net/tracker/index.php?func=detail&aid=1539381&group_id=5470&atid=305470
that patch implemented two features: it enabled creation of read-write buffer objects and added a readinto method to StringIO.
The resolution was:
""" The file object's readinto method is not meant for public use, so adding the method to StringIO is not a good idea. """
The read-write buffer part was not discussed, but I guess the resolution would be that buffer objects are deprecated, so adding features to them is not a good idea.
If you are curious, try adding such a feature to re some time. I expect that implementing it would be quite involved. I wonder what Fredrik Lundh thinks about providing such a feature.
I would certainly invest some time into that if that feature had a chance of being accepted. At the moment I feel that anything related to buffers or buffer protocol is met with strong opposition. I think the opposition is mostly fueled by the belief that buffer objects are "unsafe" and buffer protocol is deprecated. None of these premises is correct AFAIK.

Travis E. Oliphant <oliphant.travis@ieee.org> writes:
What if we look at this from the angle of trying to communicate data-formats between different libraries (not change the way anybody internally deals with data-formats).
For example, ctypes has one way to internally deal with data-formats (using type objects).
NumPy/Numeric has a way to internally deal with data-formats (using PyArray_Descr * structure -- in Numeric it's just a C-structure but in NumPy it's fleshed out further and also a Python object called the data-type).
Ctypes and NumPy's Array Interface address two different needs. When using ctypes, producers of type information are at the Python level, but Array Interface information is produced in C code. It is very convenient to write c_int*2*3 to specify a 2x3 integer matrix in Python, but it is much easier to set type code to 'i' and populate the shape array with integers in C.
Consumers of type information are at the C level in both ctypes and Array Interface applications, but in the case of ctypes, users are not expected to write C code. It is typical for an array interface consumer to switch on the type code. Single character (or numeric) type codes are much more convenient than verbose type names in this case.
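A concrete side-by-side sketch (the '<i4' typestr assumes a little-endian platform):

    from ctypes import c_int

    # ctypes: convenient to produce at the Python level.
    IntMatrix = (c_int * 3) * 2        # a 2x3 integer matrix type
    m = IntMatrix()

    # array-interface style: trivial to produce from C code -- just a
    # type code plus a shape array.
    typestr, shape = '<i4', (2, 3)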
I have used Array Interface extensively, but only for simple types and I have studied ctypes from Python level, but not from C level.
I think the standard data type description object should build on the strengths of both approaches.
I believe the first step should be to agree on a representation of simple types. Just an agreement on the standard type codes that every module could use would be a great improvement. (Personally, I don't need anything else from array interface.)
I don't like letter codes, however. I would prefer to use an enum at the C level and verbose names at Python level.
I would also like to mention one more difference between NumPy datatypes and ctypes that I did not see discussed. In ctypes arrays of different shapes are represented using different types. As a result, if the object exporting its buffer is resized, the datatype object cannot be reused, it has to be replaced.
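For example:

    from ctypes import c_int
    A2, A3 = c_int * 2, c_int * 3
    print(A2 is A3)                    # False: each shape is its own type
    print(A2.__name__, A3.__name__)    # c_int_Array_2 c_int_Array_3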

Alexander Belopolsky wrote:
I would also like to mention one more difference between NumPy datatypes and ctypes that I did not see discussed. In ctypes arrays of different shapes are represented using different types. As a result, if the object exporting its buffer is resized, the datatype object cannot be reused, it has to be replaced.
That's also an interesting issue for the datatypes PEP: are datatype objects meant to be immutable?
This is particularly interesting for the extended buffer protocol: how long can one keep the data you get from bf_getarrayinfo?
Also, how does the memory management work for the results?
Regards, Martin

On 11/1/06, "Martin v. Löwis" <martin@v.loewis.de> wrote:
That's also an interesting issue for the datatypes PEP: are datatype objects meant to be immutable?
That's a question for Travis, but I would think that they would be immutable at the Python level, but mutable at the C level. In Travis' approach array size is not stored in the datatype, so I don't see much need to modify datatype objects in-place. It may be reasonable to allow adding fields to a record, but I don't have enough experience with that to comment.
This is particularly interesting for the extended buffer protocol: how long can one keep the data you get from bf_getarrayinfo?
I think your question is limited to the shape and strides outputs, because dataformat is a reference-counted PyObject (and the PEP should specify whether it is a borrowed reference).
And the answer is the same as for the data from bf_getreadbuffer/bf_getwritebuffer. AFAIK, the existing buffer protocol does not answer this question, delegating it to the extension module writers who provide objects exporting their buffers.
Also, how does the memory management work for the results?
I think it is implied that all pointers are borrowed references. I could not find any discussion of memory management in the current buffer protocol documentation.
This is a good question. It may be the case that the shape or stride information is not available as Py_intptr_t array inside the object that wants to export its memory buffer. This is not theoretical, I have a 64-bit application that uses objects that keep their size information in a 32-bit int.
BTW, I think the memory management issues with the buffer objects have been resolved at some point. Any lessons to learn from that?

Alexander Belopolsky wrote:
That's a question for Travis, but I would think that they would be immutable at the Python level, but mutable at the C level.
Well, anything's mutable at the C level -- the question is whether you *should* be mutating it.
I think the datatype object should almost certainly be immutable. Since it's separated from the data it's describing, it's possible for one datatype object to describe multiple chunks of data. So you wouldn't want to mutate one in case it's being used for something else that you don't know about.
-- Greg

On Nov 2, 2006, at 8:04 PM, Greg Ewing wrote:
I think the datatype object should almost certainly be immutable. Since it's separated from the data it's describing, it's possible for one datatype object to describe multiple chunks of data. So you wouldn't want to mutate one in case it's being used for something else that you don't know about.
I only mentioned that the datatype object would be mutable at the C level because changing the object instead of deleting and creating a new one could be a valid optimization in situations where the object is known not to be shared.
My main concern was that in ctypes the size of an array is a part of the datatype object and this seems to be redundant if used for the buffer protocol. Buffer protocol already reports the size of the buffer as a return value of bf_get*buffer methods.
In another post, Greg Ewing wrote:
numpy.array(array.array('d',[1,2,3]))
and leave out the buffer object altogether.
I think the buffer object in his example was just a placeholder for "some arbitrary object that supports the buffer interface", not necessarily another NumPy array.
Yes, thanks. In fact numpy.array(array.array('d',[1,2,3])) already works in numpy (I think because numpy knows about the standard library array type). In my example, I wanted to use an object that supports buffer protocol and little else.
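For example (memoryview is used here as a modern stand-in for the old buffer object):

    import array
    import numpy

    a = array.array('d', [1, 2, 3])
    print(numpy.array(a))                               # numpy knows array.array
    # an object that supports the buffer protocol and little else:
    print(numpy.frombuffer(memoryview(a), dtype='d'))   # [1. 2. 3.]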

Alexander Belopolsky wrote:
My main concern was that in ctypes the size of an array is a part of the datatype object and this seems to be redundant if used for the buffer protocol. Buffer protocol already reports the size of the buffer as a return value of bf_get*buffer methods.
I think what would happen if you were interoperating with ctypes is that you would get a datatype describing one element of the array, together with the shape information, and construct a ctypes array type from that. And going the other way, from a ctypes array type you would extract an element datatype and a shape.
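A rough sketch of both directions, using the _type_ and _length_ attributes that ctypes array types carry:

    from ctypes import c_ubyte

    # element datatype + shape -> ctypes array type
    shape = (480, 640)
    ArrayType = c_ubyte
    for dim in reversed(shape):
        ArrayType = ArrayType * dim    # builds (c_ubyte * 640) * 480

    # ctypes array type -> element datatype + shape
    t, dims = ArrayType, []
    while hasattr(t, '_length_'):
        dims.append(t._length_)
        t = t._type_
    print(t, tuple(dims))              # c_ubyte and (480, 640)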
-- Greg

On Nov 2, 2006, at 9:25 PM, Greg Ewing wrote:
I think what would happen if you were interoperating with ctypes is that you would get a datatype describing one element of the array, together with the shape information, and construct a ctypes array type from that. And going the other way, from a ctypes array type you would extract an element datatype and a shape.
Correct, assuming Travis' approach is accepted. However I understood that Martin was suggesting that ctypes types should be used to describe the structure of the buffer. Thus a buffer containing 10 integers would report its datatype as c_int*10.
I was probably mistaken and Martin was suggesting the same as you. In this case the extended buffer protocol would still use a different model from ctypes, and the "don't reinvent the wheel" argument goes away.

Alexander Belopolsky wrote:
In ctypes arrays of different shapes are represented using different types. As a result, if the object exporting its buffer is resized, the datatype object cannot be reused, it has to be replaced.
I was thinking about that myself the other day. I was thinking that both ctypes and NumPy arrays + proposed_type_descriptor provide a way of describing an array of binary data and providing Python-level access to that data. So a NumPy array and an instance of a ctypes type that happens to describe an array are very similar things. I was wondering whether they could be unified somehow.
But then I realised that the ctypes array is a fixed-size array, whereas NumPy's notion of an array is rather more flexible. So they're not really the same thing after all.
However, the *elements* of the array are fixed size in both cases, so the respective descriptions of the element type could potentially have something in common.
My current take on the situation is that Travis is probably right about ctypes types being too cumbersome for what he has in mind.
The next best thing would be to make them interoperate: have an easy way of getting a ctypes type corresponding to a given data layout description and vice versa.
-- Greg