PEP: Adding data-type objects to Python

Travis E. Oliphant wrote:
Two packages need to share a chunk of memory (the package authors do not know each other and only have Python as a common reference). They both want to describe that the memory they are sharing has some underlying binary structure.
As a quick sanity check, please tell me where I went off track.
It sounds to me like you are assuming that:
(1) The memory chunk represents a single object (probably an array of some sort).
(2) That subchunks can themselves be described by a (single?) repeating C struct.
(3) You can't just use the C header, since you want this at run-time.
(4) It would be enough if you could say:
This is an array of 500 elements that look like
struct { int simple; struct { char name[30]; char addr[45]; int amount; } nested; }
(5) But is it not acceptable to use Martin's suggested ctypes equivalent of (building out from the inside):
class nested(Structure):
    _fields_ = [("name", c_char*30), ("addr", c_char*45), ("amount", c_long)]

class struct(Structure):
    _fields_ = [("simple", c_int), ("nested", nested)]

struct * 500
If I misunderstood, could you show me where?
If I did understand correctly, could you expand on why (5) is unacceptable, given that ctypes is now in the core? (New and unknown, I would understand -- but that is also true of any datatype proposal, for the people who haven't already used it. I suspect that any differences from NumPy would be a source of pain for those who *have* used NumPy, but following NumPy exactly is ... not much simpler than the above.)
Or are you just saying that "anything with a buffer interface should also have a datatype object describing the layout in a standard way"? If so, that makes sense, but I'm inclined to prefer the ctypes way, so that most people won't ever have to worry about things like endianness/strides/Fortran layout.
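For reference, Martin's ctypes sketch quoted above is runnable once the names are imported; a minimal self-contained version (the array-type check at the end is just illustrative):

```python
from ctypes import Structure, c_char, c_int, c_long

class nested(Structure):
    _fields_ = [("name", c_char * 30),
                ("addr", c_char * 45),
                ("amount", c_long)]

class struct(Structure):
    _fields_ = [("simple", c_int),
                ("nested", nested)]

# "struct * 500" produces a new ctypes *type*: an array of 500 such structs.
ArrayType = struct * 500
```

Note that the description itself is spread over several Python type objects (`nested`, `struct`, `ArrayType`), which is exactly the "multi-type" property debated later in this thread.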
-jJ

Jim Jewett wrote:
Travis E. Oliphant wrote:
Two packages need to share a chunk of memory (the package authors do not know each other and only have Python as a common reference). They both want to describe that the memory they are sharing has some underlying binary structure.
As a quick sanity check, please tell me where I went off track.
It sounds to me like you are assuming that:
(1) The memory chunk represents a single object (probably an array of some sort).
(2) That subchunks can themselves be described by a (single?) repeating C struct.
(3) You can't just use the C header, since you want this at run-time.
(4) It would be enough if you could say:
This is an array of 500 elements that look like
struct { int simple; struct { char name[30]; char addr[45]; int amount; } nested; }
Sure. I think that's pretty much it. I assume you mean object in the general sense and not as in (Python object).
(5) But is it not acceptable to use Martin's suggested ctypes equivalent of (building out from the inside):
Part of the problem is that ctypes uses a lot of different Python types (that's what I mean by "multi-object") to accomplish its goal. What I'm looking for is a single Python type that can be passed around and explains binary data.
Remember the buffer protocol is in compiled code. So, as a result,
1) It's harder to construct a class to pass through the protocol using the multiple-types approach of ctypes.
2) It's harder to interpret the object received through the buffer protocol.
Sure, it would be *possible* to use ctypes, but I think it would be very difficult. Think about how you would write the get_data_format C function in the extended buffer protocol for NumPy if you had to import ctypes and then build a class just to describe your data. How would you interpret what you get back?
The ctypes "format-description" approach is not as unified as a single Python type object that I'm proposing.
In NumPy, we have a very nice, compact description of complicated data already available. Why not use what we've learned?
I don't think we should just *use ctypes because it's there* when the way it describes binary data was not constructed with the extended buffer protocol in mind.
The other option, of course, which would not introduce a new Python type, is to use the array interface specification and pass a list of tuples. But I think this is also unnecessarily wasteful, because the sending object has to construct it and the receiving object has to deconstruct it. The whole point of the (extended) buffer protocol is to communicate this information more quickly.
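To make the list-of-tuples alternative concrete: in the style of the array interface's `descr` key, the struct from earlier in the thread might be spelled as nested lists and tuples. The type codes follow the array interface convention, but treat the details (and the `packed_size` helper) as illustrative only:

```python
# Illustrative array-interface-style description of:
#   struct { int simple;
#            struct { char name[30]; char addr[45]; int amount; } nested; }
descr = [
    ('simple', '<i4'),
    ('nested', [
        ('name', '|S30'),
        ('addr', '|S45'),
        ('amount', '<i4'),
    ]),
]

def packed_size(descr):
    """Total unpadded size in bytes of such a description (sketch-only helper)."""
    total = 0
    for _, fmt in descr:
        if isinstance(fmt, list):          # nested struct: recurse
            total += packed_size(fmt)
        else:                              # trailing digits give the byte count
            total += int(fmt.lstrip('<>|=Sifub'))
    return total
```

The sending side can build this from plain lists and strings, and the receiving side can walk it, which is the trade-off Travis describes: easy to construct, but it must be parsed on every exchange.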
-Travis

Travis Oliphant wrote:
Part of the problem is that ctypes uses a lot of different Python types (that's what I mean by "multi-object") to accomplish its goal. What I'm looking for is a single Python type that can be passed around and explains binary data.
It's not clear that multi-object is a bad thing in and of itself. It makes sense conceptually -- if you have a datatype object representing a struct, and you ask for a description of one of its fields, which could be another struct or array, you would expect to get another datatype object describing that.
Can you elaborate on what would be wrong with this?
Also, can you clarify whether your objection is to multi-object or multi-type. They're not the same thing -- you could have a data structure built out of multiple objects that are all of the same Python type, with attributes distinguishing between struct, array, etc. That would be single-type but multi-object.
-- Greg

Greg Ewing wrote:
Travis Oliphant wrote:
Part of the problem is that ctypes uses a lot of different Python types (that's what I mean by "multi-object") to accomplish its goal. What I'm looking for is a single Python type that can be passed around and explains binary data.
It's not clear that multi-object is a bad thing in and of itself. It makes sense conceptually -- if you have a datatype object representing a struct, and you ask for a description of one of its fields, which could be another struct or array, you would expect to get another datatype object describing that.
Can you elaborate on what would be wrong with this?
Also, can you clarify whether your objection is to multi-object or multi-type. They're not the same thing -- you could have a data structure built out of multiple objects that are all of the same Python type, with attributes distinguishing between struct, array, etc. That would be single-type but multi-object.
I've tried to clarify this in another post. Basically, what I don't like about the ctypes approach is that it is multi-type (every new data-format is a Python type).
In order to talk about all these Python types together, they must all share some attribute (or else be derived from a meta-type in C with a specific function-pointer entry).
I think it is simpler to think of a single Python type whose instances convey information about data-format.
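A minimal sketch of that "single type, many instances" idea (all names here are hypothetical, not from the PEP): one `DataType` class whose instances describe simple and compound formats alike, distinguished by attributes rather than by being distinct Python types:

```python
class DataType:
    """Hypothetical single type whose *instances* describe data formats."""
    def __init__(self, kind, itemsize, fields=None):
        self.kind = kind            # e.g. 'int', 'bytes', 'struct'
        self.itemsize = itemsize    # size in bytes of one item
        self.fields = fields or {}  # name -> (DataType, byte offset), for structs

# The nested struct from earlier in the thread, unpadded:
nested = DataType('struct', 79, {
    'name':   (DataType('bytes', 30), 0),
    'addr':   (DataType('bytes', 45), 30),
    'amount': (DataType('int', 4), 75),
})
top = DataType('struct', 83, {
    'simple': (DataType('int', 4), 0),
    'nested': (nested, 4),
})
```

The point of the sketch is only that `top` and every one of its field descriptions are instances of the same Python type, so C code receiving one through the buffer protocol has a single type to check for, in contrast to ctypes, where each format is a distinct type.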
-Travis

Travis Oliphant wrote:
Greg Ewing wrote:
Travis Oliphant wrote:
Part of the problem is that ctypes uses a lot of different Python types (that's what I mean by "multi-object") to accomplish its goal. What I'm looking for is a single Python type that can be passed around and explains binary data.
It's not clear that multi-object is a bad thing in and of itself. It makes sense conceptually -- if you have a datatype object representing a struct, and you ask for a description of one of its fields, which could be another struct or array, you would expect to get another datatype object describing that.
Yes, exactly. This is what the Python type I'm proposing does as well. So, perhaps we are misunderstanding each other. The difference is that data-types are instances of the data-type (data-format) object instead of new Python types (as they are in ctypes).
I've tried to clarify this in another post. Basically, what I don't like about the ctypes approach is that it is multi-type (every new data-format is a Python type).
I should clarify that I have no opinion about the ctypes approach for what ctypes does with it. I like ctypes and have adapted NumPy to make it easier to work with ctypes.
I'm saying that I don't like the idea of forcing this approach on everybody else who wants to describe arbitrary binary data just because ctypes is included. Now, if it is shown that it is indeed better than the simpler instances-of-a-single-type approach that I'm basically proposing, then I'll be persuaded.
However, the existence of an alternative strategy using a single Python type and multiple instances of that type to describe binary data (which is the NumPy approach and essentially the array module approach) means that we can't just assume a priori that the way ctypes did it is the only or best way.
The examples of "missing features" that Martin has exposed are not show-stoppers. They can all be easily handled within the context of what is being proposed. I can modify the PEP to show this. But, I don't have the time to spend if it's just all going to be rejected in the end. I need some encouragement in order to continue to invest energy in pushing this forward.
-Travis

Travis E. Oliphant wrote:
However, the existence of an alternative strategy using a single Python type and multiple instances of that type to describe binary data (which is the NumPy approach and essentially the array module approach) means that we can't just assume a priori that the way ctypes did it is the only or best way.
As a hypothetical, what if there was a helper function that translated a description of a data structure using basic strings and sequences (along the lines of what you have in your PEP) into a ctypes data structure?
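A rough sketch of such a helper (the tuple-based description format and the type codes here are assumptions of this sketch, not an existing spec):

```python
import ctypes

# Map from illustrative type codes to fixed-width ctypes types.
_SIMPLE = {'i4': ctypes.c_int32, 'i8': ctypes.c_int64,
           'f4': ctypes.c_float, 'f8': ctypes.c_double}

def make_ctype(descr):
    """Translate a list of (name, code-or-sublist[, repeat]) tuples
    into a ctypes Structure subclass."""
    fields = []
    for item in descr:
        name, fmt = item[0], item[1]
        # A sublist describes a nested struct; otherwise look up a simple code.
        ctype = make_ctype(fmt) if isinstance(fmt, list) else _SIMPLE[fmt]
        if len(item) == 3:            # optional repeat count -> array field
            ctype = ctype * item[2]
        fields.append((name, ctype))
    return type('AnonStruct', (ctypes.Structure,), {'_fields_': fields})
```

For example, `make_ctype([('x', 'f8'), ('y', 'f8')])` yields a structure whose `ctypes.sizeof` is 16. Notably, the non-native-endian case Nick raises below has no obvious spelling in this sketch, which is part of the open question.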
The examples of "missing features" that Martin has exposed are not show-stoppers. They can all be easily handled within the context of what is being proposed. I can modify the PEP to show this. But, I don't have the time to spend if it's just all going to be rejected in the end. I need some encouragement in order to continue to invest energy in pushing this forward.
I think the most important thing in your PEP is the formats for describing structures in a way that is easy to construct in both C and Python (specifically, by using strings and sequences), and it is worth pursuing for that aspect alone. Whether that datatype is then implemented as a class in its own right or as a factory function that returns a ctypes data type object is, to my mind, a relatively minor implementation issue (either way has questions to be addressed - I'm not sure how you tell ctypes that you have a 32-bit integer with a non-native endian format, for example).
In fact, it may make sense to just use the lists/strings directly as the data exchange format definitions, and let the various libraries do their own translation into their private format descriptions instead of creating a new one-type-to-describe-them-all.
Cheers, Nick.

Nick Coghlan wrote:
Travis E. Oliphant wrote:
However, the existence of an alternative strategy using a single Python type and multiple instances of that type to describe binary data (which is the NumPy approach and essentially the array module approach) means that we can't just assume a priori that the way ctypes did it is the only or best way.
As a hypothetical, what if there was a helper function that translated a description of a data structure using basic strings and sequences (along the lines of what you have in your PEP) into a ctypes data structure?
That would be fine and useful, in fact. But I don't see how it helps the problem of what to pass through the buffer protocol. I see passing ctypes type objects around at the C level as an unnecessary and burdensome approach, unless the ctypes objects were significantly enhanced.
In fact, it may make sense to just use the lists/strings directly as the data exchange format definitions, and let the various libraries do their own translation into their private format descriptions instead of creating a new one-type-to-describe-them-all.
Yes, I'm open to this possibility. I basically want two things in the object passed through the extended buffer protocol:
1) It's fast on the C level.
2) It covers all the use-cases.
If just a particular string or list structure were passed, then I would drop the data-format PEP and just have the dataformat argument of the extended buffer protocol be that thing.
Then, something that converts ctypes objects to that special format would be very nice indeed.
-Travis

Travis Oliphant wrote:
Nick Coghlan wrote:
In fact, it may make sense to just use the lists/strings directly as the data exchange format definitions, and let the various libraries do their own translation into their private format descriptions instead of creating a new one-type-to-describe-them-all.
Yes, I'm open to this possibility. I basically want two things in the object passed through the extended buffer protocol:
- It's fast on the C-level
- It covers all the use-cases.
If just a particular string or list structure were passed, then I would drop the data-format PEP and just have the dataformat argument of the extended buffer protocol be that thing.
Then, something that converts ctypes objects to that special format would be very nice indeed.
It may make sense to have a couple of distinct sections in the datatype PEP:
a. describing data formats with basic Python types
b. a lightweight class for parsing these data format descriptions
It's most of the way there already - part A would just be the various styles of arguments accepted by the datatype constructor, and part B would be the datatype object itself.
I personally think it makes the most sense to do both, but separating the two would make it clear that the descriptions can be standardised without *necessarily* defining a new class.
Cheers, Nick.

Travis Oliphant wrote:
Greg Ewing wrote:
Travis Oliphant wrote:
Part of the problem is that ctypes uses a lot of different Python types (that's what I mean by "multi-object") to accomplish its goal. What I'm looking for is a single Python type that can be passed around and explains binary data.
It's not clear that multi-object is a bad thing in and of itself. It makes sense conceptually -- if you have a datatype object representing a struct, and you ask for a description of one of its fields, which could be another struct or array, you would expect to get another datatype object describing that.
Can you elaborate on what would be wrong with this?
Also, can you clarify whether your objection is to multi-object or multi-type. They're not the same thing -- you could have a data structure built out of multiple objects that are all of the same Python type, with attributes distinguishing between struct, array, etc. That would be single-type but multi-object.
I've tried to clarify this in another post. Basically, what I don't like about the ctypes approach is that it is multi-type (every new data-format is a Python type).
In order to talk about all these Python types together, they must all share some attribute (or else be derived from a meta-type in C with a specific function-pointer entry).
(I tried to read the whole thread again, but it is too large already.)
There is a (badly named, probably) API to access information about ctypes types and instances of those types. The functions are PyObject_stgdict(obj) and PyType_stgdict(type). Both return a 'StgDictObject' instance, or NULL if the function fails. This object is the ctypes type object's __dict__.
StgDictObject is a subclass of PyDictObject and has fields that carry information about the C type (alignment requirements, size in bytes, plus some other stuff). Also it contains several pointers to functions that implement (in C) struct-like functionality (packing/unpacking).
Of course several of these fields can only be used for ctypes-specific purposes, for example a pointer to the ffi_type which is used when calling foreign functions, or the restype, argtypes, and errcheck fields which are only used when the type describes a function pointer.
This mechanism is probably a hack, because it's not possible to add C-accessible fields to type objects; on the other hand, it is extensible (in principle, at least).
Just to describe the implementation.
Thomas

Thomas Heller wrote:
(I tried to read the whole thread again, but it is too large already.)
There is a (badly named, probably) API to access information about ctypes types and instances of those types. The functions are PyObject_stgdict(obj) and PyType_stgdict(type). Both return a 'StgDictObject' instance, or NULL if the function fails. This object is the ctypes type object's __dict__.
StgDictObject is a subclass of PyDictObject and has fields that carry information about the C type (alignment requirements, size in bytes, plus some other stuff). Also it contains several pointers to functions that implement (in C) struct-like functionality (packing/unpacking).
Of course several of these fields can only be used for ctypes-specific purposes, for example a pointer to the ffi_type which is used when calling foreign functions, or the restype, argtypes, and errcheck fields which are only used when the type describes a function pointer.
This mechanism is probably a hack, because it's not possible to add C-accessible fields to type objects; on the other hand, it is extensible (in principle, at least).
Thank you for the description. While I've studied the ctypes code, I still don't understand the purpose behind all the data structures.
Also, I really don't have an opinion about ctypes' implementation. My comparisons simply reflect resistance to the "unexplained" idea that I'm supposed to use ctypes objects in a way they weren't really designed to be used.
For example, I'm pretty sure you were the one who made me aware that you can't just extend the PyTypeObject. Instead, you extended the tp_dict of the Python type object to store some of the extra information that is needed to describe a data-type like I'm proposing.
So, if I'm just describing data-format information, why do I need all this complexity (which makes the ctypes implementation easier/more natural/etc.)? What if the StgDictObject is the Python data-format object I'm talking about? It actually looks closer.
But, if all I want is the StgDictObject (or something like it), then why should I pass around the whole type object?
This is all I'm saying to those that want me to use ctypes to describe data-formats in the extended buffer protocol. I'm not trying to change anything in ctypes.
-Travis

Travis Oliphant wrote:
For example, I'm pretty sure you were the one who made me aware that you can't just extend the PyTypeObject. Instead, you extended the tp_dict of the Python type object to store some of the extra information that is needed to describe a data-type like I'm proposing.
So, if I'm just describing data-format information, why do I need all this complexity (which makes the ctypes implementation easier/more natural/etc.)? What if the StgDictObject is the Python data-format object I'm talking about? It actually looks closer.
But, if all I want is the StgDictObject (or something like it), then why should I pass around the whole type object?
Maybe you don't need it. ctypes certainly needs the type object because it is also used for constructing instances (while NumPy uses factory functions, IIUC), or for converting 'native' Python objects into foreign function arguments.
I know that this doesn't interest you from the NumPy perspective (and I don't want to offend you by saying this).
This is all I'm saying to those that want me to use ctypes to describe data-formats in the extended buffer protocol. I'm not trying to change anything in ctypes.
I don't want to change anything in NumPy, either, and was not the one who suggested using ctypes objects, although I had thought about whether it would be possible or not.
What I like about ctypes, and dislike about Numeric/numarray/NumPy, is the way C-compatible types are defined in ctypes. I find the ctypes way more natural than the numxxx or array module way, but what else would anyone expect from me as the ctypes author...
I hope that a useful interface is developed from your proposals, and will be happy to adapt ctypes to use it or interface ctypes with it if this makes sense.
Thomas

On Oct 31, 2006, at 6:38 PM, Thomas Heller wrote:
This mechanism is probably a hack, because it's not possible to add C-accessible fields to type objects; on the other hand, it is extensible (in principle, at least).
I'd better start rewriting PyObjC then :-). PyObjC stores some additional information in the type objects that is used to describe Objective-C classes (such as a reference to the proxied class).
IIRC, this has been possible since Python 2.3.
Ronald

Ronald Oussoren wrote:
On Oct 31, 2006, at 6:38 PM, Thomas Heller wrote:
This mechanism is probably a hack, because it's not possible to add C-accessible fields to type objects; on the other hand, it is extensible (in principle, at least).
I'd better start rewriting PyObjC then :-). PyObjC stores some additional information in the type objects that is used to describe Objective-C classes (such as a reference to the proxied class).
IIRC, this has been possible since Python 2.3.
I assume you are referring to the code in pyobjc/Modules/objc/objc-class.h ?
If this really is reliable, I'd better start rewriting ctypes then ;-).
Hm, I always thought there was some additional magic going on with type objects, fields appended dynamically at the end or whatever.
Thomas

On Nov 2, 2006, at 9:35 PM, Thomas Heller wrote:
Ronald Oussoren schrieb:
On Oct 31, 2006, at 6:38 PM, Thomas Heller wrote:
This mechanism is probably a hack, because it's not possible to add C-accessible fields to type objects; on the other hand, it is extensible (in principle, at least).
I'd better start rewriting PyObjC then :-). PyObjC stores some additional information in the type objects that is used to describe Objective-C classes (such as a reference to the proxied class).
IIRC, this has been possible since Python 2.3.
I assume you are referring to the code in pyobjc/Modules/objc/objc-class.h?
Yes.
If this really is reliable, I'd better start rewriting ctypes then ;-).
Hm, I always thought there was some additional magic going on with type objects, fields appended dynamically at the end or whatever.
There is such magic, but that magic was updated in Python 2.3 to allow type-object extensions like this.
Ronald
participants (7)
- Greg Ewing
- Jim Jewett
- Nick Coghlan
- Ronald Oussoren
- Thomas Heller
- Travis E. Oliphant
- Travis Oliphant