PEP: Adding data-type objects to Python
I'm still not sure exactly what is missing from ctypes. To make this concrete: you have an array of 500 elements matching

    struct {
        int simple;
        struct nested {
            char name[30];
            char addr[45];
            int amount;
        } nested;
    };

ctypes can describe this as

    class nested(Structure):
        _fields_ = [("name", c_char*30), ("addr", c_char*45), ("amount", c_long)]

    class struct(Structure):
        _fields_ = [("simple", c_int), ("nested", nested)]

    desc = struct * 500

You have said that creating whole classes is too much overhead, and the description should only be an instance. To me, that particular class (arrays of 500 structs) still looks pretty lightweight. So please clarify when it starts to be a problem.

(1) For simple types -- mapping char name[30]; ==> ("name", c_char*30)

Do you object to using the c_char type? Do you object to the array-of-length-30 class, instead of just having a repeat or shape attribute? Do you object to naming the field?

(2) For the complex types, nested and struct

Do you object to creating these two classes even once? For example, are you expecting to need different classes for each buffer, and to have many buffers created quickly? Is creating that new class a royal pain, but frequent (and slow) enough that you can't just make a call into Python (or ctypes)?

(3) Given that you will describe X, is X*500 (==> a type describing an array of 500 Xs) a royal pain in C? If so, are you expecting to have to do it dynamically for many sizes, and quickly enough that you can't just let ctypes do it for you?

-jJ
Jim Jewett wrote:
I'm still not sure exactly what is missing from ctypes. To make this concrete:
I think the only thing missing from ctypes' "expressiveness," as far as I can tell in terms of what you *can* do, is the byte-order representation. What is missing is ease of use for producers and consumers in interpreting the data-type. When I speak of producers and consumers, I'm largely talking about writers of C (or Java or .NET) code.

Producers must basically use Python code to create classes of various types. This is going to be slow in C -- probably slower than the array interface (which is what we have now, informally). Consumers are going to have a hard time interpreting the result. I'm not even sure how to do that, in fact. I'd like NumPy to be able to understand ctypes as a means to specify data. Would I have to check against all the sub-types of CDataType, pull out the fields, and check the tp_name of the type object? I'm not sure.

It seems like a string with the C structure would be better as a data representation, but then a third-party library would want to parse that, so Python might as well have its own parser for data-types. So, Python might as well have its own way to describe data. My claim is that this default way should *not* be overloaded by using Python type objects (the ctypes way). I'm claiming that the NumPy way of using a separate Python instance to describe data-types is the better approach. I'm not saying the NumPy object itself should be used. I'm saying we should come up with a single DataFormatType whose instances express the data formats in ways that other packages can produce and consume (or even use internally).

It would be easy for NumPy to "use" the default Python object in its PyArray_Descr * structure. It would also be easy for ctypes to "use" the default Python object in its StgDict object that is the tp_dict of every ctypes type object. It would be easy for the struct module to allow for this data-format object (instead of just strings) in its methods. It would be easy for the array module to accept this data-format object (instead of just typecodes) in its constructor. Lots of things would suddenly be more consistent throughout both the Python and C-Python user space.

Perhaps after discussion it becomes clear that the ctypes approach is sufficient to be "that thing" that all modules use to share data-format information. It's definitely expressive enough. But my argument is that NumPy data-type objects are also "pretty close," so why should they be rejected? We could also make a "string syntax" do it.
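To make the consumer side concrete: a rough sketch of what interrogating a ctypes description from C looks like today (the helper name is invented for illustration; it uses only the documented Python-level ctypes attributes and standard C-API calls, and error handling is abbreviated):

    #include <Python.h>

    static int
    walk_ctypes_fields(PyObject *ctype)    /* e.g. a ctypes Structure subclass */
    {
        PyObject *fields = PyObject_GetAttrString(ctype, "_fields_");
        Py_ssize_t i, n;

        if (fields == NULL)
            return -1;                     /* not laid out like a Structure */
        n = PySequence_Size(fields);
        for (i = 0; i < n; i++) {
            PyObject *pair  = PySequence_GetItem(fields, i);  /* ("name", c_type) */
            PyObject *ftype = PyTuple_GetItem(pair, 1);       /* borrowed reference */
            /* To learn what ftype means, the consumer must compare it against
             * the known ctypes classes (c_int, c_char, Array, Structure, ...)
             * or look at ((PyTypeObject *)ftype)->tp_name; there is no flat,
             * already-parsed descriptor to read. */
            (void)ftype;
            Py_DECREF(pair);
        }
        Py_DECREF(fields);
        return 0;
    }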
You have said that creating whole classes is too much overhead, and the description should only be an instance. To me, that particular class (arrays of 500 structs) still looks pretty lightweight. So please clarify when it starts to be a problem.
(1) For simple types -- mapping char name[30]; ==> ("name", c_char*30)
Do you object to using the c_char type? Do you object to the array-of-length-30 class, instead of just having a repeat or shape attribute? Do you object to naming the field?
(2) For the complex types, nested and struct
Do you object to creating these two classes even once? For example, are you expecting to need different classes for each buffer, and to have many buffers created quickly?

I object to the way I "consume" and "produce" the ctypes interface. It's much too slow to be used on the C level for sharing many small buffers quickly.
Is creating that new class a royal pain, but frequent (and slow) enough that you can't just make a call into python (or ctypes)?
(3) Given that you will describe X, is X*500 (==> a type describing an array of 500 Xs) a royal pain in C? If so, are you expecting to have to do it dynamically for many sizes, and quickly enough that you can't just let ctypes do it for you?
That pretty much sums it up (plus the pain of having to basically write Python code from "C"). -Travis
Jim Jewett wrote:
I'm still not sure exactly what is missing from ctypes. To make this concrete:
I was too hasty. There are some things actually missing from ctypes:

1) long double (this is not the same across platforms, but it is a data-type).

2) complex-valued types (you might argue that it's just a 2-array of floats, but you could say the same thing about int as an array of bytes). The point is how do people interpret the data. Complex-valued data-types are very common. It is one reason Fortran is still used by scientists.

3) Unicode characters (there is wchar_t support, but I mean a way to describe what kind of unicode characters you have in a cross-platform way). I actually think we have a way to describe encodings in the data-format representation as well.

4) What about floating-point representations that are not IEEE 754 4-byte or 8-byte? There should be a way to at least express the data-format in these cases (this is actually how long double should be handled as well, since what is actually done with the extra bits varies across platforms).

So, we can't "just use ctypes" as a complete data-format representation, because it is also missing some things. What we need is a standard way for libraries that deal with data-formats to communicate with each other. I need help with a PEP like this, and that's what I'm asking for. It's all I've really been after all along.

A couple of points:

* One reason to support the idea of the Python object approach (versus a string syntax) is that it "is already parsed". A list-syntax approach (perhaps built from strings for fundamental data-types) might also be considered "already parsed".

* One advantage of using a "kind" (versus a character for every type, as struct and array do) is that it helps consumers and producers speed up the parser (a fuller branching tree).
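To make 1) and 2) concrete, a struct like the following (a minimal sketch; the struct and member names are made up) has members that a ctypes Structure currently has no direct way to describe:

    /* C99 */
    struct sample {
        long double     extended;  /* 1) extended precision; exact layout varies by platform */
        double _Complex value;     /* 2) complex-valued member, common in numeric/Fortran data */
    };

-Travis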
Travis E. Oliphant schrieb:
I was too hasty. There are some things actually missing from ctypes:
I think Thomas can correct me if I'm wrong: I think endianness is supported (although this support seems undocumented). There seems to be code that checks for the presence of a _byteswapped_ attribute on fields of a struct; presence of this field is then interpreted as data having the "other" endianness.
1) long double (this is not the same across platforms, but it is a data-type).
That's indeed missing.
2) complex-valued types (you might argue that it's just a 2-array of floats, but you could say the same thing about int as an array of bytes). The point is how do people interpret the data. Complex-valued data-types are very common. It is one reason Fortran is still used by scientists.
Well, by the same reasoning, you could argue that pixel values (RGBA) are missing in the PEP. It's a convenience, sure, and it may also help interfacing with the platform's FORTRAN implementation - however, are you sure that NumPy's complex layout is consistent with the platform's C99 _Complex definition?
3) Unicode characters
4) What about floating-point representations that are not IEEE 754 4-byte or 8-byte.
Both of these are available in a platform-dependent way: if the platform uses non-IEEE-754 formats for C float and C double, ctypes will interface with that just fine. It is actually vice versa: IEEE-754 4-byte and 8-byte formats are not supported in ctypes. Same for Unicode: the platform's wchar_t is supported (as you said), but not a platform-independent (say) 4-byte little-endian format. Regards, Martin
Martin v. Löwis wrote:
Travis E. Oliphant schrieb:
2) complex-valued types (you might argue that it's just a 2-array of floats, but you could say the same thing about int as an array of bytes). The point is how do people interpret the data. Complex-valued data-types are very common. It is one reason Fortran is still used by scientists.
Well, by the same reasoning, you could argue that pixel values (RGBA) are missing in the PEP. It's a convenience, sure, and it may also help interfacing with the platform's FORTRAN implementation - however, are you sure that NumPy's complex layout is consistent with the platform's C99 _Complex definition?
I think so (it is on gcc). And yes, where you draw the line between fundamental and "derived" data-types is somewhat arbitrary. I'd rather include complex numbers than not, given their prevalence in the data streams I'm trying to make compatible with each other.
3) Unicode characters
4) What about floating-point representations that are not IEEE 754 4-byte or 8-byte.
Both of these are available in a platform-dependent way: if the platform uses non-IEEE754 formats for C float and C double, ctypes will interface with that just fine. It is actually vice versa: IEEE-754 4-byte and 8-byte is not supported in ctypes.
That's what I meant. The 'f' kind in the data-type description is also intended to mean "platform float", whatever that is. But a complete data-format representation would have a way to describe other bit-layouts for floating-point representation, even if you can't actually calculate with them directly without conversion.
Same for Unicode: the platform's wchar_t is supported (as you said), but not a platform-independent (say) 4-byte little-endian.
Right. It's a matter of scope. Frankly, I'd be happy enough to start with "typecodes" in the extended buffer protocol (that's where the array module is now) and then move up to something more complete later. But since we already have an array interface for record-arrays to share information and data with each other, and ctypes showing all of its power, why not be more complete? -Travis
Travis Oliphant <oliphant.travis <at> ieee.org> writes:
Frankly, I'd be happy enough to start with "typecodes" in the extended buffer protocol (that's where the array module is now) and then move up to something more complete later.
Let's just start with that. The way I see the problem is that buffer protocol is fine as long as your data is an array of bytes, but if it is an array of doubles, you are out of luck. So, while I can do
b = buffer(array('d', [1,2,3]))
there is not much that I can do with b. For example, if I want to pass it to numpy, I will have to provide the type and shape information myself:
numpy.ndarray(shape=(3,), dtype=float, buffer=b)
array([ 1.,  2.,  3.])
With the extended buffer protocol, I should be able to do
numpy.array(b)
So let's start by solving this problem, and limit it to data that can be found in a standard library array. This way we can postpone the discussion of shapes, strides, and nested structs.

I propose a simple bf_gettypeinfo(PyObject *obj, int* type, int* bitsize) method that would return a type code and the size of the data item. I believe it is better to have type codes free of size information, for several reasons:

1. Generic code can use the size information directly, without having to know that an int is 32 and a double is 64 bits.

2. Odd sizes can be easily described without having to add a new type code.

3. I assume that the existing bf_ functions would still return sizes in bytes, so having the item size available as an int will help to get the number of items.

If we manage to agree on a standard way to pass primitive type information, it will be a big achievement and immediately useful, because simple arrays are already in the standard library.
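This slot does not exist today; as a sketch only, a producer such as the array module might fill in the proposed signature roughly like this (reusing a structmember.h code is just one possible choice of type code):

    #include <Python.h>
    #include "structmember.h"   /* T_* codes: one candidate set of type codes */

    /* Proposed slot: int bf_gettypeinfo(PyObject *obj, int *type, int *bitsize);
     * Implementation an array('d', ...) object might provide: */
    static int
    array_gettypeinfo(PyObject *self, int *type, int *bitsize)
    {
        (void)self;
        *type    = T_DOUBLE;                 /* type code, free of size information */
        *bitsize = 8 * (int)sizeof(double);  /* item size reported separately, in bits */
        return 0;
    }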
On 11/1/06, Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:
Let's just start with that. The way I see the problem is that buffer protocol is fine as long as your data is an array of bytes, but if it is an array of doubles, you are out of luck. So, while I can do
b = buffer(array('d', [1,2,3]))
there is not much that I can do with b. For example, if I want to pass it to numpy, I will have to provide the type and shape information myself:
numpy.ndarray(shape=(3,), dtype=float, buffer=b)
array([ 1.,  2.,  3.])
With the extended buffer protocol, I should be able to do
numpy.array(b)
As a data point, this is the first posting that has clearly explained to me what the two PEPs are attempting to achieve. That may be my blindness to what others find self-evident, but equally, I may not be the only one who needed this example... Paul.
Alexander Belopolsky wrote:
Travis Oliphant <oliphant.travis <at> ieee.org> writes:
b = buffer(array('d', [1,2,3]))
there is not much that I can do with b. For example, if I want to pass it to numpy, I will have to provide the type and shape information myself:
numpy.ndarray(shape=(3,), dtype=float, buffer=b)
array([ 1., 2., 3.])
With the extended buffer protocol, I should be able to do
numpy.array(b)
or just

    numpy.array(array.array('d', [1,2,3]))

and leave out the buffer object altogether.
So let's start by solving this problem and limit it to data that can be found in a standard library array. This way we can postpone the discussion of shapes, strides and nested structs.
Don't lump those ideas together. Shapes and strides are necessary for N-dimensional arrays (they are essentially what *defines* the N-dimensional array). I really don't want to sacrifice those in the extended buffer protocol. If you want to separate them into different functions, then that is a possibility.
If we manage to agree on the standard way to pass primitive type information, it will be a big achievement and immediately useful because simple arrays are already in the standard library.
We could start there, I suppose. Especially if it helps us all get on the same page. But we already see applications beyond this simple case, so I would like to keep at least an "eye" on the more difficult cases, for which we already have a working solution in the "array interface". -Travis
Travis Oliphant <oliphant.travis <at> ieee.org> writes:
Don't lump those ideas together. Shapes and strides are necessary for N-dimensional array's (it's essentially what *defines* the N-dimensional array). I really don't want to sacrifice those in the extended buffer protocol. If you want to separate them into different functions then that is a possibility.
I don't understand. Do you want to discuss shapes and strides separately from the datatype or not? Note that in ctypes shape is a property of datatype (as in c_int*2*3). In your proposal, shapes and strides are communicated separately. This presents a unique memory management challenge: if the object does not contain shape information in a ready to be pointed to form, who is responsible for deallocating the shape array?
If we manage to agree on the standard way to pass primitive type information, it will be a big achievement and immediately useful because simple arrays are already in the standard library.
We could start there, I suppose. Especially if it helps us all get on the same page.
Let's start:

1. Should primitive types be associated with simple type codes (short, int, long, float, double) or type/size pairs [(int, 16), (int, 32), (int, 64), (float, 32), (float, 64)]? - I prefer pairs.

2. Should primitive type codes be characters or integers (from an enum) at the C level? - I prefer integers.

3. Should size be expressed in bits or bytes? - I prefer bits.
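A minimal sketch of what those preferences might look like at the C level (all names and values are hypothetical):

    /* Question 1: a (kind, size) pair rather than a fused type code.
     * Question 2: the kind is an integer enum, not a character.
     * Question 3: the size is given in bits.                        */
    enum buf_kind { BUF_BOOL, BUF_INT, BUF_UINT, BUF_FLOAT, BUF_COMPLEX };

    struct buf_typeinfo {
        enum buf_kind kind;
        int           bits;
    };

    /* (int, 32)   -> { BUF_INT,   32 }
     * (float, 64) -> { BUF_FLOAT, 64 } */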
Alexander Belopolsky wrote:
Travis Oliphant <oliphant.travis <at> ieee.org> writes:
Don't lump those ideas together. Shapes and strides are necessary for N-dimensional array's (it's essentially what *defines* the N-dimensional array). I really don't want to sacrifice those in the extended buffer protocol. If you want to separate them into different functions then that is a possibility.
I don't understand. Do you want to discuss shapes and strides separately from the datatype or not? Note that in ctypes shape is a property of datatype (as in c_int*2*3). In your proposal, shapes and strides are communicated separately. This presents a unique memory management challenge: if the object does not contain shape information in a ready to be pointed to form, who is responsible for deallocating the shape array?
Perhaps a "view object" should be returned like /F suggests and it manages the shape, strides, and data-format.
If we manage to agree on the standard way to pass primitive type information, it will be a big achievement and immediately useful because simple arrays are already in the standard library.
We could start there, I suppose. Especially if it helps us all get on the same page.
Let's start:
1. Should primitive types be associated with simple type codes (short, int, long, float, double) or type/size pairs [(int,16), (int, 32), (int, 64), (float, 32), (float, 64)]? - I prefer pairs
2. Should primitive type codes be characters or integers (from an enum) at C level? - I prefer integers
Are these orthogonal?
3. Should size be expressed in bits or bytes? - I prefer bits
So, you want an integer enum for the "kind" and an integer for the bitsize? That's fine with me.

One thing I just remembered: we have T_UBYTE and T_BYTE, etc., defined in structmember.h already. Should we just re-use those #defines, while adding to them, to make an easy-to-use interface for primitive types?

-Travis
Travis E. Oliphant <oliphant.travis <at> ieee.org> writes:
Alexander Belopolsky wrote:
... 1. Should primitive types be associated with simple type codes (short, int, long, float, double) or type/size pairs [(int,16), (int, 32), (int, 64), (float, 32), (float, 64)]? - I prefer pairs
2. Should primitive type codes be characters or integers (from an enum) at C level? - I prefer integers
Are these orthogonal?
Do you mean, are my questions 1 and 2 orthogonal? I guess they are.
3. Should size be expressed in bits or bytes? - I prefer bits
So, you want an integer enum for the "kind" and an integer for the bitsize? That's fine with me.
One thing I just remembered. We have T_UBYTE and T_BYTE, etc. defined in structmember.h already. Should we just re-use those #defines while adding to them to make an easy to use interface for primitive types?
I was thinking about using something like NPY_TYPES enum, but T_* codes would work as well. Let me just present both options for the record:

--- numpy/ndarrayobject.h ---

    enum NPY_TYPES {
        NPY_BOOL=0,
        NPY_BYTE, NPY_UBYTE,
        NPY_SHORT, NPY_USHORT,
        NPY_INT, NPY_UINT,
        NPY_LONG, NPY_ULONG,
        NPY_LONGLONG, NPY_ULONGLONG,
        NPY_FLOAT, NPY_DOUBLE, NPY_LONGDOUBLE,
        NPY_CFLOAT, NPY_CDOUBLE, NPY_CLONGDOUBLE,
        NPY_OBJECT=17,
        NPY_STRING, NPY_UNICODE,
        NPY_VOID,
        NPY_NTYPES,
        NPY_NOTYPE,
        NPY_CHAR,           /* special flag */
        NPY_USERDEF=256     /* leave room for characters */
    };

--- structmember.h ---

    /* Types */
    #define T_SHORT     0
    #define T_INT       1
    #define T_LONG      2
    #define T_FLOAT     3
    #define T_DOUBLE    4
    #define T_STRING    5
    #define T_OBJECT    6
    /* XXX the ordering here is weird for binary compatibility */
    #define T_CHAR      7   /* 1-character string */
    #define T_BYTE      8   /* 8-bit signed int */
    /* unsigned variants: */
    #define T_UBYTE     9
    #define T_USHORT    10
    #define T_UINT      11
    #define T_ULONG     12
    /* Added by Jack: strings contained in the structure */
    #define T_STRING_INPLACE 13
    #define T_OBJECT_EX 16  /* Like T_OBJECT, but raises AttributeError
                               when the value is NULL, instead of
                               converting to None. */
    #ifdef HAVE_LONG_LONG
    #define T_LONGLONG  17
    #define T_ULONGLONG 18
    #endif /* HAVE_LONG_LONG */
Travis E. Oliphant schrieb:
2. Should primitive type codes be characters or integers (from an enum) at C level? - I prefer integers
3. Should size be expressed in bits or bytes? - I prefer bits
So, you want an integer enum for the "kind" and an integer for the bitsize? That's fine with me.
One thing I just remembered. We have T_UBYTE and T_BYTE, etc. defined in structmember.h already. Should we just re-use those #defines while adding to them to make an easy to use interface for primitive types?
Notice that those type codes imply sizes, namely the platform sizes (where "platform" always means "what the C compiler does"). So if you want to have platform-independent codes as well, you shouldn't use the T_ codes. Regards, Martin
Martin v. Löwis wrote:
Travis E. Oliphant schrieb:
2. Should primitive type codes be characters or integers (from an enum) at C level? - I prefer integers
3. Should size be expressed in bits or bytes? - I prefer bits
So, you want an integer enum for the "kind" and an integer for the bitsize? That's fine with me.
One thing I just remembered. We have T_UBYTE and T_BYTE, etc. defined in structmember.h already. Should we just re-use those #defines while adding to them to make an easy to use interface for primitive types?
Notice that those type codes imply sizes, namely the platform sizes (where "platform" always means "what the C compiler does"). So if you want to have platform-independent codes as well, you shouldn't use the T_ codes.
In NumPy we've found it convenient to use both. Basically, we've set up a header file that "does the translation", using #defines and typedefs to create things like (on a 32-bit platform)

    typedef int npy_int32;
    #define NPY_INT32 NPY_INT

so that either the T_code-like enum or the bit-width name can be used interchangeably. Typically people want to specify bit-widths (and see their data-types in bit-widths), but in C code that implements something you need to use one of the platform integers. I don't know if we really need to bring all of that over.
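Spelled out a little more fully, a sketch (assuming a platform where int is 32 bits and long long is 64 bits; the enum values are the ones from the NPY_TYPES enum quoted above):

    /* bit-width names map onto platform C types ... */
    typedef int       npy_int32;
    typedef long long npy_int64;

    /* ... and bit-width codes alias the platform-named codes, so either
     * spelling can be used interchangeably in C code. */
    enum { NPY_INT = 5, NPY_LONGLONG = 9 };   /* values as in NPY_TYPES above */
    #define NPY_INT32 NPY_INT
    #define NPY_INT64 NPY_LONGLONG

-Travis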
Travis E. Oliphant wrote:
We have T_UBYTE and T_BYTE, etc. defined in structmember.h already. Should we just re-use those #defines while adding to them to make an easy to use interface for primitive types?
They're mixed up with size information, though, which we don't want to do. -- Greg
Travis Oliphant wrote:
or just
numpy.array(array.array('d',[1,2,3]))
and leave out the buffer object altogether.
I think the buffer object in his example was just a placeholder for "some arbitrary object that supports the buffer interface", not necessarily another NumPy array. -- Greg
participants (7):
- "Martin v. Löwis"
- Alexander Belopolsky
- Greg Ewing
- Jim Jewett
- Paul Moore
- Travis E. Oliphant
- Travis Oliphant