[Python-Dev] idea for data-type (data-format) PEP

Wed Nov 1 19:30:07 CET 2006

Martin v. Löwis wrote:
> Travis E. Oliphant schrieb:
>> What if we look at this from the angle of trying to communicate 
>> data-formats between different libraries (not change the way anybody 
>> internally deals with data-formats).
> 
> ISTM that this is not the right approach. If the purpose of the datatype
> object is just to communicate the layout in the extended buffer
> interface, then it should be specified in that PEP, rather than being
> stand-alone, and it should not pretend to serve any other purpose.

I'm actually quite fine with that.  If that is the consensus, then I 
will just go in that direction.  ISTM, though, that since we are going 
to the trouble inside the extended buffer protocol anyway, we might as 
well be as complete as we know how to be.

> Or, if it does have uses independent of the buffer extension: what
> are those uses?

So that NumPy and ctypes and audio libraries and video libraries and 
database libraries and image-file format libraries can communicate about 
data-formats using the same expressions (in Python).

Maybe we decide that ctypes-based expressions are a very good way to 
communicate about those things in Python for all other packages.  If 
that is the case, then I argue that we ought to change the array module, 
and the struct module to conform (of course keeping the old ways for 
backward compatibility) and set the standard for other packages to follow.
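
To make the overlap concrete, here is a minimal sketch (standard library 
only) of one packed RGB pixel described three ways: as a ctypes structure, 
as a struct format string, and as an array character code.  The Pixel 
class and the sample bytes are made up for illustration.

```python
import array
import ctypes
import struct

# Hypothetical record: one RGB pixel, three unsigned bytes.
class Pixel(ctypes.Structure):
    _fields_ = [("r", ctypes.c_ubyte),
                ("g", ctypes.c_ubyte),
                ("b", ctypes.c_ubyte)]

raw = bytes([10, 20, 30])

# 1) ctypes: the layout is expressed as a Python class.
p = Pixel.from_buffer_copy(raw)
via_ctypes = (p.r, p.g, p.b)

# 2) struct: the layout is expressed as a format string.
via_struct = struct.unpack("BBB", raw)

# 3) array: only a flat run of one element type, named by a character code.
via_array = tuple(array.array("B", raw))

# All three read the same bytes, yet each uses its own description.
print(via_ctypes, via_struct, via_array)
print(ctypes.sizeof(Pixel), struct.calcsize("BBB"))
```

Each module can decode the bytes, but none of them can consume the 
description used by the other two, which is the gap a shared 
representation would close.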

What problem do you have in defining a standard way to communicate about 
binary data-formats (not just images)?  I still can't figure out why you 
are so resistant to the idea.  MPI had to do it.

> 
>> 1) We could define a special string-syntax (or list syntax) that covers 
>> every special case.  The array interface specification goes this 
>> direction and it requires no new Python types.  This could also be seen 
>> as an extension of the "struct" module to allow for nested structures, etc.
>>
>> 2) We could define a Python object that specifically carries data-format 
>> information.
> 
> To distinguish between these, convenience of usage (and of construction)
> should have to be taken into account. At least for the preferred
> alternative, but better for the runners-up, too, there should be a
> demonstration on how existing modules have to be changed to support it
> (e.g. for the struct and array modules as producers; not sure what
> good consumer code would be).

Absolutely --- if something is to be made useful across packages and 
from Python.   This is where the discussion should take place.  The 
struct module and array modules would both be consumers also so that in 
the struct module you could specify your structure in terms of the 
standard data-representation and in the array module you could specify 
your array in terms of the standard representation instead of using 
"character codes".

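As a rough sketch of what "consumer" could mean for the struct module: 
suppose the shared representation were a list of (name, type-string) 
pairs in the style of the array interface.  The helpers to_struct_format 
and unpack_record and the small code table below are hypothetical; 
nothing like them exists in the struct module.

```python
import struct

# Hypothetical mapping from array-interface-style type strings
# (kind + byte size) to struct character codes. Deliberately incomplete.
_CODES = {"u1": "B", "i1": "b", "u2": "H", "i2": "h",
          "u4": "I", "i4": "i", "f4": "f", "f8": "d"}

def to_struct_format(descr, byteorder="<"):
    """Translate [(name, typestr), ...] into a struct format string."""
    return byteorder + "".join(_CODES[typestr] for _name, typestr in descr)

def unpack_record(descr, data, byteorder="<"):
    """Unpack one flat record into a dict keyed by field name."""
    fmt = to_struct_format(descr, byteorder)
    values = struct.unpack(fmt, data)
    return dict(zip((name for name, _typestr in descr), values))

rgb = [("r", "u1"), ("g", "u1"), ("b", "u1")]
print(to_struct_format(rgb))
print(unpack_record(rgb, bytes([10, 20, 30])))
```

The sketch handles only flat records of scalar fields; nested structures 
and arrays within fields are exactly the cases where the character codes 
run out.
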
> 
> Suppose I wanted to change all RGB values to a gray value (i.e. R=G=B),
> what would the C code look like that does that? (it seems now that the
> primary purpose of this machinery is image manipulation)
> 

For me, image manipulation is definitely not the only purpose (or even 
the primary purpose).  It's just an easy one to explain, since most 
people understand images.  But I think this question is actually 
irrelevant (IMHO).  To me, how you change all RGB values to gray would 
depend on the library you are using not on how data-formats are expressed.

Maybe we are still misunderstanding each other.

If you really want to know, in NumPy it might look like this:

Python code:

img['r'] = img['g']
img['b'] = img['g']
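
For anyone who wants to run those two lines, here is a self-contained 
version.  The record dtype, the 2x2 image and its sample values are 
assumptions made up for the example.

```python
import numpy as np

# A record dtype with one unsigned byte per channel.
rgb = np.dtype([("r", "u1"), ("g", "u1"), ("b", "u1")])

# A made-up 2x2 image.
img = np.zeros((2, 2), dtype=rgb)
img["r"] = 200
img["g"] = [[1, 2], [3, 4]]
img["b"] = 50

# Copy the green channel into red and blue, so that R = G = B.
img["r"] = img["g"]
img["b"] = img["g"]

print(img["r"].tolist())
print(img["b"].tolist())
```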

C-code:

Either use the Python C-API to do essentially the same thing as above, 
or, to do

img['r'] = img['g']

directly:

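dtype = img->descr;
fields = dtype->fields;   /* dict: name -> (dtype, offset) tuple */
r_field = PyDict_GetItemString(fields, "r");
g_field = PyDict_GetItemString(fields, "g");
r_field_dtype = PyTuple_GET_ITEM(r_field, 0);
r_field_offset = PyInt_AsLong(PyTuple_GET_ITEM(r_field, 1));
g_field_dtype = PyTuple_GET_ITEM(g_field, 0);
g_field_offset = PyInt_AsLong(PyTuple_GET_ITEM(g_field, 1));
/* GetField and SetField each steal a reference to the dtype */
Py_INCREF(g_field_dtype);
obj = PyArray_GetField(img, (PyArray_Descr *)g_field_dtype, g_field_offset);
Py_INCREF(r_field_dtype);
PyArray_SetField(img, (PyArray_Descr *)r_field_dtype, r_field_offset, obj);
Py_DECREF(obj);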

But, I still don't see how that is relevant to the question of how to 
represent the data-format to share that information across two extensions.

>> The problem with 2b is that what works inside an extension module may 
>> not be the best option when it comes to communicating across multiple 
>> extension modules.   Certainly none of the extension modules have argued 
>> that case effectively.
> 
> I think there are two ways in which one option could be "better" than
> the other: it might be more expressive, and it might be easier to use.
> For the second aspect (ease of use), there are two subways: it might
> be easier to produce, or it might be easier to consume.

I like this as a means to judge a data-format representation. Let me 
summarize to see if I understand:

1) Expressiveness (does it express every data-format you might want or need?)
2) Ease of use
    a) Production: how easy is it to create the representation?
    b) Consumption: how easy is it to interpret the representation?

-Travis