[Python-Dev] idea for data-type (data-format) PEP

Wed Nov 1 21:18:23 CET 2006

Martin v. Löwis wrote:
> Travis E. Oliphant schrieb:
>   
>>> Or, if it does have uses independent of the buffer extension: what
>>> are those uses?
>>>       
>> So that NumPy and ctypes and audio libraries and video libraries and 
>> database libraries and image-file format libraries can communicate about 
>> data-formats using the same expressions (in Python).
>>     
>
> I find that puzzling. In what way can the specification of a data type
> enable communication? Don't you need some kind of protocol for it
> (i.e. operations to be invoked)? Also, do you mean that these libraries
> can communicate with each other? Or with somebody else? If so, with
> whom?
>   
What is puzzling?  I've just specified the extended buffer protocol as 
something concrete that data-format objects are shared through.   That's 
on the C-level.  I gave several examples of where such sharing would be 
useful.

Then, I gave examples in Python of how sharing data-formats would also 
be useful so that modules could support the same means to construct 
data-formats (instead of struct using strings, array using typecodes, 
ctypes using its type-objects, and NumPy using dtype objects).
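
To make the different spellings concrete, here is a minimal sketch that uses only the standard library (so the NumPy dtype spelling is left out). It describes the same two layouts, one C double and one three-byte RGB pixel, in the struct, array, and ctypes styles. The `Pixel` class and the variable names are made up for illustration.

```python
import array
import ctypes
import struct

# One C double, spelled three different ways.
struct_size = struct.calcsize("d")            # struct: format string
array_size = array.array("d").itemsize        # array: one-character typecode
ctypes_size = ctypes.sizeof(ctypes.c_double)  # ctypes: type object

# An RGB pixel of three unsigned bytes (illustrative layout).
class Pixel(ctypes.Structure):
    _fields_ = [("r", ctypes.c_ubyte),
                ("g", ctypes.c_ubyte),
                ("b", ctypes.c_ubyte)]

pixel_struct_size = struct.calcsize("3B")     # struct spelling
pixel_ctypes_size = ctypes.sizeof(Pixel)      # ctypes spelling
```

All three modules agree on the sizes, but none of them can read the others' descriptions directly.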
>   
>> What problem do you have in defining a standard way to communicate about 
>> binary data-formats (not just images)?  I still can't figure out why you 
>> are so resistant to the idea.  MPI had to do it.
>>     
>
> I'm afraid of "dead" specifications, things whose only motivation is
> that they look nice. They are just clutter. There are a few examples
> of this already in Python, like the character buffer interface or
> the multi-segment buffers.
>   
O.K.  I can understand that concern.  But all you have to do is make struct, 
array, and ctypes support the same data-format specification (by "support" 
I mean having a way to "consume" the data-format object into, and "produce" 
it from, the natural representation that they have internally) and you are 
guaranteed it won't "die."  In fact, what would be ideal is for the 
PIL, NumPy, CVXOpt, PyMedia, PyGame, pyre, pympi, PyVoxel, etc., etc. 
(there really are many modules that should be able to talk to each other 
more easily) to all support the same data-format representations. Then, 
you don't have to learn everybody's re-invention of the same concept 
whenever you encounter a new library that does something with binary data.
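
As a rough sketch of what "consume" and "produce" could mean, assume a shared description that is nothing more than an ordered list of (name, kind) pairs. That list, the kind codes (borrowed from struct), and all function names below are hypothetical stand-ins for a real data-format object. Alignment and padding are ignored here.

```python
import ctypes
import struct

# Hypothetical shared description: ordered (name, kind) pairs.
PIXEL_FORMAT = [("r", "B"), ("g", "B"), ("b", "B")]

# Illustrative mapping between kind codes and ctypes types.
_CTYPES_BY_KIND = {"B": ctypes.c_ubyte, "h": ctypes.c_short,
                   "i": ctypes.c_int, "d": ctypes.c_double}
_KIND_BY_CTYPES = {ctype: kind for kind, ctype in _CTYPES_BY_KIND.items()}

def to_struct_format(fmt):
    """'Consume' the shared description into a struct format string."""
    return "".join(kind for _, kind in fmt)

def to_ctypes(fmt):
    """'Consume' the shared description into a ctypes Structure."""
    class Record(ctypes.Structure):
        _fields_ = [(name, _CTYPES_BY_KIND[kind]) for name, kind in fmt]
    return Record

def from_ctypes(struct_type):
    """'Produce' the shared description from a ctypes Structure."""
    return [(name, _KIND_BY_CTYPES[ctype])
            for name, ctype in struct_type._fields_]

Pixel = to_ctypes(PIXEL_FORMAT)
round_trip = from_ctypes(Pixel)
```

Each module keeps its own natural representation. Only the small conversion functions need to know about the shared one.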

How much time do you actually spend with binary data (sound, video, 
images, just plain numbers from a scientific experiment) and trying to 
use multiple Python modules to manipulate it?  If you don't spend much 
time, then I can understand why you don't understand the need.
> As for MPI: It didn't just independently define a data types system.
> Instead, it did that, *and* specified the usage of the data types
> in operations such as MPI_SEND. It is very clear what the scope of
> this data description is, and what the intended usage is.
>
> Without specifying an intended usage, it is impossible to evaluate
> whether the specification meets its goals.
>   
What is not understood about the intended usage in the extended buffer 
protocol?  What is not understood about the intended usage of giving the 
array and struct modules a uniform way to represent binary data?
> Ok, that would be a new usage: I expected that datatype instances
> always come in pairs with memory allocated and filled according to
> the description. 
To me that is the most important usage, but it's not the *only* one. 

> If you are proposing to modify/extend the API
> of the struct and array modules, you should say so somewhere (in
> a PEP).
>   
Sure, I understand that.  But, if there is no data-format object, then 
there is no PEP to "extend the struct and array modules" to support it.  
Chicken and egg, and all that.
> I expect that the primary readers/users of the PEP would be people who
> have to write libraries: i.e. people implementing NumPy, struct, array,
> and people who implement algorithms that operate on data.

Yes, but not only them.  If it's a default way to represent data,  then 
*users* of those libraries that "consume" the representation would also 
benefit by learning a standard.

>  So usability
> of the specification is a matter of how easy it is to *write* a library
> that does perform the image manipulation.
>
>   
>> If you really want to know.  In NumPy it might look like this:
>>
>> Python code:
>>
>> img['r'] = img['g']
>> img['b'] = img['g']
>>     
>
> That's not what I'm asking. Instead, what does the NumPy code look
> like that gets invoked on these read-and-write operations? Does it
> only use the void* pointing to the start of the data, and the
> datatype object? If not, how would C code look like that only has
> the void* and the datatype object?
>
>   
>> dtype = img->descr;
>>     
>
> In this code, is descr a datatype object? ...
>   
Yes.  But, I have a mistake later...
>   
>> r_field = PyDict_GetItemString(dtype,'r');
>>     
Actually it should read PyDict_GetItemString(dtype->fields, "r").  The 
r_field is a tuple (data-type object, offset).  The fields attribute is 
(currently) a Python dictionary.

>
> ... I guess not, because apparently, it is a dictionary, not
> a datatype object.
>   
Sorry for the confusion. 

>   
>> But, I still don't see how that is relevant to the question of how to 
>> represent the data-format to share that information across two extensions.
>>     
>
> Well, if NumPy gets the data from a different module, it can't assume
> there is a descr object that is a dictionary. Instead, it must
> perform these operations just by using the datatype object.
Right.  I see.  Again, I made a mistake in the code.

img->descr   is a data-type object in NumPy.

img->descr->fields   is a dictionary of fields keyed by 'name', mapping 
each name to a tuple (data-type object, offset).
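
To illustrate how code holding only a raw buffer and such a field table could carry out img['r'] = img['g'], here is a minimal Python sketch. It is not NumPy's implementation. The field table, the `copy_field` helper, and the use of struct format codes in place of real data-type objects are all assumptions made for the example.

```python
import struct

# Hypothetical field table in the shape described above:
# name -> (format code, byte offset).
PIXEL_FIELDS = {"r": ("B", 0), "g": ("B", 1), "b": ("B", 2)}
PIXEL_SIZE = 3  # bytes per record

def copy_field(buf, fields, itemsize, src, dst):
    """Copy field `src` onto field `dst` in every record of `buf`."""
    src_code, src_off = fields[src]
    dst_code, dst_off = fields[dst]
    if src_code != dst_code:
        raise TypeError("fields have different formats")
    for start in range(0, len(buf), itemsize):
        (value,) = struct.unpack_from(src_code, buf, start + src_off)
        struct.pack_into(dst_code, buf, start + dst_off, value)

# Two pixels: (10, 20, 30) and (40, 50, 60).
image = bytearray([10, 20, 30, 40, 50, 60])
copy_field(image, PIXEL_FIELDS, PIXEL_SIZE, "g", "r")  # img['r'] = img['g']
copy_field(image, PIXEL_FIELDS, PIXEL_SIZE, "g", "b")  # img['b'] = img['g']
```

The function never needs to know which library produced the buffer. The field names, formats, and offsets are enough to locate and rewrite the data.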

But, the other option (especially for code already written) would be for 
the library to just convert the data-format specification into its own 
internal representation.  This is the case that I was thinking about when 
I said it didn't matter how the library operated on the data. 

If new code wanted to use the data-format object as *the* internal 
representation, then it would matter. 
>  What
> else is the purpose of sharing the information, if not to use it
> to access the data?
>   
Of course.  I'm sorry my example was incorrect.  I guess this falls 
under the category of "ease of use".

If the data-type format can *be* the internal representation, then ease 
of use is *optimal* because no translation is required.  In my ideal 
world that's the way it would be.  But, even if we can't get there 
immediately, we can at least define a standard for communication.