[Python-Dev] idea for data-type (data-format) PEP
oliphant.travis at ieee.org
Wed Nov 1 21:18:23 CET 2006
Martin v. Löwis wrote:
> Travis E. Oliphant schrieb:
>>> Or, if it does have uses independent of the buffer extension: what
>>> are those uses?
>> So that NumPy and ctypes and audio libraries and video libraries and
>> database libraries and image-file format libraries can communicate about
>> data-formats using the same expressions (in Python).
> I find that puzzling. In what way can the specification of a data type
> enable communication? Don't you need some kind of protocol for it
> (i.e. operations to be invoked)? Also, do you mean that these libraries
> can communicate with each other? Or with somebody else? If so, with
What is puzzling? I've just specified the extended buffer protocol as
something concrete that data-format objects are shared through. That's
on the C-level. I gave several examples of where such sharing would be useful.
Then, I gave examples in Python of how sharing data-formats would also
be useful so that modules could support the same means to construct
data-formats (instead of struct using strings, array using typecodes,
ctypes using its type-objects, and NumPy using dtype objects).
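
A minimal sketch of that duplication, using only the standard library. The same three C unsigned shorts are described once with a struct format string, once with an array typecode, and once with a ctypes type object; the sample values and variable names are made up for illustration. NumPy would add a fourth spelling (a dtype object), left out here to keep the sketch dependency-free.

```python
import array
import ctypes
import struct

values = [1, 2, 515]  # arbitrary sample data

# struct describes the element type with a format string.
as_struct = struct.pack("3H", *values)

# array describes the same element type with a one-character typecode.
as_array = array.array("H", values).tobytes()

# ctypes describes it with a type object.
as_ctypes = bytes((ctypes.c_ushort * 3)(*values))

# All three are native-order C unsigned shorts, so the raw bytes agree.
same_bytes = as_struct == as_array == as_ctypes
```

Three different notations end up naming one and the same memory layout, which is the redundancy a shared data-format object is meant to remove.
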
>> What problem do you have in defining a standard way to communicate about
>> binary data-formats (not just images)? I still can't figure out why you
>> are so resistant to the idea. MPI had to do it.
> I'm afraid of "dead" specifications, things whose only motivation is
> that they look nice. They are just clutter. There are a few examples
> of this already in Python, like the character buffer interface or
> the multi-segment buffers.
O.K. I can understand that concern. But, all you do is make struct,
array, and ctypes support the same data-format specification (by support
I mean have a way to "consume" and "produce" the data-format object to
the natural representation that they have internally) and you are
guaranteed it won't "die." In fact, what would be ideal is for the
PIL, NumPy, CVXOpt, PyMedia, PyGame, pyre, pympi, PyVoxel, etc., etc.
(there really are many modules that should be able to talk to each other
more easily) to all support the same data-format representations. Then,
you don't have to learn everybody's re-invention of the same concept
whenever you encounter a new library that does something with binary data.
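
A rough sketch of what "consuming" a shared description could look like. The `PIXEL` list, the kind names, and both converter functions are hypothetical stand-ins and are not the format proposed for the PEP; they only show two libraries deriving their own natural representation from one description. Padding and alignment rules are ignored, which is safe here only because the example record has one-byte fields.

```python
import ctypes
import struct

# Hypothetical shared description: (field name, kind) pairs.
PIXEL = [("r", "uint8"), ("g", "uint8"), ("b", "uint8")]

# Each library's own spelling of the kinds used above (assumed mapping).
_STRUCT_CODES = {"uint8": "B"}
_CTYPES_TYPES = {"uint8": ctypes.c_uint8}


def to_struct_format(description):
    """Consume the shared description, produce a struct format string."""
    # "=" selects native byte order with standard sizes and no padding.
    return "=" + "".join(_STRUCT_CODES[kind] for _, kind in description)


def to_ctypes_structure(description):
    """Consume the shared description, produce a ctypes Structure class."""
    fields = [(name, _CTYPES_TYPES[kind]) for name, kind in description]

    class Record(ctypes.Structure):
        _fields_ = fields

    return Record


fmt = to_struct_format(PIXEL)
Record = to_ctypes_structure(PIXEL)

# Bytes packed by struct can be read back through the ctypes view.
raw = struct.pack(fmt, 10, 20, 30)
pixel = Record.from_buffer_copy(raw)
```

With converters like these in each module, a user would learn one description and could hand it to any library that accepts it.
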
How much time do you actually spend with binary data (sound, video,
images, just plain numbers from a scientific experiment) and trying to
use multiple Python modules to manipulate it? If you don't spend much
time, then I can understand why you don't understand the need.
> As for MPI: It didn't just independently define a data types system.
> Instead, it did that, *and* specified the usage of the data types
> in operations such as MPI_SEND. It is very clear what the scope of
> this data description is, and what the intended usage is.
> Without specifying an intended usage, it is impossible to evaluate
> whether the specification meets its goals.
What is not understood about the intended usage in the extended buffer
protocol? What is not understood about the intended usage of giving the
array and struct modules a uniform way to represent binary data?
> Ok, that would be a new usage: I expected that datatype instances
> always come in pairs with memory allocated and filled according to
> the description.
To me that is the most important usage, but it's not the *only* one.
> If you are proposing to modify/extend the API
> of the struct and array modules, you should say so somewhere (in
> a PEP).
Sure, I understand that. But, if there is no data-format object, then
there is no PEP to "extend the struct and array modules" to support it.
Chicken before the egg, and all that.
> I expect that the primary readers/users of the PEP would be people who
> have to write libraries: i.e. people implementing NumPy, struct, array,
> and people who implement algorithms that operate on data.
Yes, but not only them. If it's a default way to represent data, then
*users* of those libraries that "consume" the representation would also
benefit by learning a standard.
> So usability
> of the specification is a matter of how easy it is to *write* a library
> that does perform the image manipulation.
>> If you really want to know. In NumPy it might look like this:
>> Python code:
>> img['r'] = img['g']
>> img['b'] = img['g']
> That's not what I'm asking. Instead, what does the NumPy code look
> like that gets invoked on these read-and-write operations? Does it
> only use the void* pointing to the start of the data, and the
> datatype object? If not, how would C code look like that only has
> the void* and the datatype object?
>> dtype = img->descr;
> In this code, is descr a datatype object? ...
Yes. But, I have a mistake later...
>> r_field = PyDict_GetItemString(dtype,'r');
Actually it should read PyDict_GetItemString(dtype->fields, "r"). The
r_field is a tuple (data-type object, offset). The fields attribute is
(currently) a Python dictionary.
> ... I guess not, because apparently, it is a dictionary, not
> a datatype object.
Sorry for the confusion.
>> But, I still don't see how that is relevant to the question of how to
>> represent the data-format to share that information across two extensions.
> Well, if NumPy gets the data from a different module, it can't assume
> there is a descr object that is a dictionary. Instead, it must
> perform these operations just by using the datatype object.
Right. I see. Again, I made a mistake in the code.
img->descr is a data-type object in NumPy.
img->descr->fields is a dictionary of fields keyed by 'name' and
returning a tuple (data-type object, offset)
But, the other option (especially for code already written) would be to
just convert the data-format specification into its own internal
representation. This is the case that I was thinking about when I said
it didn't matter how the library operated on the data.
If new code wanted to use the data-format object as *the* internal
representation, then it would matter.
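
For the case where the description itself drives the data access, here is a small sketch in Python rather than C. It assumes nothing but a raw buffer and a field table shaped like the fields dictionary described above, name mapped to (type, offset). The `FIELDS` table, `ITEMSIZE`, and `copy_field` are hypothetical names, and struct codes stand in for data-type objects; this is not NumPy's actual implementation.

```python
import struct

# Hypothetical stand-in for a data-format object's field table:
# name -> (struct code, byte offset within one record).
FIELDS = {"r": ("B", 0), "g": ("B", 1), "b": ("B", 2)}
ITEMSIZE = 3  # bytes per pixel record (assumed, no padding)


def copy_field(buf, fields, itemsize, src, dst):
    """For every record in buf, copy field src into field dst."""
    src_code, src_offset = fields[src]
    dst_code, dst_offset = fields[dst]
    if src_code != dst_code:
        raise TypeError("fields have different types")
    for start in range(0, len(buf), itemsize):
        (value,) = struct.unpack_from(src_code, buf, start + src_offset)
        struct.pack_into(dst_code, buf, start + dst_offset, value)


img = bytearray([1, 2, 3, 4, 5, 6])  # two RGB pixels as raw bytes

copy_field(img, FIELDS, ITEMSIZE, "g", "r")  # like img['r'] = img['g']
copy_field(img, FIELDS, ITEMSIZE, "g", "b")  # like img['b'] = img['g']
```

The function never needs to know that the bytes are an image; the offsets and types in the field table are enough to carry out both assignments.
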
> else is the purpose of sharing the information, if not to use it
> to access the data?
Of course. I'm sorry my example was incorrect. I guess this falls
under the category of "ease of use".
If the data-type format can *be* the internal representation, then ease
of use is *optimal* because no translation is required. In my ideal
world that's the way it would be. But, even if we can't get there
immediately, we can at least define a standard for communication.