PEP: Adding data-type objects to Python
PEP: <unassigned>
Title: Adding data-type objects to the standard library
Version: $Revision: $
Last-Modified: $Date: $
Author: Travis Oliphant
Travis E. Oliphant schrieb:
The datatype is an object that specifies how a certain block of memory should be interpreted as a basic data-type.
>>> datatype(float)
datatype('float64')
I can't speak on the specific merits of this proposal, or whether this kind of functionality is desirable. However, I'm -1 on the addition of a builtin for this functionality (the PEP doesn't actually say that there should be a new builtin, but the examples suggest so). Instead, putting it into the sys, array, struct, or ctypes modules might be more appropriate, as might be the introduction of another module. Regards, Martin
Travis E. Oliphant wrote:
PEP: <unassigned> Title: Adding data-type objects to the standard library
Not sure about having 3 different ways to specify the structure -- it smacks of Too Many Ways To Do It to me. Also, what if I want to refer to fields by name but don't want to have to work out all the offsets (which is tedious, error-prone and hostile to modification)? -- Greg
Greg Ewing wrote:
Travis E. Oliphant wrote:
PEP: <unassigned> Title: Adding data-type objects to the standard library
I've used 'datatype' below for consistency, but can we please call them something other than data types? Data layouts? Data formats? Binary layouts? Binary formats? 'type' is already a meaningful term in Python, and having to check whether 'data type' meant a type definition or a data format definition could get annoying.
Not sure about having 3 different ways to specify the structure -- it smacks of Too Many Ways To Do It to me.
There are actually 5 ways, but the different mechanisms all have different use
cases (and I'm going to suggest getting rid of the dictionary form).
Type-object:
Simple conversion of the builtin types (would be good for instances to be
able to hook this as with other type conversion functions).
2-tuple:
Makes it easy to specify a contiguous C-style array of a given data type.
However, rather than doing type-based dispatch here, I would prefer to see
this version handled via an optional 'shape' argument, so that all sequences
can be handled consistently (more on that below).
>>> datatype(int, 5) # short for datatype([(int, 5)])
datatype('int32', (5,))
# describes a 5*4=20-byte block of memory laid out as
# a[0], a[1], a[2], a[3], a[4]
String-object:
The basic formatting definition (I'd be interested in the differences
between this definition scheme and the struct definition scheme - one definite
goal for an implementation would be an update to the struct module to accept
datatype objects, or at least a conversion mechanism for creating a struct
layout description from a datatype definition)
List object:
As for string object, but permits naming of each of the fields. I don't
like treating tuples differently from lists, so I'd prefer for this handling
to be applied to all iterables that don't meet one of the other
special cases (direct conversion, string, dictionary).
I'd also prefer the meta-information to come *after* the name, and for the
name to be completely optional (leaving the corresponding field unnamed). So
the possible sequence entries would be:
datatype
(name, datatype)
(name, datatype, shape)
where name must be a string or 2-tuple, datatype must be acceptable as a
constructor argument, and the shape must be an integer or tuple.
For example:
datatype([(('coords', [1,2]), 'f4'),
('address', 'S30'),
])
datatype([('simple', 'i4'),
('nested', [('name', 'S30'),
('addr', 'S45'),
('amount', 'i4')
]
),
])
>>> datatype(['V8', ('var2', 'i1'), 'V3', ('var3', 'f8')])
datatype([('', '|V8'), ('var2', '|i1'), ('', '|V3'), ('var3', '<f8')])
Also, what if I want to refer to fields by name but don't want to have to work out all the offsets (which is tedious, error-prone and hostile to modification)?
Use the list definition form. In the current PEP, you would need to define
names for all of the uninteresting fields. With the changes I've suggested
above, you wouldn't even have to name the fields you don't care about - just
describe them.
Cheers,
Nick.
--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
---------------------------------------------------------------
http://www.boredomandlaziness.org
Nick Coghlan wrote:
There are actually 5 ways, but the different mechanisms all have different use cases (and I'm going to suggest getting rid of the dictionary form).
D'oh, I thought I deleted that parenthetical comment... obviously, I changed my mind on this point :) Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org
Martin v. Löwis wrote:
Travis E. Oliphant schrieb:
The datatype is an object that specifies how a certain block of memory should be interpreted as a basic data-type.
>>> datatype(float)
datatype('float64')
I can't speak on the specific merits of this proposal, or whether this kind of functionality is desirable. However, I'm -1 on the addition of a builtin for this functionality (the PEP doesn't actually say that there should be a new builtin, but the examples suggest so). Instead, putting it into the sys, array, struct, or ctypes modules might be more appropriate, as might be the introduction of another module.
I'd say the answer to where we put it will be dependent on what happens to the idea of adding a NumArray style fixed dimension array type to the standard library. If that gets exposed through the array module as array.dimarray, then it would make sense to expose the associated data layout descriptors as array.datatype. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org
Martin v. Löwis wrote:
Travis E. Oliphant schrieb:
The datatype is an object that specifies how a certain block of memory should be interpreted as a basic data-type.
>>> datatype(float)
datatype('float64')
I can't speak on the specific merits of this proposal, or whether this kind of functionality is desirable. However, I'm -1 on the addition of a builtin for this functionality (the PEP doesn't actually say that there should be a new builtin, but the examples suggest so).
I was intentionally vague. I don't see a need for it to be a built-in, but didn't know where exactly to "put it." I should have made it a question for discussion. -Travis
Greg Ewing wrote:
Travis E. Oliphant wrote:
PEP: <unassigned> Title: Adding data-type objects to the standard library
Not sure about having 3 different ways to specify the structure -- it smacks of Too Many Ways To Do It to me.
You might be right, but they all have use-cases. I've actually removed most of the multiple ways that NumPy allows for creating data-types.
Also, what if I want to refer to fields by name but don't want to have to work out all the offsets
I don't know what you mean. You just use the list-style to define a data-format with fields. The offsets are worked out for you. The only use for offsets was the dictionary form. The dictionary form stems from a desire to use the fields dictionary of a data-type as a data-type specification (which it essentially is). -Travis
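For reference, NumPy's existing dtype (the model for this proposal) already works out the offsets from a list-style definition; a minimal sketch, assuming NumPy's default unpadded packing:

import numpy as np

# List-style definition: field offsets are computed automatically.
dt = np.dtype([('name', 'S25'), ('age', 'u1'), ('weight', 'f4')])
print(dt.fields['name'])    # (dtype('S25'), 0)
print(dt.fields['age'])     # (dtype('uint8'), 25)
print(dt.fields['weight'])  # (dtype('float32'), 26)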
Travis E. Oliphant wrote:
------------------------------------------------------------------------
PEP: <unassigned> Title: Adding data-type objects to the standard library
Attributes
kind -- returns the basic "kind" of the data-type. The basic kinds are: 't' - bit, 'b' - bool, 'i' - signed integer, 'u' - unsigned integer, 'f' - floating point, 'c' - complex floating point, 'S' - string (fixed-length sequence of char), 'U' - fixed length sequence of UCS4,
Shouldn't this read "fixed length sequence of Unicode" ?! The underlying code unit format (UCS2 and UCS4) depends on the Python version.
'O' - pointer to PyObject, 'V' - Void (anything else).
-- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Oct 28 2006)
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::
M.-A. Lemburg wrote:
Travis E. Oliphant wrote:
------------------------------------------------------------------------
PEP: <unassigned> Title: Adding data-type objects to the standard library
Attributes
kind -- returns the basic "kind" of the data-type. The basic kinds are: 't' - bit, 'b' - bool, 'i' - signed integer, 'u' - unsigned integer, 'f' - floating point, 'c' - complex floating point, 'S' - string (fixed-length sequence of char), 'U' - fixed length sequence of UCS4,
Shouldn't this read "fixed length sequence of Unicode" ?! The underlying code unit format (UCS2 and UCS4) depends on the Python version.
Well, in NumPy 'U' always means UCS4. So, I just copied that over. See my questions at the bottom which talk about how to handle this. A data-format does not necessarily have to correspond to something Python represents with an Object. -Travis
Hi Travis, On Fri, Oct 27, 2006 at 02:05:31PM -0600, Travis E. Oliphant wrote:
This PEP proposes adapting the data-type objects from NumPy for inclusion in standard Python, to provide a consistent and standard way to discuss the format of binary data.
How does this compare with ctypes? Do we really need yet another, incompatible way to describe C-like data structures in the standard library? A bientot, Armin
Travis E. Oliphant wrote:
M.-A. Lemburg wrote:
Travis E. Oliphant wrote:
------------------------------------------------------------------------
PEP: <unassigned> Title: Adding data-type objects to the standard library
Attributes
kind -- returns the basic "kind" of the data-type. The basic kinds are: 't' - bit, 'b' - bool, 'i' - signed integer, 'u' - unsigned integer, 'f' - floating point, 'c' - complex floating point, 'S' - string (fixed-length sequence of char), 'U' - fixed length sequence of UCS4, Shouldn't this read "fixed length sequence of Unicode" ?! The underlying code unit format (UCS2 and UCS4) depends on the Python version.
Well, in NumPy 'U' always means UCS4. So, I just copied that over. See my questions at the bottom which talk about how to handle this. A data-format does not necessarily have to correspond to something Python represents with an Object.
Ok, but why are you being specific about UCS4 (which is an internal storage format), while you are not specific about e.g. the internal bit size of the integers (which could be 32 or 64 bit) ? -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Oct 28 2006)
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::
"M.-A. Lemburg"
Travis E. Oliphant wrote:
M.-A. Lemburg wrote:
Travis E. Oliphant wrote:
------------------------------------------------------------------------
PEP: <unassigned> Title: Adding data-type objects to the standard library
Attributes
kind -- returns the basic "kind" of the data-type. The basic kinds are: 't' - bit, 'b' - bool, 'i' - signed integer, 'u' - unsigned integer, 'f' - floating point, 'c' - complex floating point, 'S' - string (fixed-length sequence of char), 'U' - fixed length sequence of UCS4, Shouldn't this read "fixed length sequence of Unicode" ?! The underlying code unit format (UCS2 and UCS4) depends on the Python version.
Well, in NumPy 'U' always means UCS4. So, I just copied that over. See my questions at the bottom which talk about how to handle this. A data-format does not necessarily have to correspond to something Python represents with an Object.
Ok, but why are you being specific about UCS4 (which is an internal storage format), while you are not specific about e.g. the internal bit size of the integers (which could be 32 or 64 bit) ?
I think that even on 64 bit platforms, using 'int' or 'long' generally means 32 bit. In order to get 64 bit ints, one needs to use 'long long'. Sharing some of the codes with the struct module, though arbitrary, doesn't seem like a bad idea to me. Of course offering specifically 32 and 64 bit ints would make sense to me. - Josiah
Josiah Carlson wrote:
I think that even on 64 bit platforms, using 'int' or 'long' generally means 32 bit. In order to get 64 bit ints, one needs to use 'long long'.
real 64-bit platforms use the LP64 standard, where long and pointers are both 64 bits: http://www.unix.org/version2/whatsnew/lp64_wp.html </F>
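This is easy to check from Python itself; a quick sketch (the printed sizes assume an LP64 Unix versus an LLP64 Windows or 32-bit build):

import ctypes

print(ctypes.sizeof(ctypes.c_int))       # 4 on practically all current platforms
print(ctypes.sizeof(ctypes.c_long))      # 8 on LP64 Unix, 4 on Windows and 32-bit systems
print(ctypes.sizeof(ctypes.c_longlong))  # 8 everywhere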
M.-A. Lemburg wrote:
Travis E. Oliphant wrote:
M.-A. Lemburg wrote:
Travis E. Oliphant wrote:
------------------------------------------------------------------------
PEP: <unassigned> Title: Adding data-type objects to the standard library
Attributes
kind -- returns the basic "kind" of the data-type. The basic kinds are: 't' - bit, 'b' - bool, 'i' - signed integer, 'u' - unsigned integer, 'f' - floating point, 'c' - complex floating point, 'S' - string (fixed-length sequence of char), 'U' - fixed length sequence of UCS4, Shouldn't this read "fixed length sequence of Unicode" ?! The underlying code unit format (UCS2 and UCS4) depends on the Python version. Well, in NumPy 'U' always means UCS4. So, I just copied that over. See my questions at the bottom which talk about how to handle this. A data-format does not necessarily have to correspond to something Python represents with an Object.
Ok, but why are you being specific about UCS4 (which is an internal storage format), while you are not specific about e.g. the internal bit size of the integers (which could be 32 or 64 bit) ?
The 'kind' does not specify how "big" the data-type (data-format) is. A number is needed to represent the number of bytes. In this case, the 'kind' does not specify how large the data-type is. You can have 'u1', 'u2', 'u4', etc. The same is true with Unicode. You can have 10-character unicode elements, 20-character, etc. But, we have to be clear about what a "character" is in the data-format. -Travis
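In NumPy, the reference implementation here, the number after the kind scales differently for 'U': it counts characters rather than bytes. A small illustration:

import numpy as np

print(np.dtype('u4').itemsize)   # 4  -- a 4-byte unsigned integer
print(np.dtype('U10').itemsize)  # 40 -- ten UCS4 characters at 4 bytes each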
Armin Rigo wrote:
Hi Travis,
On Fri, Oct 27, 2006 at 02:05:31PM -0600, Travis E. Oliphant wrote:
This PEP proposes adapting the data-type objects from NumPy for inclusion in standard Python, to provide a consistent and standard way to discuss the format of binary data.
How does this compare with ctypes? Do we really need yet another, incompatible way to describe C-like data structures in the standard library?
Part of what the data-type, data-format object is trying to do is bring together all the disparate ways to represent data that *already* exist in the standard library. What is needed is a definitive way to describe data and then have

array
struct
ctypes

all be compatible with that same method. That's why I'm proposing the PEP. It's a unification effort, not yet-another-method.

One of the big reasons for it is to move something like the array interface into Python. There are tens to hundreds of people, mostly in the scientific computing community, that want to see Python grow more support for NumPy-like things. I keep getting requests to "do something" to make Python more aware of arrays. This PEP is part of that effort. In particular, something like the array interface should be available in Python. The easiest way to do this is to extend the buffer protocol to allow objects to share information about shape, strides, and data-format of a block of memory.

But, how do you represent data-format in Python? What will the objects pass back and forth to each other to do it? C-types has a solution which creates multiple objects to do it. This is an unwieldy, over-complicated solution for the array interface. The array objects have a solution using a single object that carries the data-format information. The solution we have for arrays deserves consideration. It could be placed inside the array module if desired, but again, I'm really looking for something that would allow the extended buffer protocol (to be proposed soon) to share data-type information.

That could be done with the array-interface objects (strings, lists, and tuples), but then everybody who uses the interface will have to write their own "decoders" to process the data-format information. I actually think ctypes would benefit from this data-format specification too. Recognizing all these diverging ways to essentially talk about the same thing is part of what prompted this PEP. -Travis
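The divergence being described is visible today: the standard library already spells "three contiguous C ints" three different ways (a sketch; the sizes assume a platform with 4-byte ints):

import array
import ctypes
import struct

a = array.array('i', [1, 2, 3])  # array: one-character typecode
n = struct.calcsize('3i')        # struct: format string
CInt3 = ctypes.c_int * 3         # ctypes: a dedicated type object

# All three describe the same 12-byte layout.
print(a.itemsize * len(a), n, ctypes.sizeof(CInt3))  # 12 12 12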
Travis E. Oliphant schrieb:
In this case, the 'kind' does not specify how large the data-type is. You can have 'u1', 'u2', 'u4', etc.
The same is true with Unicode. You can have 10-character unicode elements, 20-character, etc. But, we have to be clear about what a "character" is in the data-format.
That is certainly confusing. In u1, u2, u4, the digit seems to indicate the size of a single value (1 byte, 2 bytes, 4 bytes). Right? Yet, in U20, it does *not* indicate the size of a single value but of an array? And then, it's not the size, but the number of elements? Regards, Martin
Travis E. Oliphant wrote:
M.-A. Lemburg wrote:
Travis E. Oliphant wrote:
M.-A. Lemburg wrote:
Travis E. Oliphant wrote:
------------------------------------------------------------------------
PEP: <unassigned> Title: Adding data-type objects to the standard library
Attributes
kind -- returns the basic "kind" of the data-type. The basic kinds are: 't' - bit, 'b' - bool, 'i' - signed integer, 'u' - unsigned integer, 'f' - floating point, 'c' - complex floating point, 'S' - string (fixed-length sequence of char), 'U' - fixed length sequence of UCS4, Shouldn't this read "fixed length sequence of Unicode" ?! The underlying code unit format (UCS2 and UCS4) depends on the Python version. Well, in NumPy 'U' always means UCS4. So, I just copied that over. See my questions at the bottom which talk about how to handle this. A data-format does not necessarily have to correspond to something Python represents with an Object. Ok, but why are you being specific about UCS4 (which is an internal storage format), while you are not specific about e.g. the internal bit size of the integers (which could be 32 or 64 bit) ?
The 'kind' does not specify how "big" the data-type (data-format) is. A number is needed to represent the number of bytes.
In this case, the 'kind' does not specify how large the data-type is. You can have 'u1', 'u2', 'u4', etc.
The same is true with Unicode. You can have 10-character unicode elements, 20-character, etc. But, we have to be clear about what a "character" is in the data-format.
I understand and that's why I'm asking why you made the range explicit in the definition. The definition should talk about Unicode code points. The number of bytes then determines whether you can only represent the ASCII subset (1 byte), UCS2 (2 bytes, BMP only) or UCS4 (4 bytes, all currently assigned code points). This is similar to the range for integers (ie. ZZ_0), where the number of bytes determines the range of numbers that can be represented. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Oct 28 2006)
Python/Zope Consulting and Support ... http://www.egenix.com/ mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/
::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::
Nick Coghlan wrote:
Greg Ewing wrote:
Also, what if I want to refer to fields by name but don't want to have to work out all the offsets
Use the list definition form. With the changes I've suggested above, you wouldn't even have to name the fields you don't care about - just describe them.
That would be okay. I still don't see a strong justification for having a one-big-string form as well as a list/tuple/dict form, though. -- Greg
Nick Coghlan wrote:
I'd say the answer to where we put it will be dependent on what happens to the idea of adding a NumArray style fixed dimension array type to the standard library. If that gets exposed through the array module as array.dimarray, then it would make sense to expose the associated data layout descriptors as array.datatype.
Seems to me that arrays are a sub-concept of binary data, not the other way around. So maybe both arrays and data types should be in a module called 'binary' or some such. -- Greg
Martin v. Löwis wrote:
Travis E. Oliphant schrieb:
In this case, the 'kind' does not specify how large the data-type is. You can have 'u1', 'u2', 'u4', etc.
The same is true with Unicode. You can have 10-character unicode elements, 20-character, etc. But, we have to be clear about what a "character" is in the data-format.
That is certainly confusing. In u1, u2, u4, the digit seems to indicate the size of a single value (1 byte, 2 bytes, 4 bytes). Right? Yet, in U20, it does *not* indicate the size of a single value but of an array? And then, it's not the size, but the number of elements?
Good point. In NumPy, unicode support was added "in parallel" with string arrays where there is not the ambiguity. So, yes, it's true that the unicode case is a special-case. The other way to handle it would be to describe the 'code'-point size (i.e. 'U1', 'U2', 'U4' for UCS-1, UCS-2, UCS-4) and then have the length be encoded as an "array" of those types. This was not the direction we took with NumPy (which is what I'm using as a reference) because I wanted Unicode and string arrays to look the same and thought of strings differently. How to handle unicode data-formats could definitely be improved. Suggestions are welcome. -Travis
Greg Ewing wrote:
Nick Coghlan wrote:
I'd say the answer to where we put it will be dependent on what happens to the idea of adding a NumArray style fixed dimension array type to the standard library. If that gets exposed through the array module as array.dimarray, then it would make sense to expose the associated data layout descriptors as array.datatype.
Seem to me that arrays are a sub-concept of binary data, not the other way around. So maybe both arrays and data types should be in a module called 'binary' or some such.
Yes, very good point. That's probably one reason I'm proposing the data-type first before the array interface in the extended buffer protocol. -Travis
Greg Ewing wrote:
Nick Coghlan wrote:
Greg Ewing wrote:
Also, what if I want to refer to fields by name but don't want to have to work out all the offsets
Use the list definition form. With the changes I've suggested above, you wouldn't even have to name the fields you don't care about - just describe them.
That would be okay.
I still don't see a strong justification for having a one-big-string form as well as a list/tuple/dict form, though.
Compaction of representation is all. It's used quite a bit in numarray, which is where most of the 'kind' names came from as well. When you don't want to name fields it is a really nice feature (but it doesn't nest well). -Travis
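NumPy still supports this compact spelling; the string form and the list form produce identical layouts (unnamed fields get the default names f0, f1, ...):

import numpy as np

compact = np.dtype('i4,f8,S10')
explicit = np.dtype([('f0', 'i4'), ('f1', 'f8'), ('f2', 'S10')])
print(compact == explicit)  # True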
Greg Ewing wrote:
Travis E. Oliphant wrote:
The 'kind' does not specify how "big" the data-type (data-format) is.
What exactly does "bit" mean in that context?
Do you mean "big" ? It's how many bytes the kind is using. So, 'u4' is a 4-byte unsigned integer and 'u2' is a 2-byte unsigned integer. -Travis
Travis E. Oliphant schrieb:
What is needed is a definitive way to describe data and then have
array
struct
ctypes
all be compatible with that same method. That's why I'm proposing the PEP. It's a unification effort not yet-another-method.
As a unification mechanism, I think it is insufficient. I doubt it can express all the concepts that ctypes supports. Regards, Martin
Martin v. Löwis wrote:
Travis E. Oliphant schrieb:
What is needed is a definitive way to describe data and then have
array
struct
ctypes
all be compatible with that same method. That's why I'm proposing the PEP. It's a unification effort not yet-another-method.
As a unification mechanism, I think it is insufficient. I doubt it can express all the concepts that ctypes supports.
What do you think is missing that can't be added? -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
Travis E. Oliphant schrieb:
How to handle unicode data-formats could definitely be improved.
As before, I'm doubtful what the actual needs are. For example, is it desired to support generation of ID3v2 tags with such a data format? The tag is specified here: http://www.id3.org/id3v2.4.0-structure.txt In ID3v1, text fields have a specified width, and are supposed to be encoded in Latin-1, and padded with zero bytes. In ID3v2, text fields start with an encoding declaration (say, \x03 for UTF-8), then followed with a null-terminated sequence of UTF-8 bytes. Is it the intent of this PEP to support such data structures, and allow the user to fill in a Unicode object, and then the processing is automatic? (i.e. in ID3v1, the string gets automatically Latin-1-encoded and zero-padded, in ID3v2, it gets automatically UTF-8 encoded, and null-terminated) If that is not to be supported, what are the use cases? Regards, Martin
Martin v. Löwis wrote:
Travis E. Oliphant schrieb:
What is needed is a definitive way to describe data and then have
array
struct
ctypes
all be compatible with that same method. That's why I'm proposing the PEP. It's a unification effort not yet-another-method.
As a unification mechanism, I think it is insufficient. I doubt it can express all the concepts that ctypes supports.
Please clarify what you mean. Are you saying that a single object can't carry all the information about binary data that ctypes allows with its multi-object approach? I don't agree with you, if that is the case. Sure, perhaps I've not included certain cases, so give an example.

Besides, I don't think this is the right view of "unification". I'm not saying that ctypes should get rid of its many objects used for interfacing with C-functions. I'm saying we should introduce a single-object mechanism for describing binary data so that the many-object approach of c-types does not become some kind of de-facto standard. C-types can "translate" this object-instance to its internals if and when it needs to. In the mean-time, how are other packages supposed to communicate binary information about data with each other?

Remember the context that the data-format object is presented in. Two packages need to share a chunk of memory (the package authors do not know each other and only have Python as a common reference). They both want to describe that the memory they are sharing has some underlying binary structure. How do they do that? Please explain to me how the buffer protocol can be extended so that information about "what is in the memory" can be shared without a data-format object? -Travis
Martin v. Löwis wrote:
Travis E. Oliphant schrieb:
How to handle unicode data-formats could definitely be improved.
As before, I'm doubtful what the actual needs are. For example, is it desired to support generation of ID3v2 tags with such a data format? The tag is specified here:
Perhaps I was not clear enough about what I'm trying to do. For a long time a lot of people have wanted something like Numeric in Python itself. There have been many hurdles to that goal. After discussions at SciPy 2006 with Guido, we decided that the best way to proceed at this point was to extend the buffer protocol to allow packages to share array-like information with each other. There are several things missing from the buffer protocol that NumPy needs in order to be able to really understand the (fixed-size) memory another package has allocated and is sharing. The most important of these are

1) Shape information
2) Striding information
3) Data-format information (how is each element perceived).

Shape and striding information can be shared with a C-array of integers. How is data-format information supposed to be shared? We've come up with a very flexible way to do this in NumPy using a single Python object. This Python object supports describing the layout of any fixed-size chunk of memory (right now in units of bytes --- bit fields could be added, though). I'm proposing to add this object to Python so that the buffer protocol has a fast and efficient way to share #3. That's really all I'm after.

It also bothers me that so many ways to describe binary data are being used out there. This is a problem that deserves being solved. And, no, ctypes hasn't solved it (we can't directly use the ctypes solution). Perhaps this PEP doesn't hit all the corners, but a data-format object *is* a useful thing to consider. The array object in Python already has a PyArray_Descr * structure that is a watered-down version of what I'm talking about. In fact, this is what Numeric built from (or vice-versa actually). And NumPy has greatly enhanced this object for any conceivable structure. Guido seemed to think the data-type objects were nice when he saw them at SciPy 2006, and so I'm presenting a PEP.

Without the data-format object, I don't know how to extend the buffer protocol to communicate data-format information. Do you have a better idea? I have no trouble limiting the data-type object to the buffer protocol extension PEP, but I do think it could gain wider use.
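NumPy's existing array interface already exposes exactly these three pieces per array; a small illustration (typestr shown for a little-endian machine):

import numpy as np

a = np.zeros((3, 4), dtype='f8')
info = a.__array_interface__
print(info['shape'])    # (3, 4)
print(info['strides'])  # None, meaning C-contiguous
print(info['typestr'])  # '<f8' -- the data-format piece under discussion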
Is it the intent of this PEP to support such data structures, and allow the user to fill in a Unicode object, and then the processing is automatic? (i.e. in ID3v1, the string gets automatically Latin-1-encoded and zero-padded, in ID3v2, it gets automatically UTF-8 encoded, and null-terminated)
No, the point of the data-format object is to communicate information about data-formats not to encode or decode anything. Users of the data-format object could decide what they wanted to do with that information. We just need a standard way to communicate it through the buffer protocol. -Travis
Travis E. Oliphant schrieb:
I'm proposing to add this object to Python so that the buffer protocol has a fast and efficient way to share #3. That's really all I'm after.
I admit that I don't understand this objective. Why is it desirable to support such an extended buffer protocol? What specific application would be made possible if it was available and implemented in the relevant modules and data types? What are the relevant modules and data types that should implement it?
It also bothers me that so many ways to describe binary data are being used out there. This is a problem that deserves being solved. And, no, ctypes hasn't solved it (we can't directly use the ctypes solution). Perhaps this PEP doesn't hit all the corners, but a data-format object *is* a useful thing to consider.
IMO, it is only useful if it realistically can support all the use cases
that it intends to support. If this PEP is about defining the elements
of arrays, I doubt it can realistically support everything you can
express in ctypes. There is no support for pointers (except for
PyObject*), no support for incomplete (recursive) types, no support
for function pointers, etc.
Vice versa: why exactly can't you use the data type system of ctypes?
If I want to say "int[10]", I do

py> ctypes.c_long * 10
<class 'ctypes.c_long_Array_10'>
Guido seemed to think the data-type objects were nice when he saw them at SciPy 2006, and so I'm presenting a PEP.
I have no objection to including NumArray as-is into Python. I just wonder where the rationale for this PEP comes from, i.e. why do you need to exchange this information across different modules?
Without the data-format object, I don't know how to extend the buffer protocol to communicate data-format information. Do you have a better idea?
See above: I can't understand where the need for an extended buffer protocol comes from. I can see why NumArray needs reflection, and needs to keep information to interpret the bytes in the array. But why is it important that the same information is exposed by other data types?
Is it the intent of this PEP to support such data structures, and allow the user to fill in a Unicode object, and then the processing is automatic? (i.e. in ID3v1, the string gets automatically Latin-1-encoded and zero-padded, in ID3v2, it gets automatically UTF-8 encoded, and null-terminated)
No, the point of the data-format object is to communicate information about data-formats not to encode or decode anything. Users of the data-format object could decide what they wanted to do with that information. We just need a standard way to communicate it through the buffer protocol.
This was actually a different sub-thread: why do you need to support the 'U' code (or the 'S' code, for that matter)? In what application do you have fixed size Unicode arrays, as opposed to Unicode strings? Regards, Martin
Travis E. Oliphant schrieb:
As a unification mechanism, I think it is insufficient. I doubt it can express all the concepts that ctypes supports.
Please clarify what you mean.
Are you saying that a single object can't carry all the information about binary data that ctypes allows with it's multi-object approach?
I'm not sure what you mean by "single object". If I use the tuple syntax, e.g. datatype((float, (3,2))), there are also multiple objects (the float, the 3, and the 2). You get a single "root" object back, but so do you in ctypes. But this isn't really what I meant. Instead, I think the PEP lacks various concepts from C data types, such as pointers, unions, function pointers, alignment/packing.
In the mean-time, how are other packages supposed to communicate binary information about data with each other?
This is my other question. Why should they?
Remember the context that the data-format object is presented in. Two packages need to share a chunk of memory (the package authors do not know each other and only have and Python as a common reference). They both want to describe that the memory they are sharing has some underlying binary structure.
Can you please give an example of such two packages, and an application that needs them share data? Regards, Martin
Robert Kern schrieb:
As a unification mechanism, I think it is insufficient. I doubt it can express all the concepts that ctypes supports.
What do you think is missing that can't be added?
I can factually only report what is missing. Whether it can be added, I don't know. As I just wrote in a few other messages: pointers, unions, function pointers, packed structs, incomplete/recursive types. Also "flexible array members" (i.e. open-ended arrays). While it may be possible to come up with a string syntax to describe all these things (*), I wonder whether it should be done, and whether NumArray can then support this extended data model. Regards, Martin (*) perhaps with the exception of incomplete types: C needs forward references in its own syntax.
I have watched numpy with interest for a long time. My own interest is to possibly use the c-api to wrap c++ algorithms to use from python. One thing that has concerned me, and continues to concern me with this proposal, is that it seems to suffer from a very fat interface. I certainly have not studied the options in any depth, but my gut feeling is that the interface is too fat and too complex. I wonder if it's possible to avoid this. I wonder if this is an example of all the methods sinking to the base class.
On 10/29/06, "Martin v. Löwis"
Travis E. Oliphant schrieb:
Remember the context that the data-format object is presented in. Two packages need to share a chunk of memory (the package authors do not know each other and only have and Python as a common reference). They both want to describe that the memory they are sharing has some underlying binary structure.
Can you please give an example of such two packages, and an application that needs them share data?
Here's an example. PIL handles images (in various formats) in memory, as blocks of binary image data. NumPy provides methods for manipulating in-memory blocks of data. Now, if I want to use NumPy to manipulate that data in place (for example, to cap the red component at 128, and equalise the range of the green component) my code needs to know the format of the memory block that PIL exposes. I am assuming that in-place manipulation is better, because there is no need for repeated copies of the data to be made (this would be true for large images). If PIL could expose a descriptor for its data structure, NumPy code could manipulate it in place without fear of corrupting it. Of course, this can be done by the end user reading the PIL documentation and transcribing the documented format into the NumPy code. But I would argue that it's better if the PIL block is self-describing in a way that avoids the need for a manual transcription of the format. To do this *without* needing the PIL and NumPy developers to co-operate needs an independent standard, which is what I assume this PEP is intended to provide. Paul.
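A sketch of the in-place manipulation being described, with a bytearray standing in for the pixel buffer a PIL image would expose (the RGBA layout is an assumption for illustration):

import numpy as np

h, w = 64, 64
buf = bytearray(h * w * 4)  # stand-in for a shared RGBA pixel buffer

# View the shared memory without copying, then cap red at 128 in place.
px = np.frombuffer(buf, dtype=np.uint8).reshape(h, w, 4)
px[..., 0] = np.minimum(px[..., 0], 128)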
"Paul Moore"
On 10/29/06, "Martin v. Löwis"
wrote: Travis E. Oliphant schrieb:
Remember the context that the data-format object is presented in. Two packages need to share a chunk of memory (the package authors do not know each other and only have and Python as a common reference). They both want to describe that the memory they are sharing has some underlying binary structure.
Can you please give an example of such two packages, and an application that needs them share data?
To do this *without* needing the PIL and NumPy developers to co-operate needs an independent standard, which is what I assume this PEP is intended to provide.
One could also toss wxPython, VTK, or any one of the other GUI libraries into the mix for visualizing those images, of which wxPython just acquired no-copy display of PIL images, and being able to manipulate them with numpy (of which some wxPython built-in classes use numpy to speed up manipulation) would be very useful. Of all of the intended uses, I'd say that zero-copy sharing of information on the graphics/visualization front is the most immediate 'people will be using it tomorrow' feature. I personally don't have my pulse on the Scientific Python community, so I don't know about other uses, but in regards to Martin's list of missing features: "pointers, unions, function pointers, alignment/packing [, etc.]" I'm going to go out on a limb and say for the majority of those YAGNI, or really, NOHAFIAFACT (no one has asked for it, as far as I can tell). Someone who knows the scipy community, feel free to correct me. - Josiah
"Martin v. Löwis"
Josiah Carlson schrieb:
One could also toss wxPython, VTK, or any one of the other GUI libraries into the mix for visualizing those images, of which wxPython just acquired no-copy display of PIL images, and being able to manipulate them with numpy (of which some wxPython built in classes use numpy to speed up manipulation) would be very useful.
I'm doubtful that this PEP alone would allow zero-copy sharing of images for display. Often, the libraries need the data in a different format. So they need to copy, even if they could understand the other format. However, the PEP won't allow "understanding" the format. If I know I have an array of 4-byte values: which of them is R, G, B, and A?
...in the cases I have seen, which includes BMP, TGA, uncompressed TIFF, a handful of platform-specific bitmap formats, etc., you _always_ get them in RGBA order. If the alpha channel is to be left out, then you get them as RGB. The trick with allowing zero-copy sharing is 1) to understand the format, and 2) to manipulate/display in-place. The former is necessary for the latter, which is what Travis is shooting for. Also, because wxPython has figured out how PIL images are structured, they can do #2, and so far no one has mentioned any examples where the standard RGB/RGBA format hasn't worked for them. In the case of jpegs (as you mentioned in another message), PIL uncompresses all images it understands into some kind of 'natural' format (from what I understand). For 24/32 bit images, that is RGB or RGBA. For palettized images (gif, 8-bit png, 8-bit bmp, etc.) maybe it is a palettized format, or maybe it is RGB/RGBA? I don't know, all of my images are 24/32 bit, but I can just about guarantee it's not an issue for the case that Paul mentioned. - Josiah
Paul Moore schrieb:
Here's an example. PIL handles images (in various formats) in memory, as blocks of binary image data. NumPy provides methods for manipulating in-memory blocks of data. Now, if I want to use NumPy to manipulate that data in place (for example, to cap the red component at 128, and equalise the range of the green component) my code needs to know the format of the memory block that PIL exposes. I am assuming that in-place manipulation is better, because there is no need for repeated copies of the data to be made (this would be true for large images).
Thanks, that looks like a good example. Is it possible to elaborate that? E.g. what specific image format would I use (could that work for jpeg, even though this format has compression in it), and what specific NumPy routines would I use to implement the capping and equalising? What would the datatype description look like that those tools need to exchange?

Looking at this in more detail, PIL in-memory images (ImagingCore objects) either have the image8 UINT8**, or the image32 INT32**; they have separate fields for pixelsize and linesize. In the image8 case, there are three options:
- each value is an 8-bit integer (IMAGING_TYPE_UINT8) (1)
- each value is a 16-bit integer, either little (2) or big endian (3) (IMAGING_TYPE_SPECIAL, mode either I;16 or I;16B)

In the image32 case, there are five options:
- two 8-bit values per four bytes, namely byte 0 and byte 3 (4)
- three 8-bit values (bytes 0, 1, 2) (5)
- four 8-bit values (6)
- a single 32-bit int (7)
- a single 32-bit float (8)

Now, what would be the algorithm in NumPy that I could use to implement capping and equalising?
If PIL could expose a descriptor for its data structure, NumPy code could manipulate it in place without fear of corrupting it. Of course, this can be done by the end user reading the PIL documentation and transcribing the documented format into the NumPy code. But I would argue that it's better if the PIL block is self-describing in a way that avoids the need for a manual transcription of the format.
Without digging further, I think some of the formats simply don't allow for the kind of manipulation you suggest, namely all palette formats (which are the single-valued ones, plus the two-band version with a palette number and an alpha value), and greyscale images. So in any case, the application has to look at the mode of the image to find out whether the operation is even meaningful. And then, the application has to tell NumPy somehow what fields to operate on.
To do this *without* needing the PIL and NumPy developers to co-operate needs an independent standard, which is what I assume this PEP is intended to provide.
Ok, I now understand the goal, although I'd still like to understand this use case better. Regards, Martin
Josiah Carlson schrieb:
One could also toss wxPython, VTK, or any one of the other GUI libraries into the mix for visualizing those images, of which wxPython just acquired no-copy display of PIL images, and being able to manipulate them with numpy (of which some wxPython built in classes use numpy to speed up manipulation) would be very useful.
I'm doubtful that this PEP alone would allow zero-copy sharing of images for display. Often, the libraries need the data in a different format. So they need to copy, even if they could understand the other format. However, the PEP won't allow "understanding" the format. If I know I have an array of 4-byte values: which of them is R, G, B, and A? Regards, Martin
Travis E. Oliphant wrote:
Greg Ewing wrote:
What exactly does "bit" mean in that context?
Do you mean "big" ?
No, you've got a data type there called "bit", which seems to imply a size, in contradiction to the size-independent nature of the other types. I'm asking what size-independent information it's meant to convey. -- Greg
Travis E. Oliphant wrote:
Martin v. Löwis wrote:
Travis E. Oliphant schrieb:
Is it the intent of this PEP to support such data structures, and allow the user to fill in a Unicode object, and then the processing is automatic?
No, the point of the data-format object is to communicate information about data-formats not to encode or decode anything.
Well, there's still the issue of how much detail you want to be able to convey, so I think the question is valid. Is the encoding of a Unicode string something we want to be able to communicate via this mechanism, or is that outside its scope? -- Greg
Josiah Carlson wrote:
...in the cases I have seen ... you _always_ get them in RGBA order.
Except when you don't. I've had cases where I've had to convert between RGBA and BGRA (for stuffing directly into a frame buffer on Linux, as far as I remember). So it may be worth including some features in the standard for describing pixel formats. Pygame seems to have a very detailed and flexible system for doing this, so it might be a good idea to have a look at that. -- Greg
Neal Becker wrote:
I have watched numpy with interest for a long time. My own interest is to possibly use the c-api to wrap c++ algorithms to use from python.
One thing that has concerned me, and continues to concern me with this proposal, is that it seems to suffer from a very fat interface. I certainly have not studied the options in any depth, but my gut feeling is that the interface is too fat and too complex. I wonder if it's possible to avoid this. I wonder if this is an example of all the methods sinking to the base class.
You've just described my number #1 concern with incorporating NumPy wholesale, and the reason I believe it would be nice to cherry-pick a couple of key components for the standard library, rather than adopting the whole thing. Travis has done a lot of work towards that goal (the latest result of which is this pre-PEP for describing the individual array elements in a way that is more flexible than the single character codes of the current array module). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia --------------------------------------------------------------- http://www.boredomandlaziness.org
Would it be possible to make the data-type objects subclassable, with
the subclasses being able to override the equality test?
The range of data types that you've specified in the PEP are good
enough for most general use, and probably for NumPy as well, but
someone already came up with the example of image formats, which have
their whole own range of data formats. I could throw in audio formats
(bits per sample, excess-N or signed or ulaw samples, mono/stereo/5.1/
etc, order of the channels), and there's probably a whole slew of
other areas that have their own sets of formats.
If the datatype objects are subclassable, modules could initially
start by adding their own formats. So, the "jackaudio" and
"jillaudio" modules would have distinct sets of formats. But then
later on it should be fairly easy for them to recognize each others
formats. So, jackaudio would recognize the jillaudio format "msdos
linear pcm" as being identical to its own "16-bit excess-32768".
Hopefully eventually all audio module writers would get together and
define a set of standard audio formats.
--
Jack Jansen,
...in the cases I have seen, which includes BMP, TGA, uncompressed TIFF, a handful of platform-specific bitmap formats, etc., you _always_ get them in RGBA order. If the alpha channel is to be left out, then you get them as RGB.
Mac OS X unfortunately uses ARGB. Writing some AltiVec code remedied that for passing it around to the OpenCV library. Just my $.02 Diez
Martin v. Löwis wrote:
Josiah Carlson schrieb:
One could also toss wxPython, VTK, or any one of the other GUI libraries into the mix for visualizing those images, of which wxPython just acquired no-copy display of PIL images, and being able to manipulate them with numpy (of which some wxPython built in classes use numpy to speed up manipulation) would be very useful.
I'm doubtful that this PEP alone would allow zero-copy sharing of images for display. Often, the libraries need the data in a different format. So they need to copy, even if they could understand the other format. However, the PEP won't allow "understanding" the format. If I know I have an array of 4-byte values: which of them is R, G, B, and A?
You give a name to the fields: 'R', 'G', 'B', and 'A'. -Travis
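In NumPy notation, a self-describing pixel layout could look like the following sketch (the field names are whatever the exporting library chooses):

import numpy as np

rgba = np.dtype([('R', 'u1'), ('G', 'u1'), ('B', 'u1'), ('A', 'u1')])
img = np.zeros((2, 2), dtype=rgba)
img['R'] = 255  # address the red channel by name, not by byte position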
Greg Ewing wrote:
Travis E. Oliphant wrote:
Greg Ewing wrote:
What exactly does "bit" mean in that context?
Do you mean "big" ?
No, you've got a data type there called "bit", which seems to imply a size, in contradiction to the size-independent nature of the other types. I'm asking what size-independent information it's meant to convey.
Ah. I see what you were saying now. I guess the 'bit' type is different (we actually don't have that type in NumPy so my understanding of it is limited). The 'bit' type re-interprets the size information to be in units of "bits" and so implies a "bit-field" instead of another data-format. -Travis
Martin v. Löwis wrote:
Robert Kern schrieb:
As a unification mechanism, I think it is insufficient. I doubt it can express all the concepts that ctypes supports.
What do you think is missing that can't be added?
I can factually only report what is missing. Whether it can be added, I don't know. As I just wrote in a few other messages: pointers, unions, function pointers, packed structs, incomplete/recursive types. Also "flexible array members" (i.e. open-ended arrays).
I understand function pointers, pointers, and unions. Function pointers are "supported" with the void data-type and could be more specifically supported if it were important. People typically don't use the buffer protocol to send function-pointers around in a way that the void description wouldn't be enough. Pointers are also "supported" with the void data-type. If pointers to other data-types were an important feature to support, then this could be added in many ways (a flag on the data-type object, for example, is how this is done in NumPy). Unions are actually supported (just define two fields with the same offset).

I don't know what you mean by "packed structs" (unless you are talking about alignment issues, in which case there is support for it). I'm not sure I understand what you mean by "incomplete / recursive" types unless you are referring to something like a node where an element of the structure is a pointer to another structure of the same kind (like used in linked-lists or trees). If that is the case, then it's easily supported once support for pointers is added. I also don't know what you mean by "open-ended arrays." The data-format is meant to describe a fixed-size chunk of data.

String syntax is not needed to support all of these things. What I'm asking for and proposing is a way to construct an instance of a single Python type that communicates this data-format information in a standardized way. -Travis
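The same-offset trick for unions can be written in NumPy's dictionary form today; a small sketch (the field names here are illustrative):

import numpy as np

# A 4-byte "union": the same bytes readable as int32 or float32,
# expressed as two fields sharing offset 0.
u = np.dtype({'names': ['as_int', 'as_float'],
              'formats': ['<i4', '<f4'],
              'offsets': [0, 0]})
print(u.itemsize)  # 4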
Travis Oliphant schrieb:
Function pointers are "supported" with the void data-type and could be more specifically supported if it were important. People typically don't use the buffer protocol to send function-pointers around in a way that the void description wouldn't be enough.
As I said before, I can't tell whether it's important, as I still don't know what the purpose of this PEP is. If it is to support a unification of memory layout specifications, and if that unification is also to include ctypes, then yes, it is important. If it is to describe array elements in NumArray arrays, then it might not be important. For the usage of ctypes, the PEP void type is insufficient to describe function pointers: you also need a specification of the signature of the function pointer (parameter types and return type), or else you can't use the function pointer (i.e. you can't call the function).
Pointers are also "supported" with the void data-type. If pointers to other data-types were an important feature to support, then this could be added in many ways (a flag on the data-type object for example is how this is done is NumPy).
For ctypes, (I think) you need "true" pointers to other layouts, or else you couldn't set up the memory correctly. I don't understand how this could work with some extended buffer protocol, though: would a buffer still have to be a contiguous piece of memory? If you have structures with pointers in them, they rarely point to contiguous memory.
Unions are actually supported (just define two fields with the same offset).
Ah, ok. What's the string syntax for it?
I don't know what you mean by "packed structs" (unless you are talking about alignment issues in which case there is support for it).
Yes, this is indeed about alignment; I missed it. What's the string syntax for it?
I'm not sure I understand what you mean by "incomplete / recursive" types unless you are referring to something like a node where an element of the structure is a pointer to another structure of the same kind (like used in linked-lists or trees). If that is the case, then it's easily supported once support for pointers is added.
That's what I mean, yes. I'm not sure how it can easily be added, though. Suppose you want to describe

struct item {
    int key;
    char *value;
    struct item *next;
};

How would you do that? Something like

item = datatype([('key', 'i4'),
                 ('value', 'S*'),
                 ('next', 'what_to_put_here*')])

can't work: item hasn't been assigned, yet, so you can't use it as the field type.
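For comparison, ctypes handles this exact case with an incomplete type: declare the Structure first, then assign _fields_ afterwards so the self-reference can be resolved:

import ctypes

class Item(ctypes.Structure):
    pass  # incomplete type; fields are filled in below

Item._fields_ = [('key', ctypes.c_int),
                 ('value', ctypes.c_char_p),
                 ('next', ctypes.POINTER(Item))]  # self-reference now works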
I also don't know what you mean by "open-ended arrays." The data-format is meant to describe a fixed-size chunk of data.
I see. In C (and thus in ctypes), you sometimes have what C99 calls a "flexible array member":

struct PyString {
    Py_ssize_t ob_refcnt;
    PyObject *ob_type;
    Py_ssize_t ob_len;
    char ob_sval[];
};

where the ob_sval field can extend arbitrarily, as it is the last member of the struct. Of course, this will give you dynamically-sized objects (objects in C cannot really be "variable-sized", since the size of a memory block has to be defined at allocation time, and can't really change afterwards).
String syntax is not needed to support all of these things.
Ok. That's confusing in the PEP: it's not clear whether all these forms are meant to be equivalent, and, if not, which one is the most generic one, and what aspects are missing in what forms. Also, if you have a datatype which cannot be expressed in the string syntax, what is its "str" attribute? Regards, Martin
Travis Oliphant wrote:
The 'bit' type re-intprets the size information to be in units of "bits" and so implies a "bit-field" instead of another data-format.
Hmmm, okay, but now you've got another orthogonality problem, because you can't distinguish between e.g. a 5-bit signed int field and a 5-bit unsigned int field. It might be better not to consider "bit" to be a type at all, and come up with another way of indicating that the size is in bits. Perhaps

'i4'   # 4-byte signed int
'i4b'  # 4-bit signed int
'u4'   # 4-byte unsigned int
'u4b'  # 4-bit unsigned int

(Next we can have an argument about whether bit fields should be packed MSB-to-LSB or vice versa...:-) -- Greg
Travis Oliphant wrote:
I'm not sure I understand what you mean by "incomplete / recursive" types unless you are referring to something like a node where an element of the structure is a pointer to another structure of the same kind (like used in linked-lists or trees).
Yes, and more complex arrangements of types that reference each other.
If that is the case, then it's easily supported once support for pointers is added.
But it doesn't fit easily into the single-object model. -- Greg
Armin Rigo wrote:
Hi Travis,
On Fri, Oct 27, 2006 at 02:05:31PM -0600, Travis E. Oliphant wrote:
This PEP proposes adapting the data-type objects from NumPy for inclusion in standard Python, to provide a consistent and standard way to discuss the format of binary data.
How does this compare with ctypes? Do we really need yet another, incompatible way to describe C-like data structures in the standard library?
There is a lot of subtlety in the details that IMHO clouds the central issue, which I will try to clarify here the way I see it.

First of all: In order to make sense of the data-format object that I'm proposing you have to see the need to share information about data-format through an extended buffer protocol (which I will be proposing soon). I'm not going to try to argue that right now because there are a lot of people who can do that. So, I'm going to assume that you see the need for it. If you don't, then just suspend concern about that for the moment. There are a lot of us who really see the need for it.

Now: To describe data-formats ctypes uses a Python type-object defined for every data-format you might need. In my view this is an 'over-use' of the type-object and in fact, to be useful, requires the definition of a meta-type that carries the relevant additions to the type-object that are needed to describe data (like function pointers to get data in and out of Python objects). My view is that it is unnecessary to use a different type object to describe each different data-type.

The route I'm proposing is to define (in C) a *single* new Python type (called a data-format type) that carries the information needed to describe a chunk of memory. In this way *instances* of this new type define data-formats. In ctypes, *instances* of the "meta-type" (i.e. new types) define data-formats (actually I'm not sure if all the new c-types are derived from the same meta-type). So, the big difference is that I think data-formats should be *instances* of a single type. There is no need to define a Python type-object for every single data-type. In fact, not only is there no need, it makes the extended buffer protocol I'm proposing even more difficult to use and explain.

Again, my real purpose is the extended buffer protocol. This data-format type is a means to that end. If the consensus is that nobody sees a greater use of the data-format type beyond the buffer protocol, then I will just write 1 PEP for the extended buffer protocol. -Travis
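The instances-versus-types distinction is easy to see side by side (a sketch; NumPy's dtype plays the role of the proposed data-format type):

import ctypes
import numpy as np

# ctypes: "ten C longs" is a brand-new *type* object
ArrayType = ctypes.c_long * 10
print(isinstance(ArrayType, type))  # True

# NumPy: "ten 4-byte ints" is an *instance* of the single dtype type
d = np.dtype(('<i4', (10,)))
print(isinstance(d, np.dtype), isinstance(d, type))  # True False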
Greg Ewing wrote:
Travis Oliphant wrote:
The 'bit' type reinterprets the size information to be in units of "bits" and so implies a "bit-field" instead of another data-format.
Hmmm, okay, but now you've got another orthogonality problem, because you can't distinguish between e.g. a 5-bit signed int field and a 5-bit unsigned int field.
Good point.
It might be better not to consider "bit" to be a type at all, and come up with another way of indicating that the size is in bits. Perhaps
    'i4'    # 4-byte signed int
    'i4b'   # 4-bit signed int
    'u4'    # 4-byte unsigned int
    'u4b'   # 4-bit unsigned int
I like this. Very nice. I think that's the right way to look at it.
(Next we can have an argument about whether bit fields should be packed MSB-to-LSB or vice versa...:-)
I guess we need another flag / attribute to indicate that. The other thing that needs to be discussed at some point may be a way to indicate the floating-point format. I've basically punted on this and just meant 'f' to mean "platform float" Thus, you can't use the data-type object to pass information between two platforms that don't share a common floating point representation. -Travis
M.-A. Lemburg wrote:
Travis E. Oliphant wrote:
I understand and that's why I'm asking why you made the range explicit in the definition.
In the case of NumPy it was so that String and Unicode arrays would both look like multi-length string "character" arrays and not arrays of arrays of some character. But, this can change in the data-format object. I can see that the Unicode description needs to be improved.
The definition should talk about Unicode code points. The number of bytes then determines whether you can only represent the ASCII subset (1 byte), UCS2 (2 bytes, BMP only) or UCS4 (4 bytes, all currently assigned code points).
Yes, you are correct. A string of unicode characters should really be represented in the same way that an array of integers is represented for a data-format object. -Travis
Travis Oliphant wrote:
So, the big difference is that I think data-formats should be *instances* of a single type.
This is nearly the case for ctypes as well. All layout descriptions
are instances of the type type. Nearly, because they are instances
of subtypes of the type type:
py> type(ctypes.c_long)
<type '_ctypes.SimpleType'>
On 10/31/06, Travis Oliphant wrote:
In order to make sense of the data-format object that I'm proposing you have to see the need to share information about data-format through an extended buffer protocol (which I will be proposing soon). I'm not going to try to argue that right now because there are a lot of people who can do that.
So, I'm going to assume that you see the need for it. If you don't, then just suspend concern about that for the moment. There are a lot of us who really see the need for it.
[...]
Again, my real purpose is the extended buffer protocol. This data-format type is a means to that end. If the consensus is that nobody sees a greater use of the data-format type beyond the buffer protocol, then I will just write one PEP for the extended buffer protocol.
While I don't personally use NumPy, I can see where an extended buffer protocol like you describe could be advantageous, and so I'm happy to concede that benefit. I can also vaguely see that a unified "block of memory description" would be useful.

My interest would be in the area of the struct module (unpacking and packing data for dumping to byte streams; whether this happens in place or not is not too important to this use case). However, I cannot see how your proposal would help here in practice. Does it include the functionality of the struct module (or should it)? If so, then I'd like to see examples of equivalent constructs. If not, then isn't it yet another variation on the theme, adding to the problem of multiple approaches rather than helping?

I can also see the parallels with ctypes. Here I feel a little less sure that keeping the two approaches is wrong. I don't know why I feel like that (maybe nothing more than familiarity with ctypes), but I don't have the same reluctance to have both the ctypes data definition stuff and the new datatype proposal.

Enough of the abstract. As a concrete example, suppose I have a (byte) string in my program containing some binary data: an ID3 header, or a TCP packet, or whatever. It doesn't really matter. Does your proposal offer anything to me in how I might manipulate that data (assuming I'm not using NumPy)? (I'm not insisting that it should, I'm just trying to understand the scope of the PEP.)

Paul.
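[A sketch of the struct-module comparison asked for above; the struct half is runnable, while the datatype spelling in the comment follows the PEP's examples and is hypothetical:]

    import struct

    # A little-endian record: 4-byte signed int, 8-byte float, 30-byte string.
    fmt = '<id30s'
    raw = struct.pack(fmt, 42, 3.14, b'spam')
    assert struct.calcsize(fmt) == 42       # 4 + 8 + 30 bytes

    # The PEP's equivalent description would presumably read something like
    #     datatype(['i4', 'f8', 'S30'])
    # with byte order carried by a flag/attribute rather than in the string.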
It might be better not to consider "bit" to be a type at all, and come up with another way of indicating that the size is in bits. Perhaps
    'i4'    # 4-byte signed int
    'i4b'   # 4-bit signed int
    'u4'    # 4-byte unsigned int
    'u4b'   # 4-bit unsigned int
I like this. Very nice. I think that's the right way to look at it.
I remark that 'ib4' and 'ub4' make for marginally easier parsing and less danger of ambiguity. -- g
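[A sketch of the parsing point; both spellings are proposals from this thread, not an implemented format:]

    import re

    SUFFIX = re.compile(r'^([iu])(\d+)(b?)$')   # 'i4b': size first, bit marker last
    PREFIX = re.compile(r'^([iu])(b?)(\d+)$')   # 'ib4': bit marker first

    assert SUFFIX.match('i4b').groups() == ('i', '4', 'b')
    assert PREFIX.match('ib4').groups() == ('i', 'b', '4')
    # Both are regular; the 'ib4' form is the easier one for a hand-written
    # parser, since the kind is fully known before any digits are consumed.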
Martin v. Löwis wrote:
Travis Oliphant wrote:
So, the big difference is that I think data-formats should be *instances* of a single type.
This is nearly the case for ctypes as well. All layout descriptions are instances of the type type. Nearly, because they are instances of subtypes of the type type:
py> type(ctypes.c_long)
<type '_ctypes.SimpleType'>
py> type(ctypes.c_double)
<type '_ctypes.SimpleType'>
py> type(ctypes.c_double).__bases__
(<type 'type'>,)
py> type(ctypes.Structure)
<type '_ctypes.StructType'>
py> type(ctypes.Array)
<type '_ctypes.ArrayType'>
py> type(ctypes.Structure).__bases__
(<type 'type'>,)
py> type(ctypes.Array).__bases__
(<type 'type'>,)

So if your requirement is "all layout descriptions ought to have the same type", then this is (nearly) the case: they are instances of type (rather than datatype, as in your PEP).
The big difference, however, is that by going this route you are forced to use the "type object" as your data-format "instance". This is fitting a square peg into a round hole, in my opinion. To really be useful, you would need to add the attributes and (most importantly) C-function pointers and C-structure members to these type objects. I don't even think that is possible in Python (even if you do create a meta-type that all the c-type type objects can use that carries the same information).

There are a few people claiming I should use the ctypes type-hierarchy, but nobody has explained how that would be possible given the attributes, C-structure members and C-function pointers that I'm proposing.

In NumPy we also have a Python type for each basic data-format (we call them array scalars). For a little while they carried the data-format information on the Python side. This turned out to be not flexible enough. So, we expanded the PyArray_Descr * structure, which has always been a part of Numeric (and the array module's array type), into an actual Python type, and a lot of things became possible. It was clear to me that we were "on to something".

Now, the biggest claim against the gist of what I'm proposing (details we can argue about) seems from my perspective to be a desire to "go backwards" and carry data-type information around with a Python type. The data-type object did not just appear out of thin air one day. It really can be seen as an evolution from the beginnings of Numeric (and the Python array module). So, this is what we came up with in the NumPy world. Ctypes came up with something a bit different.

It is not "trivial" to "just use ctypes." I could say the same thing and tell ctypes to just use NumPy's data-type object. It could be done that way, but of course it would take a bit of work on the part of ctypes to make that happen. Having ctypes in the standard library does not mean that any other discussion of how data-formats should be represented has been decided on. If I had known that was what it meant to put ctypes in the standard library, I would have been more vocal several months ago.

-Travis
Martin v. Löwis wrote:
Travis Oliphant wrote:
Function pointers are "supported" with the void data-type and could be more specifically supported if it were important. People typically don't use the buffer protocol to send function-pointers around in a way that the void description wouldn't be enough.
As I said before, I can't tell whether it's important, as I still don't know what the purpose of this PEP is. If it is to support a unification of memory layout specifications, and if that unification is also to include ctypes, then yes, it is important. If it is to describe array elements in NumArray arrays, then it might not be important.
For the usage of ctypes, the PEP void type is insufficient to describe function pointers: you also need a specification of the signature of the function pointer (parameter types and return type), or else you can't use the function pointer (i.e. you can't call the function).
The buffer protocol is primarily meant for describing the format of (large) contiguous pieces of binary data. In most cases that will be all kinds of numerical data for scientific applications, image and other media data, simple databases and similar kinds of data. There is currently no adequate data format type which sufficiently supports these applications, otherwise Travis wouldn't make this proposal.

While Travis' proposal encompasses the data format functionality within the struct module and overlaps with what ctypes has to offer, it does not aim to replace ctypes. I don't think that a basic data format type necessarily should be able to encode all the information a foreign function interface needs to call a code library. From my point of view, that kind of information is one abstraction layer above a basic data format and should be implemented as an extension of, or complementary to, the basic data format.

I also do not understand why the data format type should attempt to fully describe arbitrarily complex data formats, like fragmented (non-contiguous) data structures in memory. You'd probably need a full programming language for that anyway.

Regards, Stephan
Travis Oliphant wrote:
The big difference, however, is that by going this route you are forced to use the "type object" as your data-format "instance".
Since everything is an object (an "instance") in Python, this is not such a big difference.
This is fitting a square peg into a round hole in my opinion. To really be useful, you would need to add the attributes and (most importantly) C-function pointers and C-structure members to these type objects.
Can you explain why that is? In the PEP, I see two C functions: setitem and getitem. I think they can be implemented readily with ctypes' GETFUNC and SETFUNC function pointers that it uses all over the place. I don't see a requirement to support C structure members or function pointers in the datatype object.
There are a few people claiming I should use the ctypes type-hierarchy but nobody has explained how that would be possible given the attributes, C-structure members and C-function pointers that I'm proposing.
Ok, here you go. Remember, I'm still not claiming that this should be done: I'm just explaining how it could be done.

- byteorder/isnative: I think this could be derived from the presence of the _swappedbytes_ field
- itemsize: can be done with ctypes.sizeof
- kind: can be created through a mapping of the _type_ field (I think)
- fields: can be derived from the _fields_ member
- hasobject: compare, recursively, with py_object
- name: use __name__
- base: again, created from _type_ (if _length_ is present)
- shape: recursively look at _length_
- alignment: use ctypes.alignment
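[A runnable sketch of a few of these derivations, using only ctypes APIs that exist (ctypes.sizeof, ctypes.alignment, _fields_); the names on the left are the PEP's attributes:]

    import ctypes

    class Point(ctypes.Structure):
        _fields_ = [('x', ctypes.c_double), ('y', ctypes.c_double)]

    itemsize  = ctypes.sizeof(Point)        # PEP 'itemsize'  -> 16
    alignment = ctypes.alignment(Point)     # PEP 'alignment' -> 8 on most platforms
    name      = Point.__name__              # PEP 'name'      -> 'Point'
    fields    = [(nm, ctypes.sizeof(tp)) for nm, tp in Point._fields_]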
It was clear to me that we were "on to something". Now, the biggest claim against the gist of what I'm proposing (details we can argue about), seems from my perspective to be a desire to "go backwards" and carry data-type information around with a Python type.
I, at least, have no such desire. I just explained that the ctypes model of memory layouts is just as expressive as the one in the PEP. Which of these is "better" for what the PEP wants to achieve, I can't say, because I still don't quite understand what the PEP wants to achieve. Regards, Martin
Stephan Tolksdorf wrote:
While Travis' proposal encompasses the data format functionality within the struct module and overlaps with what ctypes has to offer, it does not aim to replace ctypes.
This discussion could have been a lot shorter if he had said so. Unfortunately (?) he stated that it was *precisely* a motivation of the PEP to provide a standard data description machinery that can then be adopted by the struct, array, and ctypes modules.
I also do not understand why the data format type should attempt to fully describe arbitrarily complex data formats, like fragmented (non-contiguous) data structures in memory. You'd probably need a full programming language for that anyway.
For an FFI application, you need to be able to describe arbitrary in-memory formats, since that's what the foreign function will expect. For type safety and reuse, you better separate the description of the layout from the creation of the actual values. Otherwise (i.e. if you have to define the layout on each invocation), creating the parameters for a foreign function becomes very tedious and error-prone, with errors often being catastrophic (i.e. interpreter crashes). Regards, Martin
Martin v. Löwis wrote:
Travis Oliphant wrote:
The big difference, however, is that by going this route you are forced to use the "type object" as your data-format "instance".
Since everything is an object (an "instance") in Python, this is not such a big difference.
I think it actually is. Perhaps I'm wrong, but a type-object is still a special kind of an instance of a meta-type. I once tried to add function pointers to a type object by inheriting from it, but I was told that Python is not set up to handle that. Maybe I misunderstood.

Let me be very clear. The whole reason I make any statements about ctypes is because somebody else brought it up. I'm not trying to replace ctypes and the way it uses type objects to represent data internally. All I'm trying to do is come up with a way to describe data-types through a buffer protocol. The way ctypes does it is "too" bulky, defining a new Python type for every data-format.

While semantically you may talk about the equivalency of types being instances of a "meta-type" and regular objects being instances of a type, my understanding is still that there are practical differences when it comes to implementation, and certain things that "can't be done".

Here's what I mean by the difference. This is akin to what I'm proposing:

    struct {
        PyObject_HEAD
        /* whatever you need to represent your instance
           Quite a bit of flexibility.... */
    } PyDataFormatObject;

A Python type object (what every c-types data-format "type" inherits from) has a C-structure:

    struct {
        PyObject_VAR_HEAD
        char *tp_name;
        int tp_basicsize, tp_itemsize;

        /* Methods to implement standard operations */
        destructor tp_dealloc;
        printfunc tp_print;
        getattrfunc tp_getattr;
        setattrfunc tp_setattr;
        cmpfunc tp_compare;
        reprfunc tp_repr;
        ...
        ...
        PyObject *tp_bases;
        PyObject *tp_mro;        /* method resolution order */
        PyObject *tp_cache;
        PyObject *tp_subclasses;
        PyObject *tp_weaklist;
        destructor tp_del;
        ...
        /* + more under certain conditions */
    } PyTypeObject;

Why in the world do we need to carry all this extra baggage around in each data-format instance in order to just describe data? I can see why it's useful for ctypes to do it, and that's fine. But the argument that every exchange of data-format information should use this type-object instance is hard to swallow.

So, I'm happy to let ctypes continue on doing what it's doing, trusting its developers to have done something good. I'd be happy to drop any reference to ctypes. The only reason to have the data-type objects is something to pass as part of the extended buffer protocol.
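[The "extra baggage" point can be made concrete; sys.getsizeof and the figures in the comments are an editorial addition, and the exact numbers vary by platform and version:]

    import ctypes
    import numpy
    import sys

    as_type     = ctypes.c_int * 5          # one layout as a full PyTypeObject
    as_instance = numpy.dtype('i4')         # one layout as a small instance

    print(sys.getsizeof(as_type))           # commonly several hundred bytes or more
    print(sys.getsizeof(as_instance))       # commonly around a hundred bytes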
Can you explain why that is? In the PEP, I see two C functions: setitem and getitem. I think they can be implemented readily with ctypes' GETFUNC and SETFUNC function pointers that it uses all over the place.
Sure, but where do these function pointers live, and where are they stored? In ctypes it's in the CField_object. Now, this is closer to what I'm talking about. But why is it not the same thing? Why yet another type object to talk about fields of a structure? These are rhetorical questions. I really don't expect or need an answer, because I'm not questioning why ctypes did what it did for solving the problem it was solving. I am questioning anyone who claims that we should use this mechanism for describing data-formats in the extended buffer protocol.
I don't see a requirement to support C structure members or function pointers in the datatype object.
There are a few people claiming I should use the ctypes type-hierarchy but nobody has explained how that would be possible given the attributes, C-structure members and C-function pointers that I'm proposing.
Ok, here you go. Remember, I'm still not claiming that this should be done: I'm just explaining how it could be done.
O.K. Thanks for putting in the effort. It doesn't answer my real concerns, though.
It was clear to me that we were "on to something". Now, the biggest claim against the gist of what I'm proposing (details we can argue about), seems from my perspective to be a desire to "go backwards" and carry data-type information around with a Python type.
I, at least, have no such desire. I just explained that the ctypes model of memory layouts is just as expressive as the one in the PEP.
I agree with this. I'm very aware of what "can" be expressed. I just think it's too awkward and bulky to use in the extended buffer protocol.
Which of these is "better" for what the PEP wants to achieve, I can't say, because I still don't quite understand what the PEP wants to achieve.
Are you saying you still don't understand, even after having read the extended buffer protocol PEP? -Travis
Martin v. Löwis wrote:
Stephan Tolksdorf wrote:
While Travis' proposal encompasses the data format functionality within the struct module and overlaps with what ctypes has to offer, it does not aim to replace ctypes.
This discussion could have been a lot shorter if he had said so. Unfortunately (?) he stated that it was *precisely* a motivation of the PEP to provide a standard data description machinery that can then be adopted by the struct, array, and ctypes modules.
Struct and array I was sure about; ctypes less so. I'm very sorry for the distraction I caused by mis-stating my objective. My objective is really the extended buffer protocol. The data-type object is a means to that end.

I do think ctypes could make use of the data-type object, and that there is a real difference between using Python type objects as data-format descriptions and using another Python type for those descriptions. I thought to go the ctypes route (before I even knew what ctypes did) but decided against it for a number of reasons. But nonetheless, those are side issues. The purpose of the PEP is to provide an object that the extended buffer protocol can use to share data-format information. It should be considered primarily in that context.

-Travis
Travis Oliphant wrote:
I think it actually is. Perhaps I'm wrong, but a type-object is still a special kind of an instance of a meta-type. I once tried to add function pointers to a type object by inheriting from it. But, I was told that Python is not set up to handle that. Maybe I misunderstood.
I'm not quite sure what the problems are: one "obvious" problem is that the next Python version may also extend the size of type objects. But, AFAICT, even that should "work", in the sense that this new version should check for the presence of a flag to determine whether the additional fields are there. The only tricky question is how you can find out whether your own extension is there. If that is a common problem, I think a framework could be added to support extensible type objects (with some kind of registry for additional fields, and a per-type-object indicator whether a certain extension field is present).
Let me be very clear. The whole reason I make any statements about ctypes is because somebody else brought it up. I'm not trying to replace ctypes and the way it uses type objects to represent data internally.
Ok. I understood you differently earlier. Regards, Martin
On 10/31/06, Travis Oliphant wrote:
Martin v. Löwis wrote:
[...] because I still don't quite understand what the PEP wants to achieve.
Are you saying you still don't understand after having read the extended buffer protocol PEP, yet?
I can't speak for Martin, but I don't understand how I, as a Python programmer, might use the data type objects specified in the PEP. I have skimmed the extended buffer protocol PEP, but I'm conscious that no objects I currently use support the extended buffer protocol (and the PEP doesn't mention adding support to existing objects), so I don't see that as too relevant to me.

I have also installed numpy, and looked at the help for numpy.dtype, but that doesn't add much to the PEP. The freely available chapters of the numpy book explain how dtypes describe data structures, but not how to use them. The freely available Numeric documentation doesn't refer to dtypes, as far as I can tell.

Is there any documentation on how to use dtypes, independently of other features of numpy? If not, can you clarify where the benefit lies for a Python user of this proposal? (I understand the benefits of a common language for extensions to communicate datatype information, but why expose it to Python? How do Python users use it?)

This is probably all self-evident to the numpy community, but I think that as the PEP is aimed at a wider audience it needs a little more background.

Paul.
"Paul Moore"
On 10/31/06, Travis Oliphant wrote:
Martin v. Löwis wrote:
[...] because I still don't quite understand what the PEP wants to achieve.
Are you saying you still don't understand after having read the extended buffer protocol PEP, yet?
I can't speak for Martin, but I don't understand how I, as a Python programmer, might use the data type objects specified in the PEP. I have skimmed the extended buffer protocol PEP, but I'm conscious that no objects I currently use support the extended buffer protocol (and the PEP doesn't mention adding support to existing objects), so I don't see that as too relevant to me.
Presumably str in 2.x and bytes in 3.x could be extended to support the 'S' specifier, unicode in 2.x and text in 3.x could be extended to support the 'U' specifier. The various array.array variants could be extended to support all relevant specifiers, etc.
This is probably all self-evident to the numpy community, but I think that as the PEP is aimed at a wider audience it needs a little more background.
Someone correct me if I am wrong, but it allows things equivalent to the following, which is available in C, to be available in Python...

    typedef struct {
        char R;
        char G;
        char B;
        char A;
    } pixel_RGBA;

    pixel_RGBA image[1024][768];

Or even...

    typedef struct {
        long long numerator;
        unsigned long long denominator;
        double approximation;
    } rational;

    rational ratios[1024];

The real use is that after you have your array of (packed) objects, be it one of the above samples, or otherwise, you don't need to explicitly pass around specifiers (like in struct, or ctypes); numpy and others can talk to each other, pick up the specifier with the extended buffer protocol, and it just works.

- Josiah
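[Those two C layouts, written with NumPy's existing dtype, which implements the model the PEP describes; runnable today:]

    import numpy

    pixel_RGBA = numpy.dtype([('R', 'u1'), ('G', 'u1'),
                              ('B', 'u1'), ('A', 'u1')])
    image = numpy.zeros((1024, 768), dtype=pixel_RGBA)

    rational = numpy.dtype([('numerator', '<i8'),
                            ('denominator', '<u8'),
                            ('approximation', '<f8')])
    ratios = numpy.zeros(1024, dtype=rational)

    assert image['R'].shape == (1024, 768)  # per-field access, no manual unpacking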
Paul Moore wrote:
On 10/31/06, Travis Oliphant wrote:
Martin v. Löwis wrote:
[...] because I still don't quite understand what the PEP wants to achieve.
Are you saying you still don't understand after having read the extended buffer protocol PEP, yet?
I can't speak for Martin, but I don't understand how I, as a Python programmer, might use the data type objects specified in the PEP. I have skimmed the extended buffer protocol PEP, but I'm conscious that no objects I currently use support the extended buffer protocol (and the PEP doesn't mention adding support to existing objects), so I don't see that as too relevant to me.
Do you use the PIL? The PIL supports the array interface. CVXOPT supports the array interface. Numarray, Numeric, and NumPy all support the array interface.
I have also installed numpy, and looked at the help for numpy.dtype, but that doesn't add much to the PEP.
The source-code is available.
The freely available chapters of the numpy book explain how dtypes describe data structures, but not how to use them.
The freely available Numeric documentation doesn't
refer to dtypes, as far as I can tell.
It kind of does; they are PyArray_Descr * structures in Numeric. They just aren't Python objects.
Is there any documentation on how to use dtypes, independently of other features of numpy?
There are examples and other help pages at http://www.scipy.org
If not, can you clarify where the benefit lies for a Python user of this proposal? (I understand the benefits of a common language for extensions to communicate datatype information, but why expose it to Python? How do Python users use it?)
The only benefit I imagine would be for an extension module library writer and for users of the struct and array modules. But, other than that, I don't know. It actually doesn't have to be exposed to Python. I used Python notation in the PEP to explain what is basically a C-structure. I don't care if the object ever gets exposed to Python. Maybe that's part of the communication problem.
This is probably all self-evident to the numpy community, but I think that as the PEP is aimed at a wider audience it needs a little more background.
It's hard to write that background because most of what I understand is from the NumPy community. I can't give you all the examples, but my concern is that you have all these third-party libraries out there describing what is essentially binary data, and using either string-copies or the buffer protocol plus extra information obtained by some method or attribute that varies across the implementations. There should really be a standard for describing this data. There are attempts at it in the struct and array modules. There is the approach of ctypes, but I claim that using Python type objects is overkill for the purposes of describing data-formats. -Travis
The only benefit I imagine would be for an extension module library writer and for users of the struct and array modules. But, other than that, I don't know. It actually doesn't have to be exposed to Python. I used Python notation in the PEP to explain what is basically a C-structure. I don't care if the object ever gets exposed to Python.
Maybe that's part of the communication problem.
I get the impression that where ctypes is good for accessing native C libraries from within Python, the data-type object is meant to add a more direct way to share native Python objects' *data* with C (or other languages) in a more efficient way. For data that can be represented well at contiguous memory addresses, it lightens the load: instead of a list of Python objects you get an "array of data for n python_type objects", without the duplication of the Python type for every element.

I think maybe some more complete examples demonstrating how it is to be used from both Python and C would be good.

Cheers, Ron
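[Ron's duplication point, made concrete with existing NumPy; the per-record overheads in the comments are rough figures, not measurements:]

    import numpy

    n = 100000
    # n records as Python objects: each element is a distinct tuple holding
    # boxed numbers (tens of bytes of object overhead per record).
    records = [(i, float(i)) for i in range(n)]

    # One described block of memory: a single buffer plus ONE format
    # description shared by all n records (12 packed bytes per record).
    block = numpy.zeros(n, dtype=[('x', '<i4'), ('y', '<f8')])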
One thing I'm curious about in the ctypes vs this PEP debate is the following. How do the approaches differ in practice if I'm developing a library that wants to accept various image formats that all describe the same thing, RGB data? Let's say for now all I want to support is two different image formats whose pixels are described in C structs by:

    struct rgb565 {
        unsigned short r:5;
        unsigned short g:6;
        unsigned short b:5;
    };

    struct rgb101210 {
        unsigned int r:10;
        unsigned int g:12;
        unsigned int b:10;
    };

Basically, in my code I want to be able to take the binary data descriptor and say "give me the 'r' field of this pixel as an integer". Is either one (the PEP or ctypes) clearly easier to use in this case? What would the code look like for handling both formats generically?

--bb
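[For the ctypes half of this question, bit fields exist today, spelled as a third element in _fields_; the PEP side has no settled spelling for bit widths yet, per the 'i4b' discussion above:]

    import ctypes

    class rgb565(ctypes.Structure):
        _fields_ = [('r', ctypes.c_ushort, 5),
                    ('g', ctypes.c_ushort, 6),
                    ('b', ctypes.c_ushort, 5)]

    class rgb101210(ctypes.Structure):
        _fields_ = [('r', ctypes.c_uint, 10),
                    ('g', ctypes.c_uint, 12),
                    ('b', ctypes.c_uint, 10)]

    def red(pixel):
        # Generic across both formats: attribute access does the unpacking.
        return pixel.r

    p = rgb565()
    p.r = 17
    assert red(p) == 17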
Bill Baxter wrote:
Basically in my code I want to be able to take the binary data descriptor and say "give me the 'r' field of this pixel as an integer".
Is either one (the PEP or c-types) clearly easier to use in this case? What would the code look like for handling both formats generically?
The PEP, as specified, does not support accessing individual fields from Python. OTOH, ctypes, as implemented, does. This comparison is not fair, though: an *implementation* of the PEP (say, NumPy) might also give you Python-level access to the fields. With the PEP, you can get access to the 'r' field from C code. Performing this access is quite tedious; as I'm uncertain whether you actually wanted to see C code, I refrain from trying to formulate it. Regards, Martin
Martin v. Löwis wrote:
Bill Baxter wrote:
Basically in my code I want to be able to take the binary data descriptor and say "give me the 'r' field of this pixel as an integer".
Is either one (the PEP or c-types) clearly easier to use in this case? What would the code look like for handling both formats generically?
The PEP, as specified, does not support accessing individual fields from Python. OTOH, ctypes, as implemented, does. This comparison is not fair, though: an *implementation* of the PEP (say, NumPy) might also give you Python-level access to the fields.
I see. So at the Python-user convenience level it's pretty much a wash. Are there significant differences in memory usage and/or performance? ctypes sounds like it's more heavyweight, from the discussion. If I have a lot of image formats I want to support, is that going to mean lots of overhead with ctypes? Do I pay for it whether or not I actually end up having to handle an image in a given format?
With the PEP, you can get access to the 'r' field from C code. Performing this access is quite tedious; as I'm uncertain whether you actually wanted to see C code, I refrain from trying to formulate it.
Actually this is more what I was after. I've written C code to interface with NumPy arrays and found it to be not so bad. But the data I was passing around was just a plain N-dimensional array of doubles. Very basic.

It *sounds* like what Travis is saying is that handling a less simple case, like the one above of supporting a variety of RGB image formats, would be easier with the PEP than with ctypes. Or maybe it's generating the data in my C code that's trickier, as opposed to consuming it? I'm just trying to understand what the deal is, and at the same time perhaps inject a more concrete example into the discussion.

Travis has said several times that working with ctypes, which requires a Python type per 'element', is more complicated from the C side, and I'd like to see more concretely how so, as someone who may end up needing to write such code. And I'm OK without seeing the actual code if someone can actually answer my question. The question is not whether it is tedious or not (everything about the Python C API is tedious from what I've seen). The question is which is *more* tedious, and how significant the difference in tediousness is to the person whose job it is to actually write the code.

--bb
Paul Moore wrote:
Enough of the abstract. As a concrete example, suppose I have a (byte) string in my program containing some binary data - an ID3 header, or a TCP packet, or whatever. It doesn't really matter. Does your proposal offer anything to me in how I might manipulate that data (assuming I'm not using NumPy)? (I'm not insisting that it should, I'm just trying to understand the scope of the PEP).
What do you mean by "manipulate the data"? The proposal for a data-format object would help you describe that data in a standard way, and therefore share that data between several libraries that would all be able to understand the data (because they all use and/or understand the default Python way to handle data-formats). It would be up to the other packages to "manipulate" the data.

So, what you would be able to do is take your byte-string and create a buffer object which you could then share with other packages.

Example:

    b = buffer(bytestr, format=data_format_object)

Now:

    a = numpy.frombuffer(b)
    a['field1']   # prints data stored in the field named "field1"

etc. Or:

    cobj = ctypes.frombuffer(b)
    # Now, cobj is a ctypes object that is basically a "structure"
    # that can be passed directly to your C-code.

Does this help?

-Travis
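[The format= argument to buffer() above is the proposed extension and does not exist; the NumPy half of the example is runnable today if the format is handed over manually:]

    import numpy

    # Three records of ('<i4', '<f8'): the int 42 followed by the float 3.125.
    bytestr = b'\x2a\x00\x00\x00\x00\x00\x00\x00\x00\x00\x09\x40' * 3
    dt = numpy.dtype([('field1', '<i4'), ('field2', '<f8')])

    a = numpy.frombuffer(bytestr, dtype=dt)
    print(a['field1'])                      # -> [42 42 42]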
On 11/2/06, Travis Oliphant wrote:
What do you mean by "manipulate the data." The proposal for a data-format object would help you describe that data in a standard way and therefore share that data between several library that would be able to understand the data (because they all use and/or understand the default Python way to handle data-formats).
It would be up to the other packages to "manipulate" the data.
Yes, some other messages I read since I posted this clarified it for me. Essentially, as a Python programmer, there's nothing in the PEP for me - it's for extension writers (and maybe writers of some lower-level Python modules? I'm not sure about this). So as I'm not really the target audience, I won't comment further.
So, what you would be able to do is take your byte-string and create a buffer object which you could then share with other packages:
Example:
b = buffer(bytestr, format=data_format_object)
Now.
a = numpy.frombuffer(b)
a['field1']   # prints data stored in the field named "field1"
etc.
Or.
cobj = ctypes.frombuffer(b)
# Now, cobj is a ctypes object that is basically a "structure"
# that can be passed directly to your C-code.
Does this help?
Somewhat. My understanding is that the Python-level buffer object is frowned upon as not good practice, and is scheduled for removal at some point (Py3K, quite possibly?). Hence, any code that uses buffer() feels like it "needs" to be replaced by something "more acceptable". So although I understand the use you suggest, it's not compelling to me, because I am left with the feeling that I wish I knew "the way to do it that didn't need the buffer object" (even though I realise intellectually that such a way may not exist). Paul.
Paul Moore wrote:
Somewhat. My understanding is that the python-level buffer object is frowned upon as not good practice, and is scheduled for removal at some point (Py3K, quite possibly?) Hence, any code that uses buffer() feels like it "needs" to be replaced by something "more acceptable".
The Python 2.x buffer object serves two distinct purposes. First, it is a "mutable string" object, and this is definitely not going away; it is being replaced by the bytes object. (Interestingly, this functionality is not exposed to Python, but C extension modules can call PyBuffer_New(size) to create a buffer.) Second, it is a "view" into any object supporting the buffer protocol. For a while this usage was indeed frowned upon because buffer objects held the pointer obtained from bf_get*buffer for too long, causing memory errors in situations like this:
a = array('c', "x"*10) b = buffer(a, 5, 2) a.extend('x'*1000) str(b) 'xx'
This problem was fixed more than two years ago:

------
r35400 | nascheme | 2004-03-10
Make buffer objects based on mutable objects (like array) safe.
------

Even though it was suggested in the past that the buffer *object* should be deprecated as unsafe, I don't remember seeing a call to deprecate the buffer protocol.
So although I understand the use you suggest, it's not compelling to me because I am left with the feeling that I wish I knew "the way to do it that didn't need the buffer object" (even though I realise intellectually that such a way may not exist).
As I explained in another post, I used the buffer object as an example of an object that supports the buffer protocol but does not export type information in a form usable by numpy. Here is another way to illustrate the problem:
>>> a = numpy.array(array.array('H', [1,2,3]))
>>> b = numpy.array([1,2,3], dtype='H')
>>> a.dtype == b.dtype
False
With the extended buffer protocol, it will be possible for numpy.array(..) to realize that array.array('H', [1,2,3]) is a sequence of unsigned short integers and convert it accordingly. Currently numpy has to go through the sequence protocol to create a numpy.array from an array.array, and lose the type information.
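[Today the type information must be re-supplied by hand, e.g. via frombuffer; the point of the extended protocol is to make the explicit dtype argument unnecessary:]

    import array
    import numpy

    a = array.array('H', [1, 2, 3])
    b = numpy.frombuffer(a, dtype='H')      # works only because 'H' is repeated
    assert b.dtype == numpy.dtype('H')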
Travis Oliphant wrote:
Paul Moore wrote:
Enough of the abstract. As a concrete example, suppose I have a (byte) string in my program containing some binary data - an ID3 header, or a TCP packet, or whatever. It doesn't really matter. Does your proposal offer anything to me in how I might manipulate that data (assuming I'm not using NumPy)? (I'm not insisting that it should, I'm just trying to understand the scope of the PEP).
What do you mean by "manipulate the data." The proposal for a data-format object would help you describe that data in a standard way and therefore share that data between several library that would be able to understand the data (because they all use and/or understand the default Python way to handle data-formats).
Perhaps the most relevant thing to pull from this conversation is back to what Martin has asked about before: "flexible array members". A TCP packet has no defined length (there isn't even a header field in the packet for this, so in fairness we can talk about IP packets, which do). There is no way for me to describe this with the pre-PEP data-formats.

I feel like it is misleading of you to say "it's up to the package to do manipulations," because you glossed over the fact that you can't even describe this type of data. ISTM that you're only interested in describing repetitious fixed-structure arrays. If we are going to have a "default Python way to handle data-formats", then don't you feel like this falls short of the mark?

I fear that you speak about this in too grandiose terms and are now trapped by people asking, "well, can I do this?" I think for a lot of folks the answer is: "nope." With respect to the network packets, this PEP doesn't do anything to fix the communication barrier. Is this not in the scope of "a consistent and standard way to discuss the format of binary data" (which is what your PEP's abstract sets out as the task)?

-- Scott Dial scott@scottdial.com scodial@cs.indiana.edu
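[Scott's point, made concrete: an IPv4 header's length depends on a field inside the data, so extracting it takes code, not a static descriptor; a minimal sketch:]

    def ipv4_header_len(packet):
        # IHL is the low nibble of the first byte, counted in 32-bit words;
        # no fixed-layout descriptor can express this data dependency.
        ihl = packet[0] & 0x0F
        return ihl * 4

    sample = bytes([0x45]) + bytes(19)      # version 4, IHL 5: a 20-byte header
    assert ipv4_header_len(sample) == 20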
Perhaps the most relevant thing to pull from this conversation is back to what Martin has asked about before: "flexible array members". A TCP packet has no defined length (there isn't even a header field in the packet for this, so in fairness we can talk about IP packets which do). There is no way for me to describe this with the pre-PEP data-formats.
I feel like it is misleading of you to say "it's up to the package to do manipulations," because you glossed over the fact that you can't even describe this type of data. ISTM that you're only interested in describing repetitious fixed-structure arrays.
Yes, that's right. I'm only interested in describing binary data with a fixed length. Others can help push it farther than that (if they even care).
If we are going to have a "default Python way to handle data-formats", then don't you feel like this falls short of the mark?

Not for me. We can fix what needs fixing, but not if we can't get out of the gate.
I fear that you speak about this in too grandiose terms and are now trapped by people asking, "well, can I do this?" I think for a lot of folks the answer is: "nope." With respect to the network packets, this PEP doesn't do anything to fix the communication barrier.
Yes, it could, if you were interested in pushing it there. No, I didn't solve that particular problem with the PEP (because I can only solve the problems I'm aware of), but I do think the problem could be solved. We have far too many nay-sayers on this list, I think.

Right now, I don't have time to push this further. My real interest is the extended buffer protocol. I want something that works for that. When I do have time again to discuss it, I might come back and push some more. But not now.

-Travis
participants (20)

- "Martin v. Löwis"
- Alexander Belopolsky
- Armin Rigo
- Bill Baxter
- Diez B. Roggisch
- Fredrik Lundh
- Gareth McCaughan
- Greg Ewing
- Jack Jansen
- Josiah Carlson
- M.-A. Lemburg
- Neal Becker
- Nick Coghlan
- Paul Moore
- Robert Kern
- Ron Adam
- Scott Dial
- Stephan Tolksdorf
- Travis E. Oliphant
- Travis Oliphant