
Christopher Hanley wrote:
Hi Travis,
About a year ago (summer 2004) on the numpy distribution list there was a lot of discussion of the records interface. I will dig through my notes and put together a summary.
Thanks for the pointers. I had forgotten about that discussion. I went back and re-read the thread.
Here's a good link for others to re-read (the end of) this thread:
http://news.gmane.org/find-root.php?message_id=%3cBD22BAC0.E9EB%25perry%40st...
I think some very good points were made. These points should be addressed from the context of scipy arrays which now support records in a very basic way. Because of this, we can support nested records of records --- but how this is to be presented to the user is still an open question (i.e. how do you build one...)
I've finally been converted to believe that the notion of records is very important because it speaks of how to do the basic (typeless, mathless) array object that will go into Python correctly. If we can get the general records type done right, then all the other types are examples of it.
Thus, I would like to revive discussion of the record object for inclusion in scipy core. I pretty much agree with the semantics that Perry described in his final email (is this all implemented in numarray, yet?), except I would agree with Francesc Alted that a titles or labels concept should be allowed. I'm more enthusiastic about code than discussion, so I'm hoping for a short-lived discussion followed by actual code. I'm ready to do the implementation this week (I've already borrowed lots of great code from numarray which makes it easier), but feel free to chime in even if you read this later.
In my mind, the discussion about the records array is primarily a discussion about the records data-type. The way I'm thinking, the scipy ndarray is a homogeneous collection of the same "thing." The big change in scipy core is that Numeric used to allow only certain data types, but now the ndarray can contain an arbitrary "void" data type. You can also add data-types to scipy core. These data-types are "almost" full members of the scipy data-type community. The "almost" is because the N*N casting matrix is not updated (this would require a re-design of how casting is considered). At some point, I'd like to fix this wart and make it so that data-types can be added at will -- I think if we get the record type right, I'll be able to figure out how to do this.
We need to add a "record" data-type to scipy. Then, any array can be of "record" type, and there will be an additional "array scalar" that is what is returned when selecting a single element from the array. So, a record array would simply be an array of "records" plus some extra stuff for dealing with the mapping from field names to actual segments of the array element (we may decide that this mapping is general enough that all scipy arrays should have the capability of assigning names to sub-bytes of its main data-type and means of accessing those sub-bytes, in which case the subclass is unnecessary).
Let me explain further: Right now, the machinery is in place in scipy_core to get and set in any ndarray (regardless of its data-type) an arbitrary "field". A "field" in this context is defined as a sub-section of the basic element making up the array. Generically the sub-section is defined by an offset and a data-type or a tuple of a data type and a shape (to allow sub-arrays in a record). What I understand the user to want is the binding of a name to this generic sub-section descriptor.
1) Should we allow that for every scipy ndarray: complex data types have an obvious binding, would anybody want to name the first two bytes of their int32 array? I suggest holding off on this one until a records array is working....
2) Supposing we don't go with number 1, we need to design a record data type that has this name-binding capability.
The recarray class in scipy core SVN essentially just does this.
Question: How important is backwards compatibility with the old numarray specification? In particular, I would go with the .fields access described by Perry, and eliminate the .field() approach?
Thanks for reading and any comments you can make.
-Travis
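The idea of a "field" as an offset plus a data-type (or a data-type plus a shape) can be sketched in present-day NumPy, the eventual successor of scipy core. This is only an illustration under that assumption: the API under discussion in this thread was not yet settled, and the names lo, hi, id and xyz are made up for the example.

```python
import numpy as np

# Two little-endian int32 elements; the bytes of the first are 04 03 02 01.
a = np.array([0x01020304, 0x0A0B0C0D], dtype="<i4")

# A "field" given only by (data-type, byte offset), with no name attached.
low = a.getfield(np.dtype("<i2"), 0)    # bytes 0-1 of every element
high = a.getfield(np.dtype("<i2"), 2)   # bytes 2-3 of every element

# Binding (hypothetical) names to the same two sub-sections of each element.
halves = np.dtype({"names": ["lo", "hi"],
                   "formats": ["<i2", "<i2"],
                   "offsets": [0, 2]})
named = a.view(halves)

# A field may also be a (data-type, shape) pair: a sub-array in each record.
point = np.dtype([("id", "<i4"), ("xyz", "<f8", (3,))])
points = np.zeros(2, dtype=point)
```

Here named["lo"] and low describe the same bytes; the only difference is whether a name has been bound to the (data-type, offset) descriptor.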

Travis Oliphant wrote:
Thus, I would like to revive discussion of the record object for inclusion in scipy core. I pretty much agree with the semantics that Perry described in his final email (is this all implemented in numarray, yet?),
No, it was only talk to date, with plans to implement it, but other things have drawn our work up to now.
Question: How important is backwards compatibility with the old numarray specification? In particular, I would go with the .fields access described by Perry, and eliminate the .field() approach?
For us, probably not critical since we have to do some rewriting anyway. (But it would be nice to retain for a while as deprecated). But what about field names that don't map well to attributes? I haven't had a chance to reread the past emails but I seem to recall this was a significant issue. That would imply that .field() would be needed for those cases anyway. Perry

Perry Greenfield wrote:
Travis Oliphant wrote:
Question: How important is backwards compatibility with the old numarray specification? In particular, I would go with the .fields access described by Perry, and eliminate the .field() approach?
For us, probably not critical since we have to do some rewriting anyway. (But it would be nice to retain for a while as deprecated).
But what about field names that don't map well to attributes? I haven't had a chance to reread the past emails but I seem to recall this was a significant issue. That would imply that .field() would be needed for those cases anyway.
I haven't read the above thread extensively, but the issue of field names that don't map well to attributes is significant. For example, users of pytables often have columns with names that are not valid Python names. So, regardless of what solution is the most obvious, there should at least be a way to get at such field names. (pytables users are used to doing that.) Cheers! Andrew
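The concern about column names that are not valid Python names can be made concrete with a small sketch. It assumes present-day NumPy and its recarray class rather than the interface being designed in this thread, and the table contents are invented for illustration.

```python
import numpy as np

# Hypothetical table with one column name that is not a valid identifier.
people = np.array([("12 Elm St", 41), ("9 Oak Ave", 35)],
                  dtype=[("home address", "U20"), ("age", "<i4")])

rec = people.view(np.recarray)

ages = rec.age                     # attribute access works for 'age'
addresses = rec["home address"]    # only a string lookup can reach this one

# Field names that would always need the string-based route.
needs_string_lookup = [n for n in people.dtype.names if not n.isidentifier()]
```

Whatever attribute syntax is chosen, some string-keyed access path has to remain for names like 'home address'.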

Perry Greenfield wrote:
For us, probably not critical since we have to do some rewriting anyway. (But it would be nice to retain for a while as deprecated).
Easy enough to do by defining an actual record array (however, see below). I've been retaining backwards compatibility in other ways while not documenting it. For example, you can actually now pass in strings like 'Int32' for types.
But what about field names that don't map well to attributes? I haven't had a chance to reread the past emails but I seem to recall this was a significant issue. That would imply that .field() would be needed for those cases anyway.
What I'm referring to as the solution here is a slight modification to what Perry described. In other words, all arrays have the attribute .fields
You can set this attribute to a dictionary which will automagically give field names to any array (this dictionary has ordered lists of 'names', (optionally) 'titles', and "(data-descr, [offset])" lists which define the mapping). If offset is not given, then the "next-available" offset is assumed. The data-descr is either 1) a data-type or 2) a tuple of (data-type, shape). The data-type is either a defined data-type or alias, or an object with a .fields attribute that provides the same dictionary and an .itemsize attribute that computes the total size of the data-type.
You can get this attribute, which returns a special fields object (written in Python initially like the flags attribute) that can look up field names like a dictionary, or with attribute access for names that are either 1) acceptable or 2) have a user-provided "python-name" associated with them. Thus, .fields['home address'] would always work but .fields.hmaddr would only work if the user had previously made the association hmaddr -> 'home address' for the data type of this array. Thus 'home address' would be a title but hmaddr would be the name.
The records module would simply provide functions for making record arrays and a record data type. Driving my thinking is the concept that the notion of a record array is really a description of the data type of the array (not the array itself). Thus, all the fields information should really just be part of the data type itself. Now, I don't really want to create and register a new data type every time somebody has a new record layout.
So, I've been re-thinking the notion of "registering a data-type". It seems to me that while it's O.K. to have a set of pre-defined data types, the notion of data-type ought to be flexible enough to allow the user to define one "on-the-fly". I'm thinking of ways to do this right now. Any suggestions are welcome.
-Travis
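A rough sketch of the names/titles/offset dictionary described above, written against present-day NumPy's dtype constructor. That choice is an assumption made for illustration: the .fields object proposed in this message is not what is shown, and the 'age' field and its title are invented.

```python
import numpy as np

# Hypothetical layout: 'hmaddr' is the Python-friendly name,
# 'home address' is its free-form title.
person = np.dtype({"names": ["hmaddr", "age"],
                   "formats": ["U20", "<i4"],
                   "titles": ["home address", "age in years"]})

# No offsets were given, so each field takes the next available offset.
offsets = {name: person.fields[name][1] for name in person.names}

# The fields mapping answers to either the name or the title.
by_name = person.fields["hmaddr"]
by_title = person.fields["home address"]

# The data-type is defined at run time and handed straight to an array,
# without any separate registration step.
table = np.zeros(3, dtype=person)
```

The point of the sketch is that the field information lives entirely in the data-type object, so the array itself needs nothing beyond a reference to it.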

Travis Oliphant wrote:
So, I've been re-thinking the notion of "registering a data-type". It seems to me that while it's O.K. to have a set of pre-defined data types, the notion of data-type ought to be flexible enough to allow the user to define one "on-the-fly". I'm thinking of ways to do this right now. Any suggestions are welcome.
I'm doing that in an application I'm developing. My objects have an attribute called '_schema' that is an instance of Zope InterfaceClass. An object (read "record" ;) is assigned a _schema when it is instantiated, and all information about its attributes (a.k.a. "fields") is contained in the _schema's Properties (my 'Property' subtypes the Zope interfaces 'Attribute' type, and has a host of (meta-)attributes like 'domain', 'range', 'id', 'name', etc.). These could easily be extended to include things like 'title', but I use another mechanism for display characteristics, called 'DisplayMap', which can be used to specify the order in which you want the object's properties to appear in a grid, what you want their "display names" to be, etc., all of which are also customizable by the end-user. Let me know if this sounds interesting. Cheers, Steve

Stephen Waterbury wrote:
Travis Oliphant wrote:
So, I've been re-thinking the notion of "registering a data-type". It seems to me that while it's O.K. to have a set of pre-defined data types, the notion of data-type ought to be flexible enough to allow the user to define one "on-the-fly". I'm thinking of ways to do this right now. Any suggestions are welcome.
I'm doing that in an application I'm developing. My objects have an attribute called '_schema' that is an instance of Zope InterfaceClass. An object (read "record" ;) is assigned a _schema when it is instantiated, and all information about its attributes (a.k.a. "fields") is contained in the _schema's Properties (my 'Property' subtypes the Zope interfaces 'Attribute' type, and has a host of (meta-)attributes like 'domain', 'range', 'id', 'name', etc. -- which could easily be extended to include things like 'title', but I use another mechanism for display characteristics, called 'DisplayMap', which can be used to specify the order in which you want the object's properties to appear in a grid, what you want their "display names" to be, etc. ... which are also customizable by the end-user.
Let me know if this sounds interesting.
Cheers, Steve
This goes further than my suggestion. For arrays, it seems to me that an additional pointer to _schema is not needed, as there is a pointer to the data type and the data type can contain the meta data. Colin W.

Colin J. Williams wrote:
Stephen Waterbury wrote:
I'm doing that in an application I'm developing. My objects have an attribute called '_schema' that is an instance of Zope InterfaceClass. An object (read "record" ;) is assigned a _schema when it is instantiated, and all information about its attributes (a.k.a. "fields") is contained in the _schema's Properties (my 'Property' subtypes the Zope interfaces 'Attribute' type, and has a host of (meta-)attributes like 'domain', 'range', 'id', 'name', etc. -- which could easily be extended to include things like 'title', but I use another mechanism for display characteristics, called 'DisplayMap', which can be used to specify the order in which you want the object's properties to appear in a grid, what you want their "display names" to be, etc. ... which are also customizable by the end-user.
This goes further than my suggestion. For arrays, it seems to me that an additional pointer to _schema is not needed, as there is a pointer to the data type and the data type can contain the meta data.
In effect, the _schema *is* the data type in my scenario. It contains all information necessary to describe the type of the object and has references to the types of all the object's attributes (which are called "fields" in record parlance, "Properties" in the world of ontologies, and "Attributes" in Zope Interfaces and UML terminology). Steve

Travis Oliphant wrote:
Perry Greenfield wrote:
For us, probably not critical since we have to do some rewriting anyway. (But it would be nice to retain for a while as deprecated).
Easy enough to do by defining an actual record array (however, see below). I've been retaining backwards compatibility in other ways while not documenting it. For example, you can actually now pass in strings like 'Int32' for types.
But what about field names that don't map well to attributes? I haven't had a chance to reread the past emails but I seem to recall this was a significant issue. That would imply that .field() would be needed for those cases anyway.
What I'm referring to as the solution here is a slight modification to what Perry described. In other words, all arrays have the attribute
.fields
What I suggested in my posting was that there is no need and no benefit from the .fields attribute. The base class Record could be organized so that certain attributes which are used in arrays are not acceptable. For example, one would probably wish to avoid shape, size and the other attributes of the basic array but attributes associated with arrays with numeric types would probably not need to be barred.
You can set this attribute to a dictionary which will automagically give field names to any array (this dictionary has ordered lists of 'names', (optionally) 'titles', and "(data-descr, [offset])" lists which define the mapping). If offset is not given, then the "next-available" offset is assumed. The data-descr is either 1) a data-type or 2) a tuple of (data-type, shape). The data-type is either a defined data-type or alias, or an object with a .fields attribute that provides the same dictionary and an .itemsize attribute that computes the total size of the data-type.
I wonder about the need for explicit dictionary operations. Can't this be handled through the class structure?
You can get this attribute which returns a special fields object (written in Python initially like the flags attribute) that can look up field names like a dictionary, or with attribute access for names that are either 1) acceptable or 2) have a user-provided "python-name" associated with them. Thus,
.fields['home address']
would always work
but
.fields.hmaddr
would only work if the user had previously made the association hmaddr -> 'home address' for the data type of this array. Thus 'home address' would be a title but hmaddr would be the name.
The records module would simply provide functions for making record arrays and a record data type. Driving my thinking is the concept that the notion of a record array is really a description of the data type of the array (not the array itself). Thus, all the fields information should really just be part of the data type itself. Now, I don't really want to create and register a new data type every time somebody has a new record layout.
A record array is an array which has a record as its data element, in the same way that an integer array has an integer as its element. I don't understand the notion of registering a data type. Presumably an integer array has a pointer to the appropriate type of integer. Could the record array not have a pointer to the appropriate record type?
So, I've been re-thinking the notion of "registering a data-type". It seems to me that while it's O.K. to have a set of pre-defined data types, the notion of data-type ought to be flexible enough to allow the user to define one "on-the-fly". I'm thinking of ways to do this right now. Any suggestions are welcome.
The record types would be created "on-the-fly" as the class is instantiated. The array, through the dtype parameter, would point to the record type.
-Travis
Colin W.

Travis Oliphant wrote:
Christopher Hanley wrote:
Hi Travis,
About a year ago (summer 2004) on the numpy distribution list there was a lot of discussion of the records interface. I will dig through my notes and put together a summary.
Thanks for the pointers. I had forgotten about that discussion. I went back and re-read the thread.
Here's a good link for others to re-read (the end of) this thread:
http://news.gmane.org/find-root.php?message_id=%3cBD22BAC0.E9EB%25perry%40st...
I think some very good points were made. These points should be addressed from the context of scipy arrays which now support records in a very basic way. Because of this, we can support nested records of records --- but how this is to be presented to the user is still an open question (i.e. how do you build one...)
I've finally been converted to believe that the notion of records is very important because it speaks of how to do the basic (typeless, mathless) array object that will go into Python correctly. If we can get the general records type done right, then all the other types are examples of it.
Thus, I would like to revive discussion of the record object for inclusion in scipy core. I pretty much agree with the semantics that Perry described in his final email (is this all implemented in numarray, yet?), except I would agree with Francesc Alted that a titles or labels concept should be allowed. I'm more enthusiastic about code than discussion, so I'm hoping for a short-lived discussion followed by actual code. I'm ready to do the implementation this week (I've already borrowed lots of great code from numarray which makes it easier), but feel free to chime in even if you read this later.
In my mind, the discussion about the records array is primarily a discussion about the records data-type. The way I'm thinking, the scipy ndarray is a homogeneous collection of the same "thing." The big change in scipy core is that Numeric used to allow only certain data types, but now the ndarray can contain an arbitrary "void" data type. You can also add data-types to scipy core. These data-types are "almost" full members of the scipy data-type community. The "almost" is because the N*N casting matrix is not updated (this would require a re-design of how casting is considered). At some point, I'd like to fix this wart and make it so that data-types can be added at will -- I think if we get the record type right, I'll be able to figure out how to do this.
We need to add a "record" data-type to scipy. Then, any array can be of "record" type, and there will be an additional "array scalar" that is what is returned when selecting a single element from the array. So, a record array would simply be an array of "records" plus some extra stuff for dealing with the mapping from field names to actual segments of the array element (we may decide that this mapping is general enough that all scipy arrays should have the capability of assigning names to sub-bytes of its main data-type and means of accessing those sub-bytes, in which case the subclass is unnecessary).
Let me explain further: Right now, the machinery is in place in scipy_core to get and set in any ndarray (regardless of its data-type) an arbitrary "field". A "field" in this context is defined as a sub-section of the basic element making up the array. Generically the sub-section is defined by an offset and a data-type or a tuple of a data type and a shape (to allow sub-arrays in a record). What I understand the user to want is the binding of a name to this generic sub-section descriptor.
1) Should we allow that for every scipy ndarray: complex data types have an obvious binding, would anybody want to name the first two bytes of their int32 array? I suggest holding off on this one until a records array is working....
2) Supposing we don't go with number 1, we need to design a record data type that has this name-binding capability.
The recarray class in scipy core SVN essentially just does this.
Question: How important is backwards compatibility with the old numarray specification? In particular, I would go with the .fields access described by Perry, and eliminate the .field() approach?
I feel that it is not particularly important. Having a good design is the thing to shoot for.
Thanks for reading and any comments you can make.
-Travis
I'm not clear as to what the current design objective is and so I'll try to recap and perhaps expand my pieces in the referenced discussion to set out the sort of arrangement I would like to see.
We are moving towards having a multi-dimensional array which can hold objects of fixed size and type, the smallest being one byte (although the void would appear to be a collection of no size objects). Most of the need, and thus the focus, is on numeric objects, ranging in size from Int8 to Complex64.
The Record is a fixed size object containing fields. Each field has a name, an optional title and data of a fixed type (perhaps including another record instance and maybe arrays of fixed size?).
In the example below, AddressRecord and PersonRecord would be sub-classes of Record where the fields are named and, optionally, field titles given. The names would be consistent with Python naming whereas the title could be any Python string.
The use of attributes raises the possibility that one could have nested records. For example, suppose one has an address record:
addressRecord
    streetNumber
    streetName
    postalCode
    ...
There could then be a personal record:
personRecord
    ...
    officeAddress
    homeAddress
    ...
One could address a component as rec.homeAddress.postalCode.
Suppose one has a (n, n) array of persons then one could access the data in the following ways:
persons[1]                                  all records in the second row
persons[:,1]                                all records in the second column
persons[1, 1]                               return a specific person record
persons[1, 1].homeAddress                   the home address record for a specific person
persons[1, 1].homeAddress.postalCode        the postal code for a specific person
persons.homeAddress.postalCode              an (n, n) array containing all postal codes
persons.homeAddress.postalCode.title        could be 'Zip Code'
I see no need to have the attribute 'field' and would like to avoid the use of strings to identify a record component. This does require that fields be named as Python identifiers but is this restriction a killer?
Colin W.
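The nested address/person layout above can be approximated in present-day NumPy. This is a sketch under that assumption, not the Record sub-class design proposed in this message; the field widths, the fullName field and the sample postal code are invented, and the .title access on a component is not shown.

```python
import numpy as np

# Hypothetical field types; the message above only names the fields.
address = np.dtype([("streetNumber", "<i4"),
                    ("streetName", "U30"),
                    ("postalCode", "U10")])
person = np.dtype([("fullName", "U30"),
                   ("officeAddress", address),
                   ("homeAddress", address)])

n = 3
plain = np.zeros((n, n), dtype=person)
plain["homeAddress"]["postalCode"][1, 1] = "K1A 0B1"

# Viewing the same data as a recarray adds attribute access to fields.
persons = plain.view(np.recarray)

second_row = persons[1]                           # all records in the second row
second_col = persons[:, 1]                        # all records in the second column
one_person = persons[1, 1]                        # a specific person record
one_code = persons[1, 1].homeAddress.postalCode   # one postal code
all_codes = persons.homeAddress.postalCode        # (n, n) array of postal codes

# The same selection with string keys only, without attribute access.
same_codes = plain["homeAddress"]["postalCode"]
```

Both spellings reach the same bytes; the attribute form is syntactic sugar over the nested data-type.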

I'm not clear as to what the current design objective is and so I'll try to recap and perhaps expand my pieces in the referenced discussion to set out the sort of arrangement I would like to see.
I have two objectives:
1) Make the core scipy array object flexible enough to support a very good records sub-class. In other words, I wonder if the core scipy array object could be made flexible enough to be used as a decent record array by itself, without adding much difficulty. In the process, I'm really trying to understand how the data-type of an array should be generally considered. An array object that has this generic perspective on data-type is what should go into Python, I believe.
2) Make a (more) useful records subclass of the ndarray object that is perhaps easier for the end-user to use. Involved with this, of course, is making functions that make it easy to create a records sub-class.
We are moving towards having a multi-dimensional array which can hold objects of fixed size and type, the smallest being one byte (although the void would appear to be a collection of no size objects). Most of the need, and thus the focus, is on numeric objects, ranging in size from Int8 to Complex64.
The Record is a fixed size object containing fields. Each field has a name, an optional title and data of a fixed type (perhaps including another record instance and maybe arrays of fixed size?).
Right, so the record is really another kind of data-type. The concept of the multidimensional array does not need adjustment, but the concept of what constitutes a data-type may need some fixing up.
In the example below, AddressRecord and PersonRecord would be sub-classes of Record where the fields are named and, optionally, field titles given. The names would be consistent with Python naming whereas the title could be any Python string.
I like the notion of titles and names. I think they are both useful.
The use of attributes raises the possibility that one could have nested records. For example, suppose one has an address record:
Now, I'm in favor of attribute access. But nested records are possible without attribute access (it's just extra typing). It's the underlying structure that provides the possibility for nested records (and that's what I'm trying to figure out how to support, generally). If I can support this generally in the basic ndarray object by augmenting the notion of data-type as appropriate, then making a subclass that has the nice syntactic sugar is easy. So, there are two issues really. 1) How to think about the data-type of a general ndarray object in order to support nested records in a straightforward way. 2) What should the syntactic sugar of a record array subclass be... I suppose a third is 3) How much of the syntactic sugar should be applied to all ndarrays? -Travis
I see no need to have the attribute 'field' and would like to avoid the use of strings to identify a record component. This does require that fields be named as Python identifiers but is this restriction a killer?
For a record array subclass that may be true. But, as was raised by others in the previous thread, there are problems of "name-space" collision with the methods and attributes of the array that would prevent certain names from being used (and any additions to the methods and attributes of the array would cause incompatibilities with some people's records). But, at this point, I like the readability of the attribute access approach and could support it. -Travis
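The name-space collision problem can be illustrated with a small check of which field names could safely be exposed as attributes. It is a sketch that assumes present-day NumPy's ndarray as the class whose attributes must not be shadowed; the helper usable_as_attribute and the candidate names are hypothetical.

```python
import keyword
import numpy as np


def usable_as_attribute(name):
    """Return True if a field name could be reached as array.<name>."""
    return (name.isidentifier()                 # e.g. no spaces
            and not keyword.iskeyword(name)     # e.g. not 'class'
            and not hasattr(np.ndarray, name))  # e.g. not 'shape' or 'size'


candidates = ["postalCode", "home address", "shape", "size", "class"]
report = {name: usable_as_attribute(name) for name in candidates}
```

Note that the answer depends on the array class itself: any method or attribute later added to the array would turn a previously usable field name into a collision, which is the incompatibility described above.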
participants (6): Andrew Straw, Colin J. Williams, Perry Greenfield, Stephen Waterbury, Travis Oliphant, Travis Oliphant