Proposed record array behavior: the rest of the story
We now turn to the behavior of Records. Many of the current proposals had been considered in the past but were not implemented, partly out of a 'wait and see' attitude towards what was really necessary and partly out of a desire to avoid too many ways of doing the same thing before there was a real call for them.

This proposal deals with the behavior of record array 'items', i.e., what we now call Record objects. The primary issues that have been raised with regard to Record behavior are:

1) Items should be tuples instead of Records.
2) Items should be objects, but present tuple and/or dictionary consistent behavior.
3) Field (or column) names should be accessible as Record (and record array) attributes.

Issue 1: Should record array items be tuples instead of Records?

Francesc Alted made this suggestion recently. Essentially the argument is that tuples are a natural way of representing records. Unfortunately, tuples do not provide a means of accessing the fields of a record by name, only by number. For this reason alone, tuples don't appear to be adequate. Francesc proposed allowing dictionary-like indexing of record arrays to give access to tuple entries by field name. However, it seems that if rarr is a record array, then both rarr['column 1'][2] and rarr[2]['column 1'] should work, not just the former. So the short answer is "No".

It should be noted that using tuples would force another change in current behavior. The current Record objects are actually views into the record array: changing a value within a Record object changes the record array. Tuples won't allow that, since tuples are not mutable. If single elements of record arrays were set from and returned as tuples, a record could only be changed by replacing it in its entirety.

But his comments (as well as those of others) do point out a number of problems with the current implementation that could be improved, and making the Record object support tuple behaviors is quite reasonable. Hence:

Issue 2: Should record array items present tuple and/or dictionary compatible behaviors?

The short answer is yes, we agree that they should. This includes many of the proposals made, including:

1) supporting all tuple capabilities with the following differences:

a) fields are mutable (unlike tuple items) so long as the assigned value is coercible to the expected type. For example, the current methods of doing so are:
cell = oneRec.field(1)
oneRec.setfield(1, newValue)
This proposal would allow:
cell = oneRec[1]
oneRec[1] = newValue
b) slice assignments are permitted so long as they don't change the size of the record (i.e., no insertion of extra items) and the items can be assigned as permitted for a). E.g., oneRec[2:4] = (3, 'abc')

c) __str__ will result in a display looking like that for tuples; __repr__ will show a Record constructor:

>>> print oneRec # as is currently implemented
(1.1, 2, 'abc', 3)
>>> oneRec
Record((1.1, 2, 'abc', 3), formats=['1Float32', '1Int16', '1a3', '1Int32'], names=['abc', 'c2', 'xyz', 'c4'])

(note that how best to handle formats is still being thought about)

2) supporting all dictionary capabilities with the following differences:

a) keys and items are ordered.
b) keys are restricted to being integers or strings only.
c) new keys cannot be dynamically added or deleted as for dictionaries.
d) no support for any other dictionary capabilities that can change the number or names of items.
e) __str__ will not show a result looking like a dictionary (see 1c).
f) values must meet the Record object's required type (or be coercible to it).

For example, the current

cell = oneRec.field('c2')
oneRec.setfield('c2', newValue)
And the proposed added indexing capability:
cell = oneRec['c2']
oneRec['c2'] = newValue
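The tuple-like and dictionary-like behaviors described under Issue 2 can be sketched in a few lines of modern Python. This is only an illustration under stated assumptions, not numarray's implementation: the class below, its constructor arguments, and the use of plain Python lists as column storage are all hypothetical stand-ins, and coercion is approximated by calling a Python type on the assigned value.

```python
class Record:
    """Toy stand-in for a record array item: a mutable view onto one row.

    Hypothetical sketch; `columns` is a list of per-field Python lists that
    plays the role of the record array's storage.
    """

    def __init__(self, columns, names, types, row):
        self._columns = columns
        self._names = list(names)
        self._types = list(types)
        self._row = row

    def _positions(self, key):
        """Translate an int, a field name, or a slice into field positions."""
        if isinstance(key, str):
            if key not in self._names:
                raise KeyError(key)
            return [self._names.index(key)]
        all_positions = range(len(self._names))
        if isinstance(key, slice):
            return list(all_positions[key])
        return [all_positions[key]]  # raises IndexError when out of range

    def __len__(self):
        return len(self._names)

    def __getitem__(self, key):
        values = tuple(self._columns[i][self._row] for i in self._positions(key))
        return values if isinstance(key, slice) else values[0]

    def __setitem__(self, key, value):
        positions = self._positions(key)
        new_values = tuple(value) if isinstance(key, slice) else (value,)
        if len(new_values) != len(positions):
            raise ValueError("assignment may not change the number of fields")
        # Coerce everything first so a failed coercion leaves the row untouched.
        coerced = [self._types[i](v) for i, v in zip(positions, new_values)]
        for i, v in zip(positions, coerced):
            self._columns[i][self._row] = v  # writes through to shared storage

    def keys(self):
        return list(self._names)  # ordered, fixed set of keys

    def __str__(self):
        return str(tuple(self))  # displays like a tuple


columns = [[1.1, 5.5], [2, 6], ["abc", "def"], [3, 7]]
oneRec = Record(columns, ["abc", "c2", "xyz", "c4"], [float, int, str, int], 0)
oneRec["c2"] = 9          # by name
oneRec[2:4] = ("xy", 4)   # slice assignment of the same length
print(oneRec)
```

Because the Record holds a reference to the shared column storage rather than a copy, assignments through it change the underlying data, which is the view behavior that plain tuples could not provide.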
Issue 3: Field (or column) names should be accessible as Record (and record array) attributes.

As much as the attribute approach has appeal for simple usage, the problems of name collisions and of mismatches between acceptable field names and acceptable attribute names strike us, as they do Russell Owen, as very problematic. The technique of using a special attribute that contains the field name attributes, as Francesc suggests (in his case, cols), solves the name collision problem but not the legality issue. Particularly with regard to illegal characters, it's hard to imagine easily remembered mappings between legal attribute representations and the actual field names. We are inclined to pass (for now anyway) on mapping fields to attributes in any way. It seems to us that indexing by name should be convenient enough, as well as flexible enough to satisfy all needs. It is needed in any case, since attributes are a clumsy way to access a field when a variable specifies the field (yes, one can use getattr(), but it's clumsy).

*******************************************

Record array behavior changes:

1) It will be possible to assign any sequence to a record array item so long as the sequence contains the right number of fields, and each item of the sequence can be coerced to what the record array expects for the corresponding field of the record (addressing numarray feature request 928473 by Russell Owen). I.e.,
recArr[1] = (2, 3.2, 'xyz', 3)
2) One may assign a record to a record array so long as the record matches the record format of the record array (current behavior).

3) Easier construction and initialization of recarrays with default field values (as requested in numarray bug report 928479).

4) Support for lists of field names and formats as detailed in numarray bug report 928488.

5) Field name indexing for record arrays. It will be possible to index record arrays with a field name, i.e., if the index is a string, then what will be returned is a numarray/chararray for that column. (Note that it won't be possible to index record arrays by field number for obvious reasons.) I.e., currently

col = recArr.field('abc')
Can also be
col = recArr['abc']
But the current
col = recArr.field(1)
Cannot become
col = recArr[1]
On the other hand, it will not be permitted to mix a field index with an array index in the same brackets, e.g., rarr[10, 'column 2'] will not be supported.

Allowing indexing to have two different interpretations is a bit worrying. But if record array items may be indexed in this manner, it seems natural to permit the same indexing for the record array. Mixing the two kinds of indexing in one index seems of limited usefulness in the first place, and it makes inheriting the existing indexing machinery for NDArrays more complicated (any efficiency gains from avoiding the intermediate object creation of two separate index operations will likely be offset by the slowness of handling much more complicated mixed indices). Perhaps someone can argue for why mixing field indices with array indices is important, but for now we will prohibit this mode of indexing.

This does point to a possible enhancement for field indexing, namely being able to provide the equivalent of index arrays (e.g., a list of field names) to generate a new record array with a subset of fields.

Are there any other issues that should be addressed for improving record arrays?
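As a rough illustration of the record array changes listed in this message, the following sketch models sequence assignment with coercion (item 1), string indexing that returns a column (item 5), the rejection of mixed indices, and the suggested list-of-field-names enhancement. It is a toy under stated assumptions: the class name ToyRecArray, its constructor, and the list-based storage are hypothetical, and calling a Python type stands in for numarray's coercion rules, which would be stricter.

```python
class ToyRecArray:
    """Hypothetical list-backed model of the proposed record array indexing."""

    def __init__(self, names, types, rows=()):
        self.names = list(names)
        self.types = list(types)
        self.columns = [[] for _ in self.names]
        for row in rows:
            for column, value in zip(self.columns, self._coerce(row)):
                column.append(value)

    def _coerce(self, row):
        """Check the field count, then coerce each item to its field type."""
        row = tuple(row)
        if len(row) != len(self.names):
            raise ValueError(
                "expected %d fields, got %d" % (len(self.names), len(row)))
        return [convert(value) for convert, value in zip(self.types, row)]

    def __getitem__(self, index):
        if isinstance(index, str):
            # A string index selects a whole column.
            return self.columns[self.names.index(index)]
        if isinstance(index, list) and index and all(
                isinstance(name, str) for name in index):
            # A list of field names builds a new array with a subset of fields
            # (a copy in this sketch, not a view).
            picks = [self.names.index(name) for name in index]
            rows = zip(*(self.columns[i] for i in picks))
            return ToyRecArray([self.names[i] for i in picks],
                               [self.types[i] for i in picks], rows)
        if isinstance(index, tuple):
            raise TypeError("mixing array and field indices is not supported")
        return tuple(column[index] for column in self.columns)

    def __setitem__(self, index, row):
        if not isinstance(index, int):
            raise TypeError("only single-record assignment is sketched here")
        values = self._coerce(row)  # validate before touching storage
        for column, value in zip(self.columns, values):
            column[index] = value


recArr = ToyRecArray(["abc", "c2", "xyz", "c4"], [int, float, str, int],
                     [(1, 1.5, "a", 2), (3, 2.5, "b", 4)])
recArr[1] = (2, 3.2, "xyz", 3)   # any sequence of the right length
col = recArr["abc"]              # field name indexing
subset = recArr[["c2", "xyz"]]   # subset of fields
```

Keeping the string case and the integer case in separate branches, and refusing tuples outright, mirrors the argument above that mixed indices would complicate the indexing machinery for little gain.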
At 12:04 PM -0400 2004-07-20, Perry Greenfield wrote:
...(a detailed summary of proposed changes to numarray record arrays)
+1 on all of it with one exception noted below. This sounds like a first-rate overhaul and is much appreciated. Will it be possible, when creating a new records array, to specify types of a record array as a list of normal numarray types? Currently one has to specify the types as a "formats" string, which is nonstandard. I'm unhappy about one proposal:
... Record array behavior changes: ... 5) Field name indexing for record arrays. It will be possible to index record arrays with a field name, i.e., if the index is a string, then what will be returned is a numarray/chararray for that column. (Note that it won't be possible to index record arrays by field number for obvious reasons).
I.e. Currently
col = recArr.field('abc')
Can also be
col = recArr['abc']
But the current
col = recArr.field(1)
Cannot become
col = recArr[1]
I think recarray[field name] is too easily confused with recarray[index] and is unnecessary. I suggest one of two solutions:

- Do nothing. Make users use field(field name or index), or
- Allow access to the fields via an indexable entity. Simplest for the user would be to use "field" itself:
recArr.field[1]
recArr.field["abc"]
(i.e. field becomes an object that can be called or can be accessed via __getitem__)

This could easily support index arrays (a topic you brought up and that sounds appealing to me):
recArr.field[index array]
and it might even be practical to support:
recArr.field[sequence of field indices and/or names]
e.g. recArr.field[(ind 1, field name 2, ind 3...)]

You asked about other issues. One that comes to mind is record arrays of record arrays. Should they be allowed? My gut reaction is yes if it's not too hard. Folks always seem to find a use for generality if it's offered. On the other hand, if it's hard, it's not worth the effort.

If they are allowed, users are going to want some efficient way to get to a particular field (i.e. in one call even if the field is several recArrays deep). That could get messy.

Thanks for a great posting. The improvements to record arrays sound first-rate.

-- Russell

Hi,

I agree that the numarray team's overhaul of RecArray access modes is very good and I agree with most of it.

On Tuesday 20 July 2004 19:14, Russell E Owen wrote:
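Russell's second option, a "field" entity that is both callable and indexable, might look roughly like the sketch below. The class names FieldAccessor and RecArrayWithField are hypothetical, the columns are plain Python lists, and only the access patterns named in the message are modeled.

```python
class FieldAccessor:
    """Hypothetical object behind `recArr.field`: callable and indexable."""

    def __init__(self, names, columns):
        self._names = list(names)
        self._columns = list(columns)

    def _position(self, key):
        if isinstance(key, str):
            return self._names.index(key)       # field name
        return range(len(self._names))[key]     # field number

    def __call__(self, key):
        # Existing spelling: recArr.field('abc') or recArr.field(1)
        return self._columns[self._position(key)]

    def __getitem__(self, key):
        # Proposed spelling: recArr.field['abc'] or recArr.field[1]
        if isinstance(key, (list, tuple)):
            # A sequence of indices and/or names selects several columns.
            return [self(k) for k in key]
        return self(key)


class RecArrayWithField:
    """Minimal holder that exposes the accessor as an attribute."""

    def __init__(self, names, columns):
        self.field = FieldAccessor(names, columns)


recArr = RecArrayWithField(["abc", "c2"], [[1, 2], [3.0, 4.0]])
same_column = recArr.field(1) is recArr.field[1]
both = recArr.field[(0, "c2")]
```

Because the accessor defines __call__, code written against the current field() method would keep working, while recArr[index] stays reserved for array indexing.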
I think recarray[field name] is too easily confused with recarray[index] and is unnecessary.
Yeah, maybe you are right.
I suggest one of two solutions: - Do nothing. Make users use field(field name or index) or - Allow access to the fields via an indexable entity. Simplest for the user would be to use "field" itself: recArr.field[1] recArr.field["abc"] (i.e. field becomes an object that can be called or can be accessed via __getitem__)
I prefer the second one. Although I know that you don't like the __getattr__ method, the field object can be used to host one. The main advantage I see in having such a __getattr__ method is that I'm very used to pressing TAB twice in the Python console with its completion capabilities activated. It would be a very nice way of interactively discovering the fields of a RecArray object. I don't know whether this feature is used a lot out there, but for me it is just great. I understand, however, that having to include a map to support field names that are not valid Python names can be quite inconvenient.

Regards,

-- Francesc Alted
Perry's issue 3.

Perhaps there is a need to separate the name or identifier of a column in a RecArray, or of a field in a Record, from its label. The labels, for display purposes, would default to the column names. The column names would default, as at present, to the Cn form.

I like the use of attributes for the column names; it avoids the problem Russell Owen mentioned above. Suppose we have a simple RecArray with the fields "name" and "age": it's much simpler to write rec.name or rec.age than rec["name"] or rec["age"].

The problems with the use of attributes, which must be Python names, are (1) they cannot have accented or special characters, e.g. é, ç, @, &, * etc., and (2) there is a danger of conflict with existing properties or attributes.

My guess is that the special characters would be required primarily for display purposes. Thus, the label could meet that need. The danger of conflict could be addressed by raising an exception. There remains a possible problem where identifiers are passed on from some other system, perhaps a database.

Thus, the primary identifier of a row in a RecArray would be an integer index and that of a column or field would be a standard Python identifier. Although, at times, it would be useful to be able to index the individual fields (or columns) as part of the usual indexing scheme. Thus rec[2, 3, 4] could identify a record and rec[2, 3, 4].age or rec[2, 3, 4, 5] could identify the sixth field in that record.

The use of attributes raises the possibility that one could have nested records. For example, suppose one has an address record:

addressRecord
  streetNumber
  streetName
  postalCode
  ...

There could then be a personal record:

personRecord
  ...
  officeAddress
  homeAddress
  ...

One could address a component as rec.homeAddress.postalCode.

Finally, there was mention, earlier in the discussion, of facilitating the indexing of a RecArray. I hope that some way will be found to do this.

Colin W.
I'll try to see if I can address all the comments raised (please let me know if I missed something).

1) Russell Owen asked that indexing by field name not be permitted for record arrays, and at least one other agreed. Since it is easier to add something like this later than to take it away, I'll go along with that. So while it will be possible to index a Record by field name, it won't be for record arrays.

2) Russell asked if it would be possible to specify the types of the fields using numarray/chararray type objects. Yes, it will. We will adopt Rick White's 2nd suggestion for handling fields that themselves are arrays, i.e.,

formats = (3,Int16), ((4,5), Float32)

for a 1-d Int16 cell of shape (3,) and a 2-d Float32 cell of shape (4,5). The first suggestion ("formats = 3*(Int16,), 4*(5*(Float32,),)") will not be supported. While it is very suggestive, it allows inconsistent nestings that must be checked and rejected (what if someone supplies (Int16, Int16, Float32) as one of the fields?), which complicates the code. It also doesn't read as well.

3) Russell also suggested nesting record arrays. This sort of capability is not being ruled out, but there isn't a chance we can devote resources to this any time soon (can anyone else?).

4) To address the suggestions of Russell and Francesc, I'm proposing that the current "field" method become an object (callable to retain backward compatibility) that supports:

a) indexing by name or number (just like Records)
b) name to attribute mapping (with restrictions).

So this means 3 ways to do things! As far as attribute access goes, I simply do not want to throw arbitrary attributes into the main object itself. The use of field is comparatively clean since it has no other public attributes. Aside from mapping '_' to spaces, no other illegal attribute characters will be mapped. (The identifier/label suggestion by Colin Williams has some merit, but on the whole, I think it brings more baggage than benefit.)

The mapping algorithm tries to map the attribute to any field name that has either a ' ' or a '_' in the place of each '_' in the attribute name. While a field name with '_' in all of those places will take precedence over any other match, there is no guaranteed order for the other cases (e.g., 'x_y z' vs 'x y_z' vs 'x y z'; though 'x_y_z' would be guaranteed to be selected for field.x_y_z if present).

Note that the only real need to support indexing, other than consistency, is to support slices. Only slices for numerical indexing will be supported (and not initially). The callable syntax can support index arrays just as easily.

To summarize,

Rarr.field.home_address
Rarr.field['home address']
Rarr.field('home address')

will all work for a field named "home address".

************************************************

Any comments on these changes to the proposal? Are there those that are opposed to supporting attribute access?

Thanks, Perry
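The attribute-to-field mapping rule described in this message can be written down as a small function. This is a sketch of the rule as stated in the message, not code from numarray: the function name resolve_field_attribute is hypothetical, and where the message leaves the choice among several inexact matches unspecified, the sketch simply returns the first one it encounters.

```python
def resolve_field_attribute(attr, field_names):
    """Map an attribute name to a field name (hypothetical helper).

    An exact match always wins. Otherwise a field name matches when it equals
    the attribute name except that some '_' positions hold a space instead.
    """
    if attr in field_names:
        return attr  # e.g. 'x_y_z' is guaranteed to be chosen for field.x_y_z
    for name in field_names:
        if len(name) == len(attr) and all(
                a == c or (a == "_" and c == " ")
                for a, c in zip(attr, name)):
            return name  # first inexact match; order is not guaranteed
    raise AttributeError(attr)


resolved = resolve_field_attribute("home_address", ["age", "home address"])
```

A field object could call such a helper from __getattr__; the ambiguity among 'x_y z', 'x y_z', and 'x y z' is exactly the case where the result depends on iteration order.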
At 11:43 AM -0400 2004-07-26, Perry Greenfield wrote:
I'll try to see if I can address all the comments raised (please let me know if I missed something). ...(nice proposal elided)... Any comments on these changes to the proposal? Are there those that are opposed to supporting attribute access?
Overall this sounds great.

However, I am still strongly against attribute access.

Attributes are usually meant for names that are intrinsic to the design of an object, not to the user's "configuration" of the object. The name mapping proposal isn't bad (thank you for keeping it simple!), but it still feels like a kludge and it adds unnecessary clutter.

Your explanation of the limitations was clear, but still, imagine putting that into the manual. It's a lot of "be careful of this" info. That's a red flag to me. Imagine all the folks who don't read carefully. Also imagine those who consider attribute access "the right way to do it" and so want to clean up the limitations. I think you'll see a steady stream of:
"why can't I see my field..."
"why can't you solve the collision problems"
"why can't I use special character thus and so"

I personally feel that when a feature is hard to document or adds strange limitations then it probably suggests a flawed design.

In this case there is another mechanism that is more natural, has no funny corner cases, and is much more powerful. Its only disadvantage is the need to type 4 extra characters. Saving 4 characters is simply not sufficient reason to add this dubious feature.

Before implementing attribute access I have two suggestions (which can be taken singly or together):
- Postpone the decision until after the rest of the proposal is implemented. See if folks are happy with the mechanisms that are available. I freely confess to hoping that momentum will then kill the idea.
- Discuss it on comp.lang.py. I'd like to see it aired more widely before being adopted. So far I've seen just a few voices for it and a few others against it.

I realize it's not a democracy -- those who write the code get the final say. I also realize some folks will always want it, but that tension between simplicity and expressiveness is intrinsic to any language.
If you add everything anybody wants you get a mess, and I want to avoid this mess while we still can. I hope nobody takes offense. I certainly did not mean to imply that those who wish attribute access are inferior in any way. There are features of python I wish it had that will never occur. I honestly can see the appeal of attributes; I was in favor of them myself, early on. It adds an appealing expressiveness that makes some kind of code read more naturally. But I personally feel it has too many limitations and is unnecessary. Regards, -- Russell
That pretty much sums up my opinion. :) -- Paul -- Paul Barrett, PhD Space Telescope Science Institute Phone: 410-338-4475 ESS/Science Software Branch FAX: 410-338-4767 Baltimore, MD 21218
On Monday 26 July 2004 18:38, Russell E Owen wrote:
In this case there is another mechanism that is more natural, has no
Well, I guess that depends on what you understand as "natural". For example, for me the "natural" way is adding attributes. However, I must recognize that my point of view could be biased, because attributes are far more advantageous in the context of large hierarchies of objects where you have to specify the complete path to get somewhere. This is typical of software that handles XML documents or any other kind of hierarchical data organization system. For a relatively plain structure like RecArray I can understand that this can be regarded as unnecessary. But nevertheless, its adoption continues to sound appealing to me. Anyway, I'd be happy with whatever decision is made regarding field attribute adoption.
I hope nobody takes offense. I certainly did not mean to imply that
Not at all. Discussing is a good (the best?) way to learn more :) -- Francesc Alted
On Mon, 26 Jul 2004, Russell E Owen wrote:
Overall this sounds great.
However, I am still strongly against attribute access.
[...]
In this case there is another mechanism that is more natural, has no funny corner cases, and is much more powerful. Its only disadvantage is the need to type 4 extra characters. Saving 4 characters is simply not sufficient reason to add this dubious feature.

I am sympathetic with Russell's point of view on this, but I do think there is more to gain than just avoiding typing 4 additional characters. When you read code that is using the dictionary version of attributes, you also are required to read and mentally parse those 4 additional characters. There is value to having clean, easily readable code that goes well beyond saving a little extra typing. If we didn't care about that, we'd probably all be using Perl. :-)

Also, I like to use tab-completion during my interactive use of Python. I know how to make that work with attributes, even dynamically created attributes like those for record arrays. And it is really nice to be able to type <tab> and have it fill in a name or give a list of all the available columns. Doing that with the string/dictionary approach could be possible, I guess, but it is a lot trickier.

So I do think there are some good reasons for wanting attribute access. Whether they are strong enough to counter Russell's sensible arguments about not cluttering up the interface and documentation, I'm not sure.

My personal preference would be to get rid of the mapping between blanks and underscores and to do no mapping of any kind. Then if a column has a name that is a legal Python variable name, you can access it with an attribute, and if it doesn't then you can't. That doesn't sound particularly hard to understand or explain to me.

Rick
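Rick's no-mapping rule, together with the tab-completion point, can be sketched as follows. The class name FieldNamespace is hypothetical, the example is written in modern Python (str.isidentifier and the keyword module) rather than the Python of the time, and completion is modeled only through __dir__, which is what interactive completers typically consult.

```python
import keyword


class FieldNamespace:
    """Hypothetical field object: attribute access only for legal names."""

    def __init__(self, names, columns):
        self._columns = dict(zip(names, columns))

    def __getitem__(self, name):
        # Indexing by name works for every field, legal identifier or not.
        return self._columns[name]

    def _attribute_names(self):
        return [name for name in self._columns
                if name.isidentifier() and not keyword.iskeyword(name)]

    def __getattr__(self, name):
        # Only called when normal lookup fails, so real attributes and
        # methods of the object always win over field names.
        columns = self.__dict__.get("_columns", {})
        if name in columns and name.isidentifier() and not keyword.iskeyword(name):
            return columns[name]
        raise AttributeError(name)

    def __dir__(self):
        # Expose legal field names so completion tools can list them.
        return sorted(set(super().__dir__()) | set(self._attribute_names()))


fields = FieldNamespace(["age", "home address", "class"],
                        [[30, 41], ["x", "y"], [1, 2]])
ages = fields.age                    # legal name: attribute access works
homes = fields["home address"]       # illegal name: indexing only
```

Under this rule nothing is renamed or mapped, so the documentation reduces to one sentence: a field is reachable as an attribute only if its name is already a legal Python name that does not clash with an existing attribute.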
Russell E Owen wrote:
Attributes are usually meant for names that are intrinsic to the design of an object, not to the user's "configuration" of the object.
Russell, I hope that you will elaborate on this distinction between design and usage. On the face of it, I would have thought that the two should be closely related.
Before implementing attribute access I have two suggestions (which can be taken singly or together): - Postpone the decision until after the rest of the proposal is implemented. See if folks are happy with the mechanisms that are available. I freely confess to hoping that momentum will then kill the idea. - Discuss it on comp.lang.py. I'd like to see it aired more widely before being adopted. So far I've seen just a few voices for it and a few others against it. I realize it's not a democracy -- those who write the code get the final say. I also realize some folks will always want it, but that tension between simplicity and expressiveness is intrinsic to any language. If you add everything anybody wants you get a mess, and I want to avoid this mess while we still can.
There is merit to this suggestion. It would expose the proposal to other experiences.
Perry Greenfield summarized:

Rarr.field.home_address
Rarr.field['home address']
Rarr.field('home address')

will all work for a field named "home address".

This is good, it gives the desired functionality.

One minor suggestion. We have Rarr.X.home_address; I believe that, in an earlier posting, someone suggested that X.home_address really identifies a column rather than a field.

Suppose that home_address is field number 6 in the record. Would Rarr.field[6] be equivalent to the above? This may appear redundant, but it gives a method for selecting a group of columns, e.g. Rarr.field[6:9].

Finally, would Rarr.field.home_address.city or Rarr.field.work_address.city be legitimate?

As Russell Owen pointed out, at the end of the day Perry Greenfield will use his judgement as to the best arrangement and we will all live with it.

Colin W.
At 5:41 PM -0400 2004-07-26, Colin J. Williams wrote:
Russell, I hope that you will elaborate on this distinction between design and usage. On the face of it, I would have thought that the two should be closely related.

To my mind, the design of an object describes the intended behavior of the object: what kind of data can it deal with and what should it do to that data. It tends to be "static" in the sense that it is not a function of how the object is created or what data is contained in the object. The design of the object usually drives the choice of the attributes of the object (variables and methods).

On the other hand, the user's "configuration" of the object is what the user has done to make a particular instance of an object unique -- the data the user has loaded into the object.

I consider the particular named fields of a record array to fall into the latter category. But it is a gray area. Somebody else might argue that the record array constructor is an object factory, turning out an object designed by the user. From that alternative perspective, adding attributes to represent field names is perhaps more natural as a design.

I think the main issues are:
- Are there too many ways to address things? (I say yes)
- Field name mapping: there is no trivial 1:1 mapping between valid field names and valid attribute names.
- Nested access. Not sure about this one, but I'd like to hear more.

If we do end up with attributes for field names, I really like Rick White's suggestion of adding an attribute for a field only if the field name is already a valid attribute name. That neatly avoids the collision issue and is simple to document.

-- Russell
Russell E Owen wrote:
At 5:41 PM -0400 2004-07-26, Colin J. Williams wrote:
Russell E Owen wrote:
At 11:43 AM -0400 2004-07-26, Perry Greenfield wrote:
I'll try to see if I can address all the comments raised (please let me know if I missed something). ...(nice proposal elided)... Any comments on these changes to the proposal? Are there those that are opposed to supporting attribute access?
Overall this sounds great.
However, I am still strongly against attribute access.
Attributes are usually meant for names that are intrinsic to the design of an object, not to the user's "configuration" of the object.
Russell, I hope that you will elaborate on this distinction between design and usage. On the face of it, I would have thought that the two should be closely related.
To my mind, the design of an object describes the intended behavior of the object: what kind of data can it deal with and what should it do to that data. It tends to be "static" in the sense that it is not a function of how the object is created or what data is contained in the object. The design of the object usually drives the choice of the attributes of the object (variables and methods).
On the other hand, the user's "configuration" of the object is what the user has done to make a particular instance of an object unique -- the data the user has loaded into the object.
I consider the particular named fields of a record array to fall into the latter category. But it is a gray area. Somebody else might argue that the record array constructor is an object factory, turning out an object designed by the user. From that alternative perspective, adding attributes to represent field names is perhaps more natural as a design.
I think the main issues are:
- Are there too many ways to address things? (I say yes)
This could be true. I guess the test is whether there is a rational justification for each way.
- Field name mapping: there is no trivial 1:1 mapping between valid field names and valid attribute names.
If one starts with the assumption that field/attribute names are compatible with Python names, then I don't see that this is a problem. The question has been raised as to whether a wider range of names should be permitted, e.g., including such characters as ~`()!çéë. My view is that such characters should be considered acceptable for data labels, but not for data names, i.e., they are for display, not for manipulation.
- Nested access. Not sure about this one, but I'd like to hear more.
A RecArray is made up of a number of records, each of the same length and data configuration. Each field of a record is of fixed length and type. It wouldn't be a big leap to permit another record in one of the fields. Suppose we have an address record aRec and a personnel record pRec, and that rArr is an array of pRec.

    aRec
        street:      a30
        city:        a20
        postalCode:  a7

    pRec
        id:          i4
        firstName:   a15
        lastName:    a20
        homeAddress: aRec
        workAddress: aRec

Then rArr[16].homeAddress.city could give us the home city for person 16 in rArr.
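To make the nested-access idea concrete, here is a minimal sketch in modern Python. It does not use numarray at all: plain dataclasses stand in for the aRec and pRec layouts, a list stands in for rArr, and the helper make_person and all the data values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ARec:
    street: str
    city: str
    postalCode: str


@dataclass
class PRec:
    id: int
    firstName: str
    lastName: str
    homeAddress: ARec  # a record nested inside a record
    workAddress: ARec


def make_person(n):
    # Made-up data, used only to populate the example array.
    home = ARec("%d Home St" % n, "HomeCity%d" % n, "H0H 0H0")
    work = ARec("%d Work Ave" % n, "WorkCity%d" % n, "W0W 0W0")
    return PRec(n, "First%d" % n, "Last%d" % n, home, work)


rArr = [make_person(n) for n in range(20)]

# Row first, then field, then nested field: read from left to right.
print(rArr[16].homeAddress.city)
```

A real nested record array would store fixed-length fields in one buffer; the sketch only shows the access pattern being discussed.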
If we do end up with attributes for field names, I really like Rick White's suggestion of adding an attribute for a field only if the field name is already a valid attribute name. That neatly avoids the collision issue and is simple to document.
-- Russell
Best wishes, Colin W.
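Rick White's rule, as quoted above, can be sketched in a few lines. This is only an illustration in modern Python (str.isidentifier did not exist in 2004); the function name attribute_fields and the reserved parameter are assumptions, not part of any numarray API.

```python
import keyword


def attribute_fields(field_names, reserved=()):
    """Return the field names that may safely be exposed as attributes.

    A name qualifies only if it is already a legal Python identifier,
    is not a keyword, and does not collide with a name in `reserved`
    (for example, existing methods of the object).
    """
    reserved = set(reserved)
    return [
        name
        for name in field_names
        if name.isidentifier()
        and not keyword.iskeyword(name)
        and name not in reserved
    ]


names = ["intensity", "home address", "class", "field", "x_y_z"]
exposed = attribute_fields(names, reserved={"field"})
print(exposed)
```

Fields that are filtered out stay reachable by name through indexing or the field method; they simply get no attribute.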
A Dimarts 27 Juliol 2004 20:21, Colin J. Williams va escriure:
If one starts with the assumption that field/attribute names are compatible with Python names, then I don't see that this is a problem. The question has been raised as to whether a wider range of names should be permitted, e.g., including such characters as ~`()!çéë. My view is that such characters should be considered acceptable for data labels, but not for data names, i.e., they are for display, not for manipulation.
I finally was able to see your point. You mean that naming a field with a non-Python identifier would be forbidden, and that another attribute (like 'title', for example) would be provided in case the user wants to add some kind of data label. Something like:

    records.array([...], names=["c1","c2","c3"], titles=["F one","time&dime","çò"])

and have a new attribute called "titles" that keeps this info.

Well, I think that would be a very nice solution IMO.

-- Francesc Alted
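A rough sketch of the names/titles split might look as follows. The helper check_names_and_titles is hypothetical and written in modern Python; it is not the records.array constructor, and it only shows the rule that names must be legal identifiers while titles are free-form display labels.

```python
import keyword


def check_names_and_titles(names, titles=None):
    """Validate field names and pair each one with a display title.

    Hypothetical helper: names must be legal Python identifiers, while
    titles are free-form labels used only for display.
    """
    if titles is None:
        titles = list(names)  # fall back to the name as its own label
    if len(names) != len(titles):
        raise ValueError("names and titles must have the same length")
    for name in names:
        if not name.isidentifier() or keyword.iskeyword(name):
            raise ValueError("illegal field name: %r" % (name,))
    return dict(zip(names, titles))


labels = check_names_and_titles(
    ["c1", "c2", "c3"], titles=["F one", "time&dime", "çò"]
)
print(labels["c2"])
```

With this split, every field can be reached as an attribute, and a label such as "time&dime" never needs to be mapped to an attribute name.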
On Tue, 27 Jul 2004 20:46:52 +0200, Francesc Alted wrote
A Dimarts 27 Juliol 2004 20:21, Colin J. Williams va escriure:
If one starts with the assumption that field/attribute names are compatible with Python names, then I don't see that this is a problem. The question has been raised as to whether a wider range of names should be permitted, e.g., including such characters as ~`()!çéë. My view is that such characters should be considered acceptable for data labels, but not for data names, i.e., they are for display, not for manipulation.
I finally was able to see your point. You mean that naming a field with a non-Python identifier would be forbidden, and that another attribute (like 'title', for example) would be provided in case the user wants to add some kind of data label. Something like:
records.array([...], names=["c1","c2","c3"], titles=["F one", "time&dime","çò"])
and have a new attribute called "titles" that keeps this info.
Well, I think that would be a very nice solution IMO.
I agree with Rick, Colin and Francesc on this point: symbolic names are important and I like the command-line completion too.

However, I have another concern: introducing recordArray["column"] as an alternative for recordArray.field("column") breaks a symmetry between, for instance, 1-d record arrays and 2-d normal arrays (the symmetry is strongly suggested by their representation: a record array prints almost as a list of tuples and a 2-d normal array almost as a list of lists).

Indexing a column of a 2-d normal array is done by normalArray[:, column], so why not recArray[:, "column"]? It removes the ambiguity between indexing with integers and with strings. Also, leaving the indices in 'natural' order becomes especially important when one envisages (record) arrays containing (record) arrays containing ....

I understand that this seems to open the door to recArray[32, "column"], but if it is really not feasible to mix integers and strings (or attribute names) as indices, I prefer to use

    recordArray.column[32]

and/or

    recordArray[32].column

rather than recordArray["column"][32].

Even indexing with integers only seems more natural to me than, e.g., recordArray["column"][32], since I can always do:

    column = 7
    recordArray[32, column]

Regards -- Gerard
A Dimarts 27 Juliol 2004 22:04, gerard.vermeulen@grenoble.cnrs.fr va escriure:
Introducing recordArray["column"] as an alternative for recordArray.field("column") breaks a symmetry between for instance 1-d record arrays and 2-d normal arrays. (the symmetry is strongly suggested by their representation: a record array prints almost as a list of tuples and a 2-d normal array almost as a list of lists).
Indexing a column of a 2-d normal array is done by normalArray[:, column], so why not recArray[:, "column"] ?
Well, I must recognize that this has its beauty (by revealing the symmetry that you mentioned). However, mixing integers and strings in indices can be, in my opinion, rather confusing for most people. Also, I guess that the implementation wouldn't be easy.
I prefer to use
recordArray.column[32]
and/or
recordArray[32].column
rather than recordArray["column"][32].
I would prefer:

    recordArray.fields.column[32]

or

    recordArray.cols.column[32]

(note the use of the plural in fields and cols, which I think is more consistent with its functionality)

The problem with:

    recordArray[32].fields.column

is that I don't see it as natural and, besides, completion capabilities would be broken after the [] brackets.

Anyway, as Russell suggested, I don't like recordArray["column"][32], because it would be unnecessary (you can get the same result using recordArray[column_idx][32]).

Although I recognize that recordArray.cols["column"][32] would not hurt my eyes so much. This is because, although the indices continue to mix ints and strings, ".cols" is placed first, giving a new (and unmistakable) meaning to the "column" index.

Cheers, -- Francesc Alted
On Wed, 28 Jul 2004 12:00:40 +0200 Francesc Alted <falted@pytables.org> wrote:
A Dimarts 27 Juliol 2004 22:04, gerard.vermeulen@grenoble.cnrs.fr va escriure:
Introducing recordArray["column"] as an alternative for recordArray.field("column") breaks a symmetry between for instance 1-d record arrays and 2-d normal arrays. (the symmetry is strongly suggested by their representation: a record array prints almost as a list of tuples and a 2-d normal array almost as a list of lists).
Indexing a column of a 2-d normal array is done by normalArray[:, column], so why not recArray[:, "column"] ?
Well, I must recognize that this has its beauty (by revealing the symmetry that you mentioned). However, mixing integers and strings in indices can be, in my opinion, rather confusing for most people. Also, I guess that the implementation wouldn't be easy.
I prefer to use
recordArray.column[32]
and/or
recordArray[32].column
rather than recordArray["column"][32].
I would prefer better:
recordArray.fields.column[32]
or
recordArray.cols.column[32]
(note the use of the plural in fields and cols, which I think is more consistent about its functionality)
The problem with:
recordArray[32].fields.column
is that I don't see it as natural and besides, completion capabilities would be broken after the [] parenthesis.
Two points:

1. This is true for vanilla Python but not for IPython-0.6.2:

    packer@zombie:~> ipython
    Python 2.3+ (#1, Jan 7 2004, 09:17:35)
    Type "copyright", "credits" or "license" for more information.

    IPython 0.6.2 -- An enhanced Interactive Python.
    ?       -> Introduction to IPython's features.
    @magic  -> Information about IPython's 'magic' @ functions.
    help    -> Python's own help system.
    object? -> Details about 'object'. ?object also works, ?? prints more.

    In [1]: d = {'Francesc': 0}

    In [2]: d['Francesc'].__a
    d['Francesc'].__abs__  d['Francesc'].__add__  d['Francesc'].__and__

    In [2]: d['Francesc'].__a

You see, the completion mechanism of ipython recognizes d['Francesc'] as an integer.

2. If one accepts that a "field_name" can be used as an attribute, one must be able to say:

    record.field_name ( == record.field("field_name") )

and (since recordArray[32] returns a record) also:

    recordArray[32].field_name

and not

    recordArray[32].cols.field_name (sorry, I abhor this)
Anyway, as Russell suggested, I don't like recordArray["column"][32], because it would be unnecessary (you can get same result using recordArray[column_idx][32]).
Thank you for this little slip, you mean recordArray["column"][32] is recordArray[32][column_idx], isn't it?
Although I recognize that a recordArray.cols["column"][32] would not hurt my eyes so much. This is because although indices continues to mix ints and strings, the difference is that ".cols" is placed first, giving a new (and unmistakable) meaning to the "column" index.
I am just worried that future generalization of indexing will be impossible if the meaning of an indexing operation ("get row" or "get column or field") depends on whether an index is a string or an integer: IMO the meaning should depend on the position in the index list.

The example has been chosen to show that I don't mind indexing by strings at all. If I see array[13, 'ab', 31, 'ba'], I know that 'ab' and 'ba' index record fields as long as the indices are in 'normal' order.

Nevertheless, I am aware that Utopia may be hard to implement efficiently, but this reflects my mental picture of nested (record) arrays. (ipython in Utopia would allow me to figure out array[13].ab[31].ba by tab completion and I would translate this to array[13, 'ab', 31, 'ba'] for efficiency in a real program.)

I think that we agree that recordArray.cols["column"] is better than recordArray["column"], but I don't see why recordArray.cols["column"] is better than the original recordArray.field("column").

Cheers -- Gerard

PS: after reading the above, there may be a case to accept only indexing which can be read from left to right, so recordArray[32].field_name is OK, but recordArray.field_name[32] is not.
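The position-based reading of an index list described above can be sketched with ordinary Python containers. Nested lists and dicts stand in for nested record arrays here, and get_path is a made-up helper; nothing in this sketch is numarray behavior.

```python
def get_path(obj, *indices):
    """Apply the indices one at a time, from left to right."""
    for index in indices:
        obj = obj[index]
    return obj


# Toy data: a list of records (dicts) whose field 'ab' holds
# another list of records with a field 'ba'.
inner = [{"ba": i * 10} for i in range(40)]
array = [{"ab": inner, "id": n} for n in range(20)]

chained = array[13]["ab"][31]["ba"]
flat = get_path(array, 13, "ab", 31, "ba")
print(chained, flat)
```

Each index is interpreted by the container it reaches, so its meaning follows from its position in the list rather than from a special rule for strings versus integers.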
A Dimecres 28 Juliol 2004 15:59, Gerard Vermeulen va escriure:
Two points:
1. This is true for vanilla Python but not for IPython-0.6.2: You see, the completion mechanism of ipython recognizes d['Francesc'] as an integer.
Ok. That's nice. IPython is more powerful than I realized :)
2. If one accepts that a "field_name" can be used as an attribute, one must be able to say:
record.field_name ( == record.field("field_name") )
and (since recordArray[32] returns a record) also:
recordArray[32].field_name
and not
recordArray[32].cols.field_name (sorry, I abhor this)
Mmm, maybe you are suggesting that the records.Record class should have all its attributes and methods start with a reserved prefix (like "_" or, better, "_v_" for attrs and "_f_" for methods), and that field names should be forbidden from starting with these prefixes, so that no collisions with field names could occur? Well, in that case adopting this convention for records.Record objects would be far more feasible than doing the same for records.RecArray objects, just because the former has very few attrs and methods. I think it's a good idea overall.
Anyway, as Russell suggested, I don't like recordArray["column"][32], because it would be unnecessary (you can get same result using recordArray[column_idx][32]).
Thank you for this little slip, you mean recordArray["column"][32] is recordArray[32][column_idx], isn't it?
Uh, my bad. I was (badly) trying to express the same thing as Russell Owen in a message dated 20th July: """ I think recarray[field name] is too easily confused with recarray[index] and is unnecessary. """
I think that we agree that recordArray.cols["column"] is better than recordArray["column"], but I don't see why recordArray.cols["column"] is better than the original recordArray.field("column").
Good question. Me neither. Are you proposing to keep recordArray.cols.column as the only way to access columns?
PS: after reading the above, there may be a case to accept only indexing which can be read from left to right, so recordArray[32].field_name is OK, but recordArray.field_name[32] is not.
Sorry, I don't see the point here (it is most probably my fault, given the hour at which I'm writing this :( ). Could you elaborate on that?

Cheers, -- Francesc Alted
Hi, Perry, your last proposal sounds good to me. Just a couple of comments. A Dilluns 26 Juliol 2004 17:43, Perry Greenfield va escriure:
4) To address the suggestions of Russell and Francesc, I'm proposing that the current "field" method now become an object (callable to retain backward compatibility) that supports:
   a) indexing by name or number (just like Records)
   b) name to attribute mapping (with restrictions).
So this means 3 ways to do things! As far as attribute access goes, I simply do not want to throw arbitrary attributes into the main object itself. The use of field is comparatively clean since it has no other public attributes. Aside from mapping '_' into spaces, no other illegal attribute characters will be mapped. (The identifier/label suggestion by Colin Williams has some merit, but on the whole, I think it brings more baggage than benefit.) The mapping algorithm is such that it tries to map the attribute to any field name that has either a ' ' or '_' in the place of '_' in the attribute name. While a name with all '_' will take precedence over any other match, there will be no guaranteed order for other cases (e.g., 'x_y z' vs 'x y_z' vs 'x y z'; though 'x_y_z' would be guaranteed to be selected for field.x_y_z if present).
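One way to read the mapping rule quoted above is sketched below. The function resolve is hypothetical and written in modern Python; it returns the first matching field name, which mirrors the statement that no order is guaranteed among inexact matches.

```python
import re


def resolve(attr, field_names):
    """Map an attribute name to a field name, treating '_' as ' ' or '_'."""
    if attr in field_names:
        return attr  # a field name with all '_' takes precedence
    # Build a pattern in which each '_' matches either '_' or ' '.
    pattern = "".join("[_ ]" if ch == "_" else re.escape(ch) for ch in attr)
    for name in field_names:
        if re.fullmatch(pattern, name):
            return name  # first hit; order among such matches is arbitrary
    raise AttributeError(attr)


print(resolve("x_y_z", ["x y z", "x_y z", "x_y_z"]))
print(resolve("home_address", ["home address"]))
```

The ambiguity is visible in the first call: without the exact-match shortcut, any of the three field names would be an acceptable answer for x_y_z.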
I guess that this mapping algorithm is weak enough to create some problems with special chars that are not supported. I'd prefer the dictionary/tuple of pairs mechanism in order to create a user-configured translation. I don't see the problem that Perry mentioned in an earlier message related to guaranteeing the persistence of such an object: we always have pickle, don't we? Or am I missing something?
To summarize
    Rarr.field.home_address
    Rarr.field['home address']
    Rarr.field('home address')
Supporting Rarr.field['home address'] and Rarr.field('home address') at the same time sounds unnecessary to me. Moreover, having a Rarr.field('home_address')[32] (for example) looks a bit strange, and I think Rarr.field['home_address'][32] would be better. But I repeat, this is my personal feeling.

I know that dropping support of __call__() in field will make the change backward incompatible, but perhaps now is a good time to define a better interface to the RecArray object. Another possibility may be to raise a deprecation warning for such a use for a couple of releases.

Regards, -- Francesc Alted
At 8:11 PM +0200 2004-07-26, Francesc Alted wrote:
... Supporting Rarr.field['home address'] and Rarr.field('home address') at the same time sounds unnecessary to me. Moreover, having a Rarr.field('home_address')[32] (for example) looks a bit strange, and I think Rarr.field['home_address'][32] would be better. But I repeat, this is my personal feeling.
I know that dropping support of __call__() in field will make the change backward incompatible, but perhaps now is a good time to define a better interface to the RecArray object. Another possibility may be to raise a deprecation warning for such a use for a couple of releases.
I completely agree. -- Russell
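The deprecation path suggested here could look roughly like the sketch below, which uses Python's standard warnings module. The class FieldAccessor and its data are invented for illustration and are not numarray code.

```python
import warnings


class FieldAccessor:
    """Hypothetical accessor: indexing is preferred, calling is deprecated."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def __getitem__(self, name):
        return self._columns[name]

    def __call__(self, name):
        # Keep the old call syntax working, but tell the caller to move on.
        warnings.warn(
            "calling field(name) is deprecated; use field[name] instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self[name]


field = FieldAccessor({"home_address": ["1 Main St", "2 Elm St"]})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = field("home_address")[1]

new_style = field["home_address"][1]
print(old_style == new_style, caught[0].category.__name__)
```

Both spellings return the same data for a couple of releases; only the call form emits the warning.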
I guess I've seen enough discussion to try to refine the last delta into what is the last (or next to last) version. So here are the changes to the last updated proposal:

1) I originally intended to narrow attribute access to strictly legal names as Rick White suggested but something got into me to try to handle spaces. I agree with Rick on this. I see that as a very simple rule to remember and don't see it as confusing to allow this.

2) Attribute access still won't be permitted directly on record arrays or records. I'm very much in agreement with Francesc that "fields" is more suggestive than "field" as the name of the record and record array object that permits both indexing and attribute access by name. The use of the field method will remain, but will eventually be deprecated. As to other names, namely cols, I'll stick with fields since it started with that usage, and since field is a more appropriate term when dealing with multidimensional record arrays (columns is much more suggestive of simple tables).

Non changes:

3) It will not be possible to index record arrays by column name. So Rarr["column 1"] will not be permitted, but Rarr.fields["column 1"] will. Nor will Rarr[32, "column 1"] be permitted.

4) As for optional labels (for display purposes) I'd like to hold off. I would like to have only one way to associate a name with a field, and until it is clearer what extra record array functionality would be associated with labels, I'd rather not include them. Even then, I'm not sure I want to see too much more dragged in (e.g., units, display formats, etc.). These sorts of things may be more appropriate for a subclass.

I realize that no single person will be happy with these choices, but they seem to me to be the best compromise without unduly complicating things, restricting future enhancements, or being too hard to implement. Has anything fallen into a crack?
So what follows is an updated version of what I last sent out:

******************************************************************

1) Russell Owen asked that indexing by field name not be permitted for record arrays and at least one other agreed. Since it is easier to add something like this later rather than take it away, I'll go along with that. So while it will be possible to index a Record by field name, it won't be for record arrays.

2) Russell asked if it would be possible to specify the types of the fields using numarray/chararray type objects. Yes, it will. We will adopt Rick White's 2nd suggestion for handling fields that themselves are arrays, i.e.,

    formats = (3,Int16), ((4,5), Float32)

for a 1-d Int16 cell of shape (3,) and a 2-d Float32 cell of shape (4,5). The first suggestion ("formats = 3*(Int16,), 4*(5*(Float32,),)") will not be supported. While it is very suggestive, it does allow for inconsistent nestings that must be checked and rejected (what if someone supplies (Int16, Int16, Float32) as one of the fields?), which complicates the code. It also doesn't read as well.

3) Russell also suggested nesting record arrays. This sort of capability is not being ruled out, but there isn't a chance we can devote resources to this any time soon (can anyone else?)

4) To address the suggestions of Russell and Francesc, I'm proposing that a new attribute "fields" be added that allows:
   a) indexing by name or number (just like Records)
   b) names as attributes so long as the name is allowable as a legal attribute. No attempt will be made to map names that are not legal attribute strings into a different attribute name.
The field method will remain and eventually be deprecated. Note that the only real need to support indexing, other than consistency, is to support slices. Only slices for numerical indexing will be supported (and not initially). The callable syntax can support index arrays just as easily.
To summarize

    Rarr.fields['home address']
    Rarr.field('home address')

will both work for a field named "home address", but this field cannot be specified as an attribute of Rarr.fields.

If there is a field named "intensity" then

    Rarr.fields.intensity

will be permitted.
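The summary above can be mimicked with a small pure-Python model. This is only a sketch under stated assumptions: the classes Fields and RecArray below are invented stand-ins written in modern Python, columns are plain lists, and none of it is the numarray implementation.

```python
import keyword


class Fields:
    """Hypothetical stand-in for the proposed `fields` attribute."""

    def __init__(self, names, columns):
        self._names = list(names)
        self._columns = list(columns)

    def __getitem__(self, key):
        # Index by field name or by field number.
        if isinstance(key, str):
            key = self._names.index(key)
        return self._columns[key]

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        legal = name.isidentifier() and not keyword.iskeyword(name)
        if legal and name in self._names:
            return self[name]
        raise AttributeError(name)


class RecArray:
    """Toy record array holding one list per field."""

    def __init__(self, names, columns):
        self.fields = Fields(names, columns)

    def field(self, key):
        # Older method spelling, kept for backward compatibility.
        return self.fields[key]


Rarr = RecArray(
    ["home address", "intensity"],
    [["1 Main St", "2 Elm St"], [0.5, 0.75]],
)

print(Rarr.fields["home address"][1])
print(Rarr.field("home address")[1])
print(Rarr.fields.intensity[0])
```

Keeping the name lookup on a separate fields object means field names can never collide with methods of the record array itself, which is the stated reason for not putting attributes on the main object.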
Hi Perry,

Well, after the bunch of messages talking about an *apparently* silly question, I must say that I mostly agree with your last proposal. The only thing that I strongly miss is that you have not decided to include the "titles" parameter in the constructor, along with the respective attribute. In my opinion, this would make it possible to forbid declaring illegal names as field names and provide full access to all attributes in *all* the ways you proposed. I think this is another kind of metainformation than just units, display formats, etc. A "titles" attribute is about providing functionality, not just adding information.

But, as you said, there will always be somebody not completely satisfied ;)

Anyway, thanks for listening to all of us and putting some good sense into all the mess that the discussion provoked.

Cheers, -- Francesc Alted
participants (8)
- Colin J. Williams
- Francesc Alted
- Gerard Vermeulen
- gerard.vermeulen@grenoble.cnrs.fr
- Paul Barrett
- Perry Greenfield
- Rick White
- Russell E Owen