multidimensional record arrays

There have been a number of questions and suggestions about how the record array facility in numarray could be improved. We've been talking about these internally and thought it would be useful to air some proposals, along with the rationale behind each proposal, its drawbacks, and some remaining open questions. Rather than do this in one long message, we will do it in pieces. The first addresses how to improve the handling of multidimensional record arrays. These messages will not discuss how or when we implement the proposed enhancements or changes. We first want to come to some consensus (or, lacking that, a decision) about what the target should be.

*********************************************************

Proposal for records module enhancement, to handle record arrays of dimension (rank) higher than 1.

Background:

The current records module in numarray doesn't handle record arrays of dimension higher than one well. Even though most of the infrastructure for higher dimensionality is already in place, the current implementation of record arrays was based on the implicit assumption that record arrays are 1-D. This limitation shows up in three areas: the input user interface, indexing, and output.

Indexing and output are more straightforward to modify, so I'll discuss them first. Although it is possible to create a multi-dimensional record array, indexing does not work properly for 2 or more dimensions. For example, for a 2-D record array r, r[i,j] does not give the correct result (but r[i][j] does). This will be fixed.

At present, a user cannot print record arrays higher than 1-D. This will also be fixed, and the output will incorporate some numarray features (e.g., printing only the beginning and end of a large array, as is done for numarrays now).
Input Interface:

There are currently several different ways to construct a record array using the array() function. These include setting the buffer argument to:

(1) None
(2) a file object
(3) a string object or appropriate buffer object (i.e., binary data)
(4) a list of records (in the form of sequences), for example: [(1,'abc', 2.3), (2,'xyz', 2.4)]
(5) a list of numarrays/chararrays, one for each field (effectively 'zipping' the arrays into records)

The first three types of input are very general and can be used to generate multi-dimensional record arrays in the current implementation. All of these options require the "shape" argument to be specified.

The input options that do not work for multi-dimensional record arrays now are the last two.

Option 4 (sequence of 'records')

If a user has a multi-dimensional record array and one or more fields are also multi-dimensional arrays, this option is potentially confusing: it can be ambiguous which part of the nested sequence structure is the structure of the record array and which part belongs to the record, since record elements themselves may be arrays. (Some of the same issues arise for object arrays.)

As an example:

  --> r=rec.array([([1,2],[3,4]),([11,12],[13,14])])

could be interpreted as a 1-D record array, where each cell is a (num)array:

  RecArray[
  (array([1, 2]), array([3, 4])),
  (array([11, 12]), array([13, 14]))
  ]

or as a 2-D record array, where each cell is just a number:

  RecArray(
  [[(1, 2), (3, 4)],
  [(11, 12), (13, 14)]])

Thus we propose a new argument "rank" (following the convention used in object arrays) to specify the dimensionality of the output record array. The first interpretation above corresponds to rank=1 and the second to rank=2. If rank is set to None, the highest possible rank will be assumed (in this example, 2).
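To make the two readings concrete, here is a minimal pure-Python sketch of how "rank" could split a nested sequence into array axes and record contents. It does not use numarray: the helpers nesting_depth() and interpret() are hypothetical names invented for this illustration, and the rule that the innermost sequence level is always part of the record is an assumption read off the example above.

```python
def nesting_depth(obj):
    """Count nested list/tuple levels, following the first element each time."""
    depth = 0
    while isinstance(obj, (list, tuple)) and len(obj) > 0:
        depth += 1
        obj = obj[0]
    return depth


def interpret(nested, rank=None):
    """Hypothetical helper: return (array_shape, first_record) for a given rank.

    The outer `rank` nesting levels are treated as record-array axes;
    everything below them is treated as the contents of one record.
    """
    # Assumption: the innermost level is the record itself, so it cannot
    # be used as an array axis.
    max_rank = nesting_depth(nested) - 1
    if rank is None:
        rank = max_rank  # rank=None means "highest possible rank"
    if not 1 <= rank <= max_rank:
        raise ValueError("rank must be between 1 and %d" % max_rank)
    shape = []
    level = nested
    for _ in range(rank):
        shape.append(len(level))
        level = level[0]
    return tuple(shape), level


data = [([1, 2], [3, 4]), ([11, 12], [13, 14])]
print(interpret(data, rank=1))  # ((2,), ([1, 2], [3, 4]))
print(interpret(data, rank=2))  # ((2, 2), [1, 2])
print(interpret(data))          # ((2, 2), [1, 2])
```

With rank=1 the first record holds two array-valued fields; with rank=2 (or rank=None) the same input becomes a 2x2 array of records whose fields are plain numbers.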
We propose to eventually generalize this to accept any sequence object for the array structure (though, as for other arrays, the nested sequences will be required to be of the same type). As would be expected, strings are not permitted as the enclosing sequence. In this future implementation the record 'item' itself must be one of:

1) A tuple
2) A subclass of tuple
3) A Record object (this may be taken care of by 2 if we make Record a subclass of tuple; this will be discussed in a subsequent proposal)

This requirement allows distinguishing the sequence of records from Option 5 below.

For tuples (or tuple-derived elements), the items of the tuple must be one of the following: a basic data type such as int, float, boolean, or string; a numarray or chararray; or an object that can be converted to a numarray or chararray.

Option 5 (List of Arrays)

Using a list of arrays to construct an N-D record array should be easier than using the previous option. The input syntax is simply:

  [array1, array2, array3,...]

The shape of the record array will be determined from the shapes of the input arrays as described below. All the user needs to do is construct the arrays in the list.

As with option 4, there is a possible ambiguity: if all the arrays have the shape, say, (2,3), then the user may intend a 1-D record array of 2 rows in which each cell is an array of shape (3,), or a 2-D record array of shape (2,3) in which each cell is a single number or string. Thus, the user must either explicitly specify the "shape" or the "rank". We propose the following behavior via examples:

Example 1:

given:

  array1.shape=(2,3,4,5)
  array2.shape=(2,3,4)
  array3.shape=(2,3)

Rank can only be specified as rank=1 (the record array's shape will then be (2,)) or rank=2 (the record array's shape will then be (2,3)). For rank=None the record array's shape will be (2,3), i.e. the "highest common denominator": each cell in the first field will be an array of shape (4,5), each cell in the second field will be an array of shape (4,), and each cell in the third field will be a single number or a string.

If "shape" is specified, it will take precedence over "rank", and its allowed values in this example will be either 2 or (2,3).

Example 2:

  array1.shape=(3,4,5)
  array2.shape=(4,5)

This will raise an exception because the 'slowest' axes do not match.

*********

For both the sequence-of-records and list-of-arrays input options, we propose that the default value for "rank" be None (the current default is 1). This gives behavior consistent with object arrays but does change the current behavior. Also, in both cases, specifying a shape inconsistent with the supplied data will raise an exception.
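The shape rules in Examples 1 and 2 can be sketched in plain Python, working only on shape tuples. This is an illustration of the proposed behavior, not numarray code: common_leading_shape() and resolve_record_shape() are hypothetical helpers, and the choice of ValueError as the exception type is an assumption.

```python
def common_leading_shape(shapes):
    """Longest run of leading ('slowest') axes shared by every input shape."""
    prefix = []
    for dims in zip(*shapes):
        if len(set(dims)) != 1:
            break
        prefix.append(dims[0])
    return tuple(prefix)


def resolve_record_shape(shapes, rank=None, shape=None):
    """Hypothetical helper: return (record_shape, per-field cell shapes)."""
    common = common_leading_shape(shapes)
    if not common:
        raise ValueError("the 'slowest' axes of the input arrays do not match")
    if shape is not None:
        # "shape" takes precedence over "rank"; an int means a 1-D shape.
        record_shape = (shape,) if isinstance(shape, int) else tuple(shape)
        if not record_shape or record_shape != common[:len(record_shape)]:
            raise ValueError("shape is inconsistent with the supplied arrays")
    elif rank is None:
        record_shape = common  # the "highest common denominator"
    else:
        if not 1 <= rank <= len(common):
            raise ValueError("rank must be between 1 and %d" % len(common))
        record_shape = common[:rank]
    n = len(record_shape)
    return record_shape, [s[n:] for s in shapes]


example1 = [(2, 3, 4, 5), (2, 3, 4), (2, 3)]
print(resolve_record_shape(example1))           # ((2, 3), [(4, 5), (4,), ()])
print(resolve_record_shape(example1, rank=1))   # ((2,), [(3, 4, 5), (3, 4), (3,)])
print(resolve_record_shape(example1, shape=2))  # ((2,), [(3, 4, 5), (3, 4), (3,)])

try:
    resolve_record_shape([(3, 4, 5), (4, 5)])   # Example 2
except ValueError as exc:
    print("error:", exc)
```

An empty cell shape () stands for a field whose cells are single numbers or strings.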
Jin-chung Hsu