Hello all,

Documentation of recarrays is poor and I'd like to improve it. To do this I've been looking at core/records.py, and I would appreciate some feedback on my plan.

Let me start by describing what I see. The docs confuse 'structured arrays', 'record arrays' and 'recarrays', and often use the terms interchangeably. They also refer to structured dtypes variously as 'struct data types', 'record data types' or simply 'records' (e.g., see the reference/arrays.dtypes and reference/arrays.indexing doc pages).

But by my reading of the code there are really three (or four) distinct types of arrays with structure. Here's a possible nomenclature:

* "Structured arrays" are simply ndarrays with structured dtypes. That is, the data type is subdivided into fields of different types.
* "recarrays" are a subclass of ndarray that allows access to the fields by attribute.
* "Record arrays" are recarrays whose elements have additionally been converted to the 'numpy.core.records.record' type, so that each data element is an object with field attributes.
* (It is also possible to create arrays with a dtype.type of numpy.core.records.record that are not recarrays. However, I have never seen this done.)

Here's code demonstrating the creation of the different types of array (in order: structured array, recarray, ???, record array).

>>> arr = np.array([(1,'a'), (2,'b')], dtype=[('foo', int), ('bar', 'S1')])
>>> recarr = arr.view(type=np.recarray)
>>> noname = arr.view(dtype=np.dtype((np.record, arr.dtype)))
>>> recordarr = arr.view(dtype=np.dtype((np.record, arr.dtype)), type=np.recarray)

>>> type(arr), arr.dtype.type
(numpy.ndarray, numpy.void)
>>> type(recarr), recarr.dtype.type
(numpy.core.records.recarray, numpy.void)
>>> type(recordarr), recordarr.dtype.type
(numpy.core.records.recarray, numpy.core.records.record)

Note that the functions numpy.rec.array, numpy.rec.fromrecords, numpy.rec.fromarrays, and np.recarray.__new__ create record arrays.
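As a rough, runnable illustration of the nomenclature above (a sketch, not an excerpt from the NumPy docs): the snippet builds a structured array, a record array through an explicit np.record dtype, and a record array through np.rec.array, then checks type and dtype.type for each. It deliberately leaves out the plain arr.view(np.recarray) case, on the assumption that what that call returns may differ between NumPy releases. The names record_dtype and from_rec are only illustrative.

```python
import numpy as np

# Structured array: a plain ndarray whose dtype is split into named fields.
arr = np.array([(1, b'a'), (2, b'b')], dtype=[('foo', int), ('bar', 'S1')])

# Wrap the structured dtype so that its scalar type is np.record.
record_dtype = np.dtype((np.record, arr.dtype))

# Record array: the recarray subclass combined with np.record elements.
recordarr = arr.view(dtype=record_dtype, type=np.recarray)

# np.rec.array is one of the constructors listed above as creating record arrays.
from_rec = np.rec.array(arr)

for name, a in [('arr', arr), ('recordarr', recordarr), ('from_rec', from_rec)]:
    print(name, type(a).__name__, a.dtype.type.__name__)

# Field access by attribute, on the whole array and on a single element.
print(from_rec.foo)
print(recordarr[0].foo)
```

The structured array keeps numpy.void as its dtype.type, while both record arrays report the record type, which is the distinction the nomenclature relies on.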
However, the docs contain examples that create recarrays, e.g. in the recarray and ndarray.view docstrings and in http://www.scipy.org/Cookbook/Recarray. The files numpy/lib/recfunctions.py and numpy/lib/npyio.py (and possibly masked arrays, but I haven't looked yet) make extensive use of recarrays (but not record arrays).

The main functional difference between recarrays and record arrays is field access on individual elements:

>>> recordarr[0].foo
1
>>> recarr[0].foo
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'numpy.void' object has no attribute 'foo'

Also, note that recarrays have a small performance penalty relative to structured arrays, and record arrays have a further penalty relative to recarrays because of the additional Python logic.

So my first goal in updating the docs is to use the right terms in the right place. In almost all cases, references to 'records' (e.g. 'record types') should be replaced with 'structured' (e.g. 'structured types'), except in docs that deal specifically with record arrays. My guess is that in the distant past structured datatypes were intended always to be of type numpy.core.records.record (hence the description in reference/arrays.dtypes), but that numpy.core.records.record became generally obsolete without the docs being updated. doc/records.rst.txt seems to document the transition.

I've made a preliminary pass of the docs, which you can see here:
https://github.com/ahaldane/numpy/commit/d87633b228dabee2ddfe75d1ee9e41ba703...

Mostly I renamed 'record type' to 'structured type', and added a very rough draft to numpy/doc/structured_arrays.py.

I would love to hear from those more knowledgeable than myself on whether this works!

Cheers,
Allan
In light of my previous message I'd like to bring up https://github.com/numpy/numpy/issues/3581, as it is now clearer to me what is happening.

In the example on that page the user creates a recarray and a record array (in my nomenclature) without realizing that they are slightly different types of beast. This is probably because the str() and repr() representations of these two objects are identical. To distinguish them you have to look at their dtype.type. Using the setup from my last message:

>>> print(repr(recarr))
rec.array([(1, 'a'), (2, 'b')], dtype=[('foo', '<i8'), ('bar', 'S1')])
>>> print(repr(recordarr))
rec.array([(1, 'a'), (2, 'b')], dtype=[('foo', '<i8'), ('bar', 'S1')])
>>> print(repr(recarr.dtype))
dtype([('foo', '<i8'), ('bar', 'S1')])
>>> print(repr(recordarr.dtype))
dtype([('foo', '<i8'), ('bar', 'S1')])
>>> print(recarr.dtype.type)
<type 'numpy.void'>
>>> print(recordarr.dtype.type)
<class 'numpy.core.records.record'>

Based on this, it occurs to me that the repr of a dtype should list dtype.type if it is not numpy.void. This might be nice to see:

dtype([('foo', '<i8'), ('bar', 'S1')])
dtype((numpy.core.records.record, [('foo', '<i8'), ('bar', 'S1')]))

I could easily implement this by redefining __repr__ for the numpy.core.records.record class, but this does not solve the problem for any other cases of overridden base_dtype. So perhaps the modifications should be made to the original repr function of dtype (in the functions arraydescr_struct_str and arraydescr_struct_repr in numpy/core/src/multiarray/descriptor.c).

However, also note that the doc http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html says (near the bottom) that creating dtypes using the form dtype((base_dtype, new_dtype)) is discouraged.

Another possibility is to discourage recarrays and document only record arrays (or vice versa). However, many people's code already depends on both of these types.

Is any of this at all reasonable?
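The repr suggested earlier in this message can be sketched in pure Python, purely for illustration. describe_dtype is a hypothetical helper, not a NumPy function, and it is not how a real change to descriptor.c would look. It assumes a structured dtype and builds its string from dt.descr rather than from repr(dt), so the field formats may be spelled slightly differently from the output shown above.

```python
import numpy as np

def describe_dtype(dt):
    """Hypothetical helper: render a structured dtype, naming dtype.type
    whenever it is something other than numpy.void."""
    fields = dt.descr  # list of (name, format) tuples for the fields
    if dt.type is np.void:
        return "dtype({!r})".format(fields)
    type_name = "{}.{}".format(dt.type.__module__, dt.type.__name__)
    return "dtype(({}, {!r}))".format(type_name, fields)

plain = np.dtype([('foo', int), ('bar', 'S1')])
wrapped = np.dtype((np.record, plain))

print(describe_dtype(plain))    # plain structured dtype, no type shown
print(describe_dtype(wrapped))  # same fields, with the record type named
```

With a representation along these lines, the two dtypes from the example would no longer print identically.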
The repr proposal would require a change to dtype str and repr, which could affect a lot of things.

Cheers,
Allan

In reply to the message sent by Allan Haldane on 01/18/2015 11:36 PM.