Nested recarrays with subarrays and loadtxt: a bug in loadtxt?
Howdy,

I'm wondering if the code below illustrates a bug in loadtxt, or just a 'live with it' limitation. I'm inlining it for ease of discussion, but the same code is attached to ensure that anyone willing to look at this can just download and run without pasting/whitespace issues.

The code is, I hope, sufficiently commented to explain the problem in full detail. Thanks for any input!

Cheers,

f

###

"""Simple illustration of nested record arrays.

Note: possible numpy.loadtxt bug?"""

from StringIO import StringIO

import numpy as np
from numpy import array, dtype, loadtxt, recarray

# Consider the task of loading data that is stored in plain text in a file such
# as the string below, where the last block of numbers is meant to be
# interpreted as a single 2x3 int array, whose field name in the resulting
# structured array will be 'block'.
txtdata = StringIO("""
# name x y block - 2x3 ints
aaaa 1.0 8.0 1 2 3 4 5 6
aaaa 2.0 7.4 2 11 22 3 4 5 6
bbbb 3.5 8.5 3 0 22 44 5 6
aaaa 6.4 4.0 4 1 3 33 54 65
aaaa 8.8 4.1 5 5 3 4 44 77
bbbb 5.5 9.1 6 3 4 5 0 55
bbbb 7.7 8.5 7 2 3 4 5 66
""")

# We make the dtype for it:
dt = dtype(dict(names=['name','x','y','block'],
                formats=['S4',float,float,(int,(2,3))]))

# And we load it with loadtxt and make a recarray version for convenience
data = loadtxt(txtdata,dt)
rdata = data.view(recarray)

# Unfortunately, if we look at the block data, it repeats the first number
# found. This seems to be a loadtxt bug:
# In [176]: rdata.block[0,1]
# Out[176]: array([1, 1, 1])

# we'd expect array([4, 5, 6])
if np.any(rdata.block[0,1] != array([4, 5, 6])):
    print 'WARNING: loadtxt bug??'

# A workaround can be used by doing a second pass on the file, loading the
# columns corresponding to the block as plain ints and doing a reassignment of
# that data into the original data.

# Rewind the data and reload only the 'block' of ints:
txtdata.seek(0)
block_data = loadtxt(txtdata,int,usecols=range(3,9))

# Let's work with a copy of the original so we can compare interactively...
rdata2 = rdata.copy()

# We assign to the block field in our real array the block_data one,
# appropriately reshaped
rdata2.block[:] = block_data.reshape(rdata.block.shape)

# Same check as before, with the new one
if np.any(rdata2.block[0,1] != array([4, 5, 6])):
    print 'WARNING: loadtxt bug??'
else:
    print 'Second pass - data loaded OK.'
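
For readers on Python 3, the two-pass workaround described in the script above can be sketched as follows. This is a minimal sketch rather than the attached script: the sample text is a hypothetical trimmed copy of the data in which every row has exactly six block integers, the name field uses 'U4' instead of 'S4', and the result is assembled into a fresh array instead of being patched in place.

```python
import io

import numpy as np

# Hypothetical sample: every row carries exactly six block integers.
TEXT = """\
# name x y block - 2x3 ints
aaaa 1.0 8.0 1 2 3 4 5 6
bbbb 3.5 8.5 3 0 22 44 5 6
aaaa 6.4 4.0 4 1 3 33 54 65
"""

# Target dtype: three scalar fields plus a (2, 3) integer subarray.
dt = np.dtype([("name", "U4"), ("x", float), ("y", float),
               ("block", int, (2, 3))])

# Pass 1: read only the three scalar columns.
scalar_dt = np.dtype([("name", "U4"), ("x", float), ("y", float)])
scalars = np.loadtxt(io.StringIO(TEXT), dtype=scalar_dt, usecols=(0, 1, 2))

# Pass 2: re-read the text and pull the six block columns as plain ints.
block = np.loadtxt(io.StringIO(TEXT), dtype=int, usecols=list(range(3, 9)))

# Assemble the final structured array from both passes.
data = np.zeros(scalars.shape[0], dtype=dt)
for field in scalar_dt.names:
    data[field] = scalars[field]
data["block"] = block.reshape(-1, 2, 3)

print(data["block"][0, 1])
```

Because the second pass selects columns 3 through 8 explicitly, it does not depend on how loadtxt interprets the subarray field.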
On May 27, 2009, at 5:53 PM, Fernando Perez wrote:
> Howdy,
>
> I'm wondering if the code below illustrates a bug in loadtxt, or just a 'live with it' limitation.

Have you tried np.lib.io.genfromtxt ?

dt = dtype(dict(names=['name','x','y','block'],
                formats=['S4',float,float,(int,(2,3))]))
txtdata = StringIO("""
# name x y block - 2x3 ints
aaaa 1.0 8.0 1 2 3 4 5 6
aaaa 2.0 7.4 2 11 22 3 4 5 6
bbbb 3.5 8.5 3 0 22 44 5 6
aaaa 6.4 4.0 4 1 3 33 54 65
aaaa 8.8 4.1 5 5 3 4 44 77
bbbb 5.5 9.1 6 3 4 5 0 55
bbbb 7.7 8.5 7 2 3 4 5 66
""")

alt_data = np.lib.io.genfromtxt(txtdata,dtype=dt)

array([('aaaa', 1.0, 8.0, [[1, 1, 1], [1, 1, 1]]),
       ('aaaa', 2.0, 7.4000000000000004, [[2, 2, 2], [2, 2, 2]]),
       ('bbbb', 3.5, 8.5, [[3, 3, 3], [3, 3, 3]]),
       ('aaaa', 6.4000000000000004, 4.0, [[4, 4, 4], [4, 4, 4]]),
       ('aaaa', 8.8000000000000007, 4.0999999999999996, [[5, 5, 5], [5, 5, 5]]),
       ('bbbb', 5.5, 9.0999999999999996, [[6, 6, 6], [6, 6, 6]]),
       ('bbbb', 7.7000000000000002, 8.5, [[7, 7, 7], [7, 7, 7]])],
      dtype=[('name', '|S4'), ('x', '<f8'), ('y', '<f8'), ('block', '<i4', (2, 3))])

Is this what you want?
Hi Pierre,

On Wed, May 27, 2009 at 3:01 PM, Pierre GM <pgmdevlist@gmail.com> wrote:
> Have you tried np.lib.io.genfromtxt ?

I didn't know about it, but it has the same problem as loadtxt:

In [5]: rdata.block[0,1]  # incorrect
Out[5]: array([1, 1, 1])

In [6]: alt_data.block[0,1]  # same thing, still wrong
Out[6]: array([1, 1, 1])

In [7]: rdata2.block[0,1]  # with my manual workaround, this is right
Out[7]: array([4, 5, 6])

The data is:

# name x y block - 2x3 ints
aaaa 1.0 8.0 1 2 3 4 5 6
...

so only rdata2 is correct; the others repeat the first '1' throughout the entire block, which is the problem.

Cheers,

f
On May 27, 2009, at 6:15 PM, Fernando Perez wrote:
> Hi Pierre,
>
> On Wed, May 27, 2009 at 3:01 PM, Pierre GM <pgmdevlist@gmail.com> wrote:
>> Have you tried np.lib.io.genfromtxt ?
>
> I didn't know about it, but it has the same problem as loadtxt:

Oh yes indeed. Yet another case of "I-opened-my-mouth-too-soon"...

OK, so there's a trick. Kinda:

* Define a specific converter:

def block_converter(values):
    # Convert the strings to int
    val = [int(_) for _ in values.split()]
    new = np.array(val, dtype=int).reshape(2,3)
    out = tuple([tuple(_) for _ in new])
    return out

* Now, make sure that the column delimiter is set to '\t' and use the new converter:

data = genfromtxt(txtdata,dt, delimiter="\t", converters={3:block_converter})

That works if your second line is
"aaaa 2.0 7.4 2 11 22 3 4 56"
instead of
"aaaa 2.0 7.4 2 11 22 3 4 5 6"
(that is, if you have exactly 6 ints in the last entry, not 7).

Note that you could modify the converter to deal with that if needed.
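
The closing remark about modifying the converter can be sketched as below. This is only an illustration under stated assumptions: the `shape` parameter and the policy of silently ignoring surplus trailing integers are hypothetical choices, not something proposed in the thread, and the function is exercised here on plain strings rather than through genfromtxt.

```python
import numpy as np


def block_converter(field, shape=(2, 3)):
    """Convert a whitespace-separated string of ints into nested tuples.

    Assumption for this sketch: surplus trailing values are dropped,
    and too few values raise an error.
    """
    needed = shape[0] * shape[1]
    values = [int(token) for token in field.split()]
    if len(values) < needed:
        raise ValueError("expected at least %d ints, got %d"
                         % (needed, len(values)))
    block = np.array(values[:needed], dtype=int).reshape(shape)
    # Return plain Python ints so the result is easy to compare and print.
    return tuple(tuple(int(v) for v in row) for row in block)


exact = block_converter("1 2 3 4 5 6")
surplus = block_converter("2 11 22 3 4 5 6")
print(exact)
print(surplus)
```

Dropping the extra value is one possible policy; raising an error for malformed rows would be an equally reasonable one, depending on how much the input is trusted.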
Hi Pierre,

On Wed, May 27, 2009 at 4:03 PM, Pierre GM <pgmdevlist@gmail.com> wrote:
> Oh yes indeed. Yet another case of "I-opened-my-mouth-too-soon"...
>
> OK, so there's a trick. Kinda: * Define a specific converter:

Thanks, that's an alternative, though I think I prefer my two-pass hack; I can't really say why...

Cheers,

f
On May 27, 2009, at 7:10 PM, Fernando Perez wrote:
> Hi Pierre,
>
> On Wed, May 27, 2009 at 4:03 PM, Pierre GM <pgmdevlist@gmail.com> wrote:
>> Oh yes indeed. Yet another case of "I-opened-my-mouth-too-soon"...
>>
>> OK, so there's a trick. Kinda: * Define a specific converter:
>
> Thanks, that's an alternative, though I think I prefer my two-pass hack; I can't really say why...

Funny, I prefer mine ;) Seriously: your 2-pass method reads the file twice, and that overhead might be inconvenient. Some timing would be needed...
Hi Fernando

2009/5/27 Fernando Perez <fperez.net@gmail.com>:
> I'm wondering if the code below illustrates a bug in loadtxt, or just a 'live with it' limitation.

I'm not sure whether this is a bug or not. By specifying the dtype

dt = dtype(dict(names=['name','x','y','block'],
                formats=['S4',float,float,(int,(2,3))]))

you are saying "column four contains 6 integers", which is a bit of a strange notion. If you want this to be interpreted as "the last 6 columns should be stored in block", then a simple modification to flatten_dtype should do the trick.

Cheers

Stéfan
Hi Stefan,

2009/5/27 Stéfan van der Walt <stefan@sun.ac.za>:
> Hi Fernando
>
> 2009/5/27 Fernando Perez <fperez.net@gmail.com>:
>> I'm wondering if the code below illustrates a bug in loadtxt, or just a 'live with it' limitation.
>
> I'm not sure whether this is a bug or not.
> By specifying the dtype
>
> dt = dtype(dict(names=['name','x','y','block'], formats=['S4',float,float,(int,(2,3))]))
>
> you are saying "column four contains 6 integers", which is a bit of a strange notion. If you want this to be interpreted as "the last 6 columns should be stored in block", then a simple modification to flatten_dtype should do the trick.

Well, since dtypes allow for nesting full arrays in this fashion, where I can say that the 'block' field can have (2,3) shape, it seems like it would be nice to be able to express this nesting when loading plain text files as well. The idea would be that any nested dtype like the above would be expanded out for reading purposes into columns, so that the dt spec is interpreted in the second form you provided.

So I'd give it a mild +0.5 for this modification if it's indeed easy, since it seems to make loadtxt more convenient to use for this class of uses. But if people feel it's stretching things too far, there's always either the two-pass hack I used or the custom converter Pierre suggested...

Thanks for the feedback!

Cheers,

f
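
The "expand the nested dtype into columns" idea discussed above can be illustrated with a small sketch. The helper name `expand_columns` is hypothetical and this is not NumPy's internal flatten_dtype; it only shows how a subarray field of shape (2, 3) maps onto six scalar columns.

```python
import numpy as np


def expand_columns(dt):
    """Return one scalar dtype per text column implied by `dt`.

    Hypothetical helper for illustration: structured fields are walked
    in order, and a subarray field contributes one entry per element.
    """
    dt = np.dtype(dt)
    if dt.names is not None:
        columns = []
        for name in dt.names:
            # dt.fields[name] is (field dtype, byte offset[, title]).
            columns.extend(expand_columns(dt.fields[name][0]))
        return columns
    if dt.subdtype is not None:
        base, shape = dt.subdtype
        count = int(np.prod(shape))
        return expand_columns(base) * count
    return [dt]


dt = np.dtype([("name", "S4"), ("x", float), ("y", float),
               ("block", int, (2, 3))])
columns = expand_columns(dt)
print(len(columns), [c.kind for c in columns])
```

Under this reading, a text row needs nine whitespace-separated values for this dtype: one string, two floats, and six integers.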
Hi Fernando

2009/5/28 Fernando Perez <fperez.net@gmail.com>:
> Well, since dtypes allow for nesting full arrays in this fashion, where I can say that the 'block' field can have (2,3) shape, it seems like it would be nice to be able to express this nesting when loading plain text files as well.

I think that would be very useful. Please verify whether

http://projects.scipy.org/numpy/changeset/7022

does the trick!

Cheers

Stéfan
2009/5/27 Stéfan van der Walt <stefan@sun.ac.za>:
> Hi Fernando
>
> 2009/5/28 Fernando Perez <fperez.net@gmail.com>:
>> Well, since dtypes allow for nesting full arrays in this fashion, where I can say that the 'block' field can have (2,3) shape, it seems like it would be nice to be able to express this nesting when loading plain text files as well.
>
> I think that would be very useful. Please verify whether
>
> http://projects.scipy.org/numpy/changeset/7022
>
> does the trick!

beeooteefool! No warnings now:

uqbar[recarray]> python rec_nested.py
Second pass - data loaded OK.

This is great, many thanks :)

Cheers,

f
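
With the behaviour confirmed above, a nested dtype can be passed to loadtxt directly. The following is a minimal sketch assuming a NumPy release that includes this change and Python 3; the sample text is a hypothetical trimmed copy of the data with exactly six block integers per row, and 'U4' replaces 'S4' for the name field.

```python
import io

import numpy as np

# Hypothetical sample: every row carries exactly six block integers.
TEXT = """\
# name x y block - 2x3 ints
aaaa 1.0 8.0 1 2 3 4 5 6
bbbb 3.5 8.5 3 0 22 44 5 6
aaaa 6.4 4.0 4 1 3 33 54 65
"""

dt = np.dtype([("name", "U4"), ("x", float), ("y", float),
               ("block", int, (2, 3))])

# Single pass: the last six columns fill the (2, 3) 'block' field.
data = np.loadtxt(io.StringIO(TEXT), dtype=dt)
rdata = data.view(np.recarray)

print(rdata.block[0, 1])
```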
On May 27, 2009, at 7:29 PM, Stéfan van der Walt wrote:
> Hi Fernando
>
> 2009/5/28 Fernando Perez <fperez.net@gmail.com>:
>> Well, since dtypes allow for nesting full arrays in this fashion, where I can say that the 'block' field can have (2,3) shape, it seems like it would be nice to be able to express this nesting when loading plain text files as well.
>
> I think that would be very useful. Please verify whether

I fixed it for genfromtxt as well (r7023). Should we backport the changes?
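
The genfromtxt counterpart can be sketched the same way. As before, this assumes a NumPy release that includes the change and Python 3; the sample text is hypothetical, with exactly six block integers per row.

```python
import io

import numpy as np

# Hypothetical sample: every row carries exactly six block integers.
TEXT = """\
# name x y block - 2x3 ints
aaaa 1.0 8.0 1 2 3 4 5 6
bbbb 3.5 8.5 3 0 22 44 5 6
aaaa 6.4 4.0 4 1 3 33 54 65
"""

dt = np.dtype([("name", "U4"), ("x", float), ("y", float),
               ("block", int, (2, 3))])

# genfromtxt expands the subarray field over the last six columns.
alt_data = np.genfromtxt(io.StringIO(TEXT), dtype=dt)

print(alt_data["block"][0, 1])
```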
participants (3)
- Fernando Perez
- Pierre GM
- Stéfan van der Walt