loadtxt() behavior on single-line files
Hi,

I was having the hardest time trying to figure out an intermittent bug in one of my programs. Essentially, in some situations, it was throwing an error saying that the array object was not an array. It took me a while, but then I figured out that my program was assuming that the object returned from a loadtxt() call was always a structured array (I was using dtypes). However, if the data file being loaded only had one data record, then all you get back is a structured record.

    import numpy as np
    from StringIO import StringIO

    # Two records: loadtxt() returns a 1-D structured array.
    strData = StringIO("89.23 47.2\n13.2 42.2")
    a = np.loadtxt(strData, dtype=[('x', float), ('y', float)])
    print "Length Two"
    print a
    print a.shape
    print len(a)

    # One record: loadtxt() returns a 0-D result, so len() fails.
    strData = StringIO("53.2 49.2")
    a = np.loadtxt(strData, dtype=[('x', float), ('y', float)])
    print "\n\nLength One"
    print a
    print a.shape
    try:
        print len(a)
    except TypeError as err:
        print "ERROR:", err

Which gets me this output:

    Length Two
    [(89.230000000000004, 47.200000000000003) (13.199999999999999, 42.200000000000003)]
    (2,)
    2


    Length One
    (53.200000000000003, 49.200000000000003)
    ()
    ERROR: len() of unsized object

Note that this isn't restricted to structured arrays. For regular ndarrays, loadtxt() appears to mimic the behavior of np.squeeze():

    a = np.ones((1, 1, 1))
    np.squeeze(a)[0]
    IndexError: 0-d arrays can't be indexed

    strData = StringIO("53.2")
    a = np.loadtxt(strData)
    a[0]
    IndexError: 0-d arrays can't be indexed

So, if you have multiple lines with multiple columns, you get a 2-D array, as expected. If you have a single line of data with multiple columns, you get a 1-D array. If you have a single column with many lines, you also get a 1-D array (which is probably expected, I guess). If you have a single column with a single line, you get a scalar (actually, a 0-D array).

Is this a bug or a feature? I can see the advantages of having loadtxt() return the lowest number of dimensions that can hold the given data, but it leaves the code vulnerable to certain edge cases. Maybe there is a different way I should be doing this, but I feel that this behavior at the very least should be included in the loadtxt documentation.

Ben Root
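The four cases described above, plus the two structured-array cases, can be tabulated in one short script. This is a sketch written for Python 3 (so it uses io.StringIO rather than the StringIO module from the original post) and it assumes loadtxt() is called with its default arguments; the helper name loaded_shape is made up for illustration.

```python
from io import StringIO

import numpy as np


def loaded_shape(text, **kwargs):
    """Return the shape of the array np.loadtxt builds from `text`."""
    return np.loadtxt(StringIO(text), **kwargs).shape


# Illustrative record layout, same as in the original post.
record = [("x", float), ("y", float)]

shapes = {
    "2 lines, 2 columns": loaded_shape("1.0 2.0\n3.0 4.0"),
    "1 line, 2 columns": loaded_shape("1.0 2.0"),
    "2 lines, 1 column": loaded_shape("1.0\n2.0"),
    "1 line, 1 column": loaded_shape("1.0"),
    "2 records (structured)": loaded_shape("1.0 2.0\n3.0 4.0", dtype=record),
    "1 record (structured)": loaded_shape("1.0 2.0", dtype=record),
}

for case, shape in shapes.items():
    print(case, "->", shape)
```

Note that the "1 line, 2 columns" and "2 lines, 1 column" cases end up with the same 1-D shape, so the caller cannot tell from the result which layout the file had.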
Benjamin Root wrote:
Note that this isn't restricted to structured arrays. For regular ndarrays, loadtxt() appears to mimic the behavior of np.squeeze():
Exactly. The last four lines of the function are:

    X = np.squeeze(X)
    if unpack:
        return X.T
    else:
        return X
It would be useful to be able to tell loadtxt to not call squeeze, so a program that reads column-formatted data doesn't have to treat the case of a single line specially.

Warren
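For readers on a newer NumPy: later releases added an ndmin argument to loadtxt() that addresses this kind of request. The sketch below assumes such a release is installed; it was not available to the posters at the time of this thread.

```python
from io import StringIO

import numpy as np

# Assumes a NumPy release whose loadtxt() accepts `ndmin`.
one_line = np.loadtxt(StringIO("53.2 49.2"), ndmin=2)
many_lines = np.loadtxt(StringIO("89.23 47.2\n13.2 42.2"), ndmin=2)

# Both results are 2-D, so row-wise code needs no special case.
for rows in (one_line, many_lines):
    for x, y in rows:
        print(x, y)

# With a structured dtype each row is one record, so 1-D is the
# natural minimum rather than 2-D.
record = [("x", float), ("y", float)]
single = np.loadtxt(StringIO("53.2 49.2"), dtype=record, ndmin=1)
print(single.shape, len(single))
```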
Warren Weckesser wrote:
It would be useful to be able to tell loadtxt to not call squeeze, so a program that reads column-formatted data doesn't have to treat the case of a single line specially.
I agree -- it seems to me that every time I load data, I know what shape I expect the result to be -- I'd never want it to squeeze.

It might be nice if you could specify the dimensionality of the array you want. But for now: can you just do a reshape?

    In [42]: strData = StringIO("53.2 49.2")
    In [43]: a = np.loadtxt(strData, dtype=[('x', float), ('y', float)]).reshape((-1,))
    In [45]: a.shape
    Out[45]: (1,)

-Chris

--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception
Chris.Barker@noaa.gov
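The reshape workaround can be written out for both kinds of data. This is a sketch in Python 3; the variable n_cols is an assumption standing in for a column count the caller already knows, since it cannot be recovered from a squeezed result.

```python
from io import StringIO

import numpy as np

record = [("x", float), ("y", float)]

# Structured data: a lone record comes back 0-D; flatten it to length 1.
raw = np.loadtxt(StringIO("53.2 49.2"), dtype=record)
records = raw.reshape(-1)  # np.atleast_1d(raw) gives the same shape
print(raw.shape, records.shape, len(records))

# Plain float data: reshape(-1) would merge rows and columns into one
# axis, so state the expected column count instead.
n_cols = 2  # known to the caller, not inferred from the file
table = np.loadtxt(StringIO("53.2 49.2")).reshape(-1, n_cols)
print(table.shape)
```

Using -1 for the row count lets the same line handle files with one row or many.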
On Thu, Jun 24, 2010 at 1:00 PM, Warren Weckesser < warren.weckesser@enthought.com> wrote:
It would be useful to be able to tell loadtxt to not call squeeze, so a program that reads column-formatted data doesn't have to treat the case of a single line specially.
Warren
I don't know if that is the best way to solve the problem. In that case, you would always get a 2-D array, right? Is that useful for those who have text data as a single column? Maybe a mindim keyword (with None as default) and apply an appropriate "atleast_Nd()" call (or maybe have available an .atleast_nd() function?). But, then what would this mean for structured arrays? One might think that they want at least 2-D, but they really want at least 1-D.

Ben Root

P.S. - Taking this a step further, the functions completely fail in dealing with empty files... In MATLAB, it returns an empty array (matrix?).
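One way to picture the mindim idea is a thin wrapper around loadtxt() that applies the matching np.atleast_Nd() call afterwards. The function loadtxt_mindim and its mindim parameter are hypothetical names used only for this sketch; they are not part of NumPy.

```python
from io import StringIO

import numpy as np

# Map a requested minimum dimensionality to NumPy's atleast_* helpers.
_ATLEAST = {1: np.atleast_1d, 2: np.atleast_2d, 3: np.atleast_3d}


def loadtxt_mindim(source, mindim=None, **kwargs):
    """Hypothetical wrapper: np.loadtxt() plus a minimum dimensionality."""
    data = np.loadtxt(source, **kwargs)
    if mindim is None:
        return data  # default: keep loadtxt()'s own squeezed result
    return _ATLEAST[mindim](data)


record = [("x", float), ("y", float)]
single = loadtxt_mindim(StringIO("53.2 49.2"), mindim=1, dtype=record)
row = loadtxt_mindim(StringIO("1.0 2.0"), mindim=2)
column = loadtxt_mindim(StringIO("1.0\n2.0"), mindim=2)

print(single.shape, row.shape, column.shape)
```

A wrapper like this runs after the squeeze, so it cannot tell a one-row file from a one-column file: both arrive as 1-D arrays and np.atleast_2d() turns both into a single row. That limitation suggests the minimum dimensionality would have to be handled inside loadtxt() itself to preserve the column layout.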
On Thu, Jun 24, 2010 at 1:53 PM, Benjamin Root wrote:
I don't know if that is the best way to solve the problem. In that case, you would always get a 2-D array, right? Is that useful for those who have text data as a single column? Maybe a mindim keyword (with None as default) and apply an appropriate "atleast_Nd()" call (or maybe have available an .atleast_nd() function?). But, then what would this mean for structured arrays? One might think that they want at least 2-D, but they really want at least 1-D.
Ben Root
P.S. - Taking this a step further, the functions completely fail in dealing with empty files... In MATLAB, it returns an empty array (matrix?).
I am reviving this "dead" thread to note that I have filed ticket #1562 on the numpy Trac about this issue:

http://projects.scipy.org/numpy/ticket/1562

Ben Root
participants (3)
- Benjamin Root
- Christopher Barker
- Warren Weckesser