Re: [Numpy-discussion] data type specification when using numpy.genfromtxt
Hi Derek!
I tried with the latest version of the Python(x,y) package, which ships numpy 1.6.0. I sent you the data with a reduced number of columns (10) and rows.
b=np.genfromtxt('99Burn2003all_new.csv',delimiter=';',names=True,usecols=tuple(range(10)),dtype=['S10'] + [ float for n in range(9)]) works. If you change usecols=tuple(range(10)) to usecols=range(10), it still works.
b=np.genfromtxt('99Burn2003all_new.csv',delimiter=';',names=True,dtype=None) works.
But b=np.genfromtxt('99Burn2003all_new.csv',delimiter=';',names=True,dtype=['S10'] + [ float for n in range(9)]) didn't work.
I use Python(x,y)-2.6.6.1 with numpy 1.6.0 on a 32-bit Windows system.
Please don't spend too much time on this if it's not a potential problem.
The final thing is, when I try to do this (I want to try missing_values in numpy 1.6.0), it gives an error:
In [33]: import StringIO as StringIO
In [34]: data = "1, 2, 3\n4, 5, 6"
In [35]: np.genfromtxt(StringIO(data), delimiter=",",dtype="int,int,int",missing_values=2)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
D:\data\LaThuile_ancillary\Jim_Randerson_data\<ipython console> in <module>()
TypeError: 'module' object is not callable
I think it must be some problem with my own Python configuration?
Many thanks again,
cheers,
Chao
--
Chao YUE
Laboratoire des Sciences du Climat et de l'Environnement (LSCE-IPSL)
UMR 1572 CEA-CNRS-UVSQ
Batiment 712 - Pe 119
91191 GIF Sur YVETTE Cedex
Tel: (33) 01 69 08 77 30; Fax:01.69.08.77.16
Hi Chao,
by mistake did not reply to the list last time...
On 27.06.2011, at 10:30PM, Chao YUE wrote:
Hi Derek!
I tried with the latest version of the Python(x,y) package, with numpy version 1.6.0. I gave the data to you with reduced columns (10 columns) and rows.
b=np.genfromtxt('99Burn2003all_new.csv',delimiter=';',names=True,usecols=tuple(range(10)),dtype=['S10'] + [ float for n in range(9)]) works. if you change usecols=tuple(range(10)) to usecols=range(10), it still works.
b=np.genfromtxt('99Burn2003all_new.csv',delimiter=';',names=True,dtype=None) works.
but b=np.genfromtxt('99Burn2003all_new.csv',delimiter=';',names=True,dtype=['S10'] + [ float for n in range(9)]) didn't work.
I use Python(x,y)-2.6.6.1 with numpy version as 1.6.0, I use windows 32-bit system.
Please don't spend too much time on this if it's not a potential problem.
OK, dtype=None works on 1.6.0, that's the important bit. From your example file it seems the dtype list does not work without specifying usecols, because your header contains an excess semicolon in the field "Air temperature (High; HMP45C)", thus genfromtxt expects more data columns than actually exist. If you replace the semicolon you should be set (or, if I may suggest, write another header line with catchier field names so you don't have to work with array fields like "b['Water vapor density by LiCor 7500']" ;-). Otherwise both options work for me with python2.6+numpy-1.5.1 as well as 1.6.0/1.6.1rc1.
I am curious though why your python interpreter gave this error message:
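A mismatch of this kind can be spotted before calling genfromtxt by comparing the number of delimited fields in the header with the number in each data row. The following is only a sketch: the helper name find_ragged_rows and the sample text are made up for illustration, it uses the Python 3 io module rather than the Python 2 setup in this thread, and it splits naively on the delimiter without honouring quotes.

```python
import io


def find_ragged_rows(handle, delimiter=";"):
    """Return (header_count, [(line_number, field_count), ...]) for rows
    whose field count differs from the header's.

    Hypothetical helper for illustration; it splits naively on the
    delimiter and does not honour quoting.
    """
    header = handle.readline().rstrip("\r\n")
    expected = len(header.split(delimiter))
    mismatches = []
    # Data lines are numbered from 2 because the header is line 1.
    for line_number, line in enumerate(handle, start=2):
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        count = len(line.split(delimiter))
        if count != expected:
            mismatches.append((line_number, count))
    return expected, mismatches


# Made-up sample: the second header field contains a stray semicolon,
# so the header appears to have four fields while each row has three.
sample = (
    "TIMESTAMP;Air temperature (High; HMP45C);CO2_flux\n"
    "2003-01-01;1.5;-0.4\n"
    "2003-01-02;2.5;-0.1\n"
)
expected, mismatches = find_ragged_rows(io.StringIO(sample))
print(expected, mismatches)
```

If every data row is reported with one field fewer than the header, the header line itself is the likely culprit rather than the data.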
ValueError Traceback (most recent call last)
D:\data\LaThuile_ancillary\Jim_Randerson_data\<ipython console> in <module>()
C:\Python26\lib\site-packages\numpy\lib\npyio.pyc in genfromtxt(fname, dtype, comments, delimiter, skiprows, skip_header, skip_footer, converters, missing, missing_values, filling_values, usecols, names, excludelist, deletechars, replace_space, autostrip, case_sensitive, defaultfmt, unpack, usemask, loose, invalid_raise)
   1449 # Raise an exception ?
   1450 if invalid_raise:
-> 1451 raise ValueError(errmsg)
   1452 # Issue a warning ?
   1453 else:
ValueError
since ipython2.6 on my Mac reported this:
...
   1450 if invalid_raise:
-> 1451 raise ValueError(errmsg)
   1452 # Issue a warning ?
   1453 else:
ValueError: Some errors were detected !
    Line #3 (got 10 columns instead of 11)
    Line #4 (got 10 columns instead of 11)
etc....
which of course provided the right lead to the problem - was the actual errmsg really missing, or did you cut the message too soon?
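When the console truncates the traceback, the per-line details can still be recovered by catching the exception and printing its text. This is a minimal sketch with made-up data, assuming Python 3 (io.StringIO) and a NumPy version whose genfromtxt accepts text streams; it is not the session from this thread.

```python
import io

import numpy as np

# Made-up data: the header has three fields, the rows only two.
text = "a;b;c\n1;2\n3;4\n"

try:
    np.genfromtxt(io.StringIO(text), delimiter=";", names=True)
except ValueError as exc:
    # str(exc) carries the per-line details that a cut-off traceback hides.
    message = str(exc)
    print(message)
else:
    message = None
```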
the final thing is, when I try to do this (I want to try the missing_values in numpy 1.6.0), it gives error:
In [33]: import StringIO as StringIO
In [34]: data = "1, 2, 3\n4, 5, 6"
In [35]: np.genfromtxt(StringIO(data), delimiter=",",dtype="int,int,int",missing_values=2)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
D:\data\LaThuile_ancillary\Jim_Randerson_data\<ipython console> in <module>()
TypeError: 'module' object is not callable
You want to use "from StringIO import StringIO" (or write "StringIO.StringIO(data)"). But again, this will not work the way you expect it to with int/float numbers set as missing_values, and reading to regular arrays. I've tested this on 1.6.1 and the current development branch as well, and the missing_values are only considered for masked arrays. This is not likely to change soon, and may actually be intentional, so to process those numbers on read-in, your best option would be to define a custom set of "converters=conv" as shown in my last mail.
Cheers,
Derek
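The difference between regular and masked output can be illustrated as follows. This is a sketch under stated assumptions: it uses Python 3, where the class is imported from the io module instead of the Python 2 StringIO module, the data are made up, and the behaviour shown is what the thread describes rather than a guarantee for every NumPy release.

```python
# Python 3 location; on Python 2 this was "from StringIO import StringIO".
from io import StringIO

import numpy as np

data = "1, -999, 3\n4, 5, -999"

# Plain ndarray: -999 parses as a valid number, so it is kept as-is.
plain = np.genfromtxt(StringIO(data), delimiter=",", missing_values="-999")

# Masked array: entries whose text matches missing_values are masked.
masked = np.genfromtxt(StringIO(data), delimiter=",",
                       missing_values="-999", usemask=True)

print(plain)
print(masked.mask)
# Masked entries can be turned into nan afterwards.
print(masked.filled(np.nan))
```

Importing the class rather than the module is what avoids the "'module' object is not callable" error: the name StringIO then refers to something that can be called with the data string.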
2011/6/27 Derek Homeier <derek@astro.physik.uni-goettingen.de>
Hi Chao,
this seems to have become quite a number of different issues! But let's make sure I understand what's going on...
Thanks very much for your quick reply. I make a short summary of what I've tried. Actually the ['S10'] + [ float for n in range(48) ] only works when you explicitly specify the columns to be read, and genfromtxt cannot automatically determine the type if you don't specify the type....
In [164]: b=np.genfromtxt('99Burn2003all.csv',delimiter=';',names=True,usecols=tuple(range(49)),dtype=['S10'] + [ float for n in range(48)])
...
But if I use the following, it gives error:
In [171]: b=np.genfromtxt('99Burn2003all.csv',delimiter=';',names=True,dtype=['S10'] + [ float for n in range(48)])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
And the above (without the usecols) did work if you explicitly typed dtype=('S10', float, float....)? That by itself would be quite weird, because the two should be completely equivalent. What happens if you cast the generated list to a tuple - dtype=tuple(['S10'] + [ float for n in range(48)])? If you are using a recent numpy version (1.6.0 or 1.6.1rc1), could you please file a bug report with complete machine info etc.? But I suspect this might be an older version. You should also be able to simply use 'usecols=range(49)' (without the tuple()). Either way, I cannot reproduce this behaviour with the current numpy version.
If I don't specify the dtype, it will not recognize the type of the first column (it displays as nan):
In [172]: b=np.genfromtxt('99Burn2003all.csv',delimiter=';',names=True,usecols=(0,1,2))
In [173]: b
Out[173]:
array([(nan, -999.0, -1.028), (nan, -999.0, -0.40899999999999997),
       (nan, -999.0, 0.16700000000000001), ..., (nan, -999.0, -999.0),
       (nan, -999.0, -999.0), (nan, -999.0, -999.0)],
      dtype=[('TIMESTAMP', '<f8'), ('CO2_flux', '<f8'), ('Net_radiation', '<f8')])
You _do_ have to specify 'dtype=None', since the default is 'dtype=float', as I have remarked in my previous mail. If this does not work, it could be a matter of the numpy version again - there were a number of type conversion issues fixed between 1.5.1 and 1.6.0.
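The effect of the default dtype can be seen on a small made-up sample. This sketch assumes Python 3 and a NumPy version that reads text streams; the column names are borrowed from the output above, the values are invented, and whether the inferred text column comes back as bytes or str depends on the NumPy version.

```python
from io import StringIO

import numpy as np

text = "TIMESTAMP;CO2_flux\n2003-01-01;-1.028\n2003-01-02;0.167\n"

# Default dtype=float: the text column cannot be parsed and becomes nan.
as_float = np.genfromtxt(StringIO(text), delimiter=";", names=True)

# dtype=None: each column's type is inferred from its contents.
inferred = np.genfromtxt(StringIO(text), delimiter=";", names=True, dtype=None)

print(as_float.dtype)
print(inferred.dtype)
```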
Then the final question: the '-999.0' in the data is actually a missing value, but I cannot get it displayed as 'nan' by specifying missing_values as '-999.0'. Whether I set missing_values to -999.0 or use a dictionary, neither works...
...
Even this doesn't work (suppose 2 is our missing_value),
In [184]: data = "1, 2, 3\n4, 5, 6"
In [185]: np.genfromtxt(StringIO(data), delimiter=",",dtype="int,int,int",missing_values=2)
Out[185]:
array([(1, 2, 3), (4, 5, 6)],
      dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4')])
OK, same behaviour here - I found the only tests involving 'valid numbers' as missing_values use masked arrays; for regular ndarrays they seem to be ignored. I don't know if this is by design - the question is, what do you need to do with the data if you know ' -999' always means a missing value? You could certainly manipulate them after reading in... If you have to convert them already on reading in, and using np.mafromtxt is not an option, your best bet may be to define a custom converter like (note you have to include any blanks, if present)
conv = dict(((n, lambda s: s==' -999' and np.nan or float(s)) for n in range(1,49)))
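A variation on this converter idea, written out as a named function, might look like the sketch below. The function name missing_to_nan, the sentinel argument and the sample data are assumptions made for illustration. Unlike the lambda above, it strips blanks before comparing instead of including them in the literal, and it decodes bytes because some NumPy versions pass bytes rather than str to converters.

```python
from io import StringIO

import numpy as np


def missing_to_nan(field, sentinel="-999"):
    """Convert one text field to float, mapping the sentinel to nan."""
    # Some NumPy versions hand converters bytes instead of str.
    if isinstance(field, bytes):
        field = field.decode()
    field = field.strip()  # tolerate blanks around the number
    return np.nan if field == sentinel else float(field)


data = "1.0, -999, 3.0\n-999, 5.0, 6.0"

# One converter per column index, in the same spirit as the dict above.
conv = {n: missing_to_nan for n in range(3)}
cleaned = np.genfromtxt(StringIO(data), delimiter=",", converters=conv)
print(cleaned)
```

Replacing the values after reading, for example with a boolean index on the sentinel, is the other route mentioned above and avoids converters entirely when the columns are already numeric.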
Cheers, Derek
<99Burn2003all_new.csv>
Thanks very much!! You are right. It's because of the extra semicolon in the header row. I have no problems anymore. I thank you for your time.
cheers,
Chao
participants (2)
- Chao YUE
- Derek Homeier