[Tutor] summary stats grouped by month year
questions anon
questions.anon at gmail.com
Wed May 9 06:32:44 CEST 2012
Excellent, thank you for the warning. I will look into dictionaries a lot
more carefully now. I have some simple questions that I would like to ask
straight away, but I will try to figure a few things out on my own first.
Thanks again!!
On Wed, May 9, 2012 at 11:16 AM, Andre' Walker-Loud <walksloud at gmail.com> wrote:
> dear anonymous questioner,
>
> > Excellent, thank you so much. I don't understand all the steps at this
> > stage so I will need some time to go through it carefully but it works
> > perfectly.
> > Thanks again!
>
> Words of caution: you will notice that, in the way I filled the data
> array, I assumed what the input file would look like (the entries come
> chronologically and all data are present, with no missing years or months).
> While this might be safe for a data file you constructed, and one that is
> short enough to read, it is generally a very bad habit, hence my
> encouraging you to figure out how to make dictionaries.
>
> Imagine how you would read a file that you got from a colleague, which was
> so long that you cannot check by eye that it is intact, or perhaps you
> know that in 1983 the month of June is missing, along with some other holes
> in the data. And perhaps your colleague decided that, since those data are
> missing, they would simply not be written to the file, so instead of having
>
> 1983 May 2.780009889
> 1983 Jun nan
> 1983 Jul 0.138150181
>
> you have
>
> 1983 May 2.780009889
> 1983 Jul 0.138150181
>
> now the little loop I showed you will fail to place the data in the
> correct place in your numpy array, and you will get all your averaging and
> other analysis wrong.
>
>
> Instead, you can create dictionaries for the years and months. Then, when
> you read in the data, you can use them to place each value in the
> right spot:
>
> # years and months are your dictionaries
> years = {'1972':0,'1973':1,....}
> months = {'Jan':0,'Feb':1,...,'Dec':11}
> # skip the "YEAR MONTH MeanRain" header line
> data = open(your_data).readlines()[1:]
> for line in data:
>     year,month,dat = line.split()
>     y = int(('%('+year+')s') % years)
>     m = int(('%('+month+')s') % months)
>     rain_fall[y,m] = float(dat)
>
> [also - if someone knows how to use the dictionaries more appropriately
> here - please chime in]
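
On the question above about using the dictionaries more directly: a dictionary can be indexed with its key, so the round trip through string formatting and int() should not be needed. Below is a minimal sketch, assuming the same three-column lines as in the thread. The sample lines and the month_names list are illustrative assumptions, not data from the original file.

```python
import numpy as np

# Illustrative sample lines (assumed); June 1983 is absent on purpose.
lines = [
    "1983 May 2.780009889",
    "1983 Jul 0.138150181",
    "1984 May 1.5",
]

month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Build the lookup tables with enumerate instead of typing them out.
months = {name: m for m, name in enumerate(month_names)}
year_labels = sorted({ln.split()[0] for ln in lines})
years = {year: y for y, year in enumerate(year_labels)}

rain_fall = np.zeros([len(years), len(months)])
for line in lines:
    year, month, dat = line.split()
    # Index the dictionaries directly with the key.
    rain_fall[years[year], months[month]] = float(dat)
```

With this layout, rain_fall[years['1983'], months['May']] picks out a single entry. The June 1983 slot is never written, so it keeps its initial zero.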
>
> Then you also have to think about what to do with the missing entries.
> The initialization
>
> rain_fall = np.zeros([n_years,n_months])
>
> will have placed zeros everywhere, and if the data is missing, the zero won't
> get overwritten. That will make your analysis bogus as well, so you have
> to walk through and replace the zeros with something else, like 'nan'. And
> then you could think about replacing missing data by averages, e.g. replace
> a missing June entry by the average over all the non-missing June data.
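
One way to sketch the missing-data idea above is to start from an array of NaN instead of zeros, so that a genuine zero measurement cannot be mistaken for a hole, and then average with np.nanmean, which ignores NaN entries. The array size and the values below are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical 3-year x 12-month table; NaN marks "no measurement".
n_years, n_months = 3, 12
rain_fall = np.full([n_years, n_months], np.nan)

# Fill in two June values (column 5); the middle year has no June entry.
rain_fall[0, 5] = 1.0
rain_fall[2, 5] = 3.0

# Mean over the years for June, ignoring the missing entry.
june_mean = np.nanmean(rain_fall[:, 5])

# Replace the missing June entry by that mean.
june = rain_fall[:, 5]              # a view into rain_fall, not a copy
june[np.isnan(june)] = june_mean
```

Whether filling holes with the monthly mean is appropriate depends on the analysis. For plain averages, leaving the NaN in place and using np.nanmean throughout may be enough.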
>
>
> I was just hoping to give you a working example that you could turn into
> a functioning, well-thought-out program that can handle the exceptions which
> will arise (like missing data, or a data file with a string where a float
> should be, etc.).
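
A small sketch of the kind of defensive parsing described above, assuming the three-column "year month value" layout from the thread. The function name parse_line is hypothetical, and returning None for a bad line is just one possible policy.

```python
def parse_line(line):
    """Return (year, month, value), or None if the line is malformed."""
    fields = line.split()
    if len(fields) != 3:
        return None
    year, month, dat = fields
    try:
        return year, month, float(dat)
    except ValueError:
        # A string where a float should be, e.g. a header or a stray word.
        return None


good = parse_line("1972 Jan 12.7083199")
bad = parse_line("1972 Feb missing")
header = parse_line("YEAR MONTH MeanRain")
```

Here good is a tuple and the other two are None, so the reading loop can skip or log such lines instead of stopping on an exception. Note that float() accepts the string 'nan', so a line like "1983 Jun nan" still parses.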
>
>
> Have fun!
>
> Andre
>
>
>
>
> On May 8, 2012, at 5:48 PM, questions anon wrote:
>
> > On Tue, May 8, 2012 at 4:41 PM, Andre' Walker-Loud <walksloud at gmail.com> wrote:
> > Hello anonymous questioner,
> >
> > first comment - you may want to look into hdf5 data structures
> >
> > http://www.hdfgroup.org/HDF5/
> >
> > and the python tools to play with them
> >
> > pytables - http://www.pytables.org/moin
> > h5py - http://code.google.com/p/h5py/
> >
> > I have personally used pytables more, but not for any good reason. If
> > you happen to have the Enthought python distribution, these come with the
> > package, as well as an installation of hdf5.
> >
> > hdf5 is a very nice file format for storing large amounts of (binary) data
> > with descriptive meta-data. Also, numpy plays very nicely with
> > hdf5. Given all your questions here, I suspect you would benefit from
> > learning about these and learning to play with them.
> >
> > Now to your specific question.
> >
> > > I would like to calculate summary statistics of rainfall based on year
> > > and month.
> > > I have the data in a text file (although could put in any format if it
> > > helps) extending over approx 40 years:
> > > YEAR MONTH MeanRain
> > > 1972 Jan 12.7083199
> > > 1972 Feb 14.17007142
> > > 1972 Mar 14.5659302
> > > 1972 Apr 1.508517302
> > > 1972 May 2.780009889
> > > 1972 Jun 1.609619287
> > > 1972 Jul 0.138150181
> > > 1972 Aug 0.214346148
> > > 1972 Sep 1.322102228
> > >
> > > I would like to be able to calculate the total rain annually:
> > >
> > > YEAR Annualrainfall
> > > 1972 400
> > > 1973 300
> > > 1974 350
> > > ....
> > > 2011 400
> > >
> > > and also the monthly mean rainfall for all years:
> > >
> > > MONTH MonthlyMeanRain
> > > Jan 13
> > > Feb 15
> > > Mar 8
> > > .....
> > > Dec 13
> > >
> > >
> > > Is this something I can easily do?
> >
> > Yes - this should be very easy. Imagine importing all this data into a
> > numpy array:
> >
> > ===
> > import numpy as np
> >
> > # read all lines, skipping the "YEAR MONTH MeanRain" header
> > data = open(your_data).readlines()[1:]
> > years = []
> > for line in data:
> >     if line.split()[0] not in years:
> >         years.append(line.split()[0])
> > months = ['Jan','Feb',....,'Dec']
> >
> > # one row per year, one column per month
> > rain_fall = np.zeros([len(years),len(months)])
> > for y,year in enumerate(years):
> >     for m,month in enumerate(months):
> >         # assumes 12 lines per year, in chronological order
> >         rain_fall[y,m] = float(data[ y * 12 + m].split()[2])
> >
> > # to get average per year - average over months - axis=1
> > print(np.mean(rain_fall,axis=1))
> >
> > # to get average per month - average over years - axis=0
> > print(np.mean(rain_fall,axis=0))
> >
> > ===
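
The original question asked for the total rain per year rather than the mean. With the same array layout (one row per year, one column per month), that is a sum over axis=1. A minimal sketch, using made-up values rather than the poster's data:

```python
import numpy as np

# Illustrative 2-year x 12-month array (values assumed, not from the thread).
years = ['1972', '1973']
rain_fall = np.array([[1.0] * 12,
                      [2.0] * 12])

# Total per year: sum over the months (axis=1).
annual_total = np.sum(rain_fall, axis=1)

for year, total in zip(years, annual_total):
    print(year, total)
```

If the array contains NaN for missing months, np.nansum can be used in place of np.sum, with the caveat that a year with holes will then have an underestimated total.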
> >
> > Now you should imagine doing this by setting up dictionaries, so that
> > you can request an average for the year 1972 or for the month of March.
> > That is why I used the enumerate function before to walk the indices, so
> > that you can imagine building the dictionary simultaneously.
> >
> > years = {'1972':0, '1973':1, ....}
> > months = {'Jan':0,'Feb':1,...'Dec':11}
> >
> > Then you can access and store the data in the array using these
> > dictionaries.
> >
> > print(rain_fall[int('%(1984)s' % years), int('%(Mar)s' % months)])
> >
> >
> > Andre
> >
> >
> >
> >
> >
> > > I have started by simply importing the text file but data is not
> > > represented as time so that is probably my first problem and then I am not
> > > sure how to group them by month/year.
> > >
> > > textfile=r"textfile.txt"
> > > f=np.genfromtxt(textfile,skip_header=1)
> > >
> > > Any feedback will be greatly appreciated.
> > >
> >
> >
>
>