[Tutor] summary stats grouped by month year

questions anon questions.anon at gmail.com
Wed May 9 06:32:44 CEST 2012


Excellent, thank you for the warning. I will look into dictionaries a lot
more carefully now. I have some simple questions that I would like to ask
straight away, but I will try to figure a few things out on my own first.

Thanks again!!

On Wed, May 9, 2012 at 11:16 AM, Andre' Walker-Loud <walksloud at gmail.com> wrote:

> dear anonymous questioner,
>
> > Excellent, thank you so much. I don't understand all the steps at this
> stage so I will need some time to go through it carefully but it works
> perfectly.
> > Thanks again!
>
> A word of caution - you will notice that in the way I read the data
> file, I assumed what the input file would look like (the entries come
> chronologically and all data are present - no missing years or months).
> While this might be safe for a data file you constructed, and one that is
> short enough to read by eye, it is generally a very bad habit - hence my
> encouraging you to figure out how to make dictionaries.
>
> Imagine how you would read a file that you got from a colleague, which was
> so long you cannot check by eye that it is intact, or perhaps you
> know that in 1983 the month of June is missing, along with some other holes
> in the data.  And perhaps your colleague decided, since those data are
> missing, simply not to write them to the file, so instead of having
>
> 1983 May 2.780009889
> 1983 Jun nan
> 1983 Jul 0.138150181
>
> you have
>
> 1983 May 2.780009889
> 1983 Jul 0.138150181
>
> now the little loop I showed you will fail to place the data in the
> correct place in your numpy array, and you will get all your averaging and
> other analysis wrong.
>
>
> Instead - you can create dictionaries for the years and months.  Then when
> you read in the data, you can grab this info to correctly place it in the
> right spot
>
> # years and months are your dictionaries
> years = {'1972':0,'1973':1,....}
> months = {'Jan':0,'Feb':1,...,'Dec':11}
> data = open(your_data).readlines()
> for line in data:
>        year,month,dat = line.split()
>        y = int(('%('+year+')s') % years)
>        m = int(('%('+month+')s') % months)
>        rain_fall[y,m] = float(dat)
>
> [also - if someone knows how to use the dictionaries more appropriately
> here - please chime in]
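
One way to use the dictionaries more directly: a dictionary can be indexed by its key, so years[year] already returns the integer index and no string formatting is needed. The following is only a minimal sketch, assuming the three-column layout shown above. The sample lines, the two-year range and the names `sample` and `month_names` are made up for illustration.

```python
import numpy as np

# Hypothetical sample standing in for the lines read from the data file.
sample = [
    "1972 Jan 12.7083199",
    "1972 Feb 14.17007142",
    "1973 Feb 1.5",
]

month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Lookup tables: label -> index into the array.
years = {'1972': 0, '1973': 1}
months = {name: i for i, name in enumerate(month_names)}

rain_fall = np.zeros([len(years), len(months)])
for line in sample:
    year, month, dat = line.split()
    # Index the dictionaries directly with the key.
    rain_fall[years[year], months[month]] = float(dat)
```

A label that is not in the dictionary raises a KeyError here, which is probably what you want: it flags an unexpected year or month instead of silently misplacing the value.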
>
> then you also have to think about what to do with the empty data sets.
>  The initialization
>
> rain_fall = np.zeros([n_years,n_months])
>
> will have placed zeros everywhere - and if the data is missing - it won't
> get re-written.  So that will make your analysis bogus also - so you have
> to walk through and replace the zeros with something else, like 'nan'.  And
> then you could think about replacing missing data by averages - eg. replace
> a missing June entry by the average over all the non-zero June data.
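
A possible variation on the above, sketched with made-up numbers: start the array filled with nan instead of zeros, so that a month with genuinely zero rainfall is not mistaken for a missing one, and then fill each hole with the mean of that month over the years that do have data. The 3 x 2 array and the name `filled` are assumptions for illustration only.

```python
import numpy as np

# Hypothetical 3-year x 2-month example; nan marks a month with no data.
n_years, n_months = 3, 2
rain_fall = np.full([n_years, n_months], np.nan)
rain_fall[0] = [10.0, 4.0]
rain_fall[1, 0] = 14.0          # second month of this year is missing
rain_fall[2] = [12.0, 6.0]

# Mean of each month over the years, ignoring the missing entries.
monthly_mean = np.nanmean(rain_fall, axis=0)

# Replace each missing entry by the mean of its month (column).
filled = np.where(np.isnan(rain_fall), monthly_mean, rain_fall)
```

Whether filling with the monthly mean is appropriate depends on the analysis; np.nanmean and np.nansum on the unfilled array simply leave the missing months out.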
>
>
> I was just hoping to give you a working example that you could use to build
> a functioning, well-thought-out program that can handle the exceptions which
> will arise (like missing data, or a data file with a string where a float
> should be, etc.)
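
As a rough illustration of handling such exceptions, the sketch below checks each line before it is used. The function name `parse_line` and the sample lines are hypothetical, and returning None for a bad line is just one possible policy (you might prefer to log it or stop).

```python
def parse_line(line):
    """Return (year, month, value), or None if the line is malformed."""
    fields = line.split()
    if len(fields) != 3:
        # Wrong number of columns, e.g. the value is missing entirely.
        return None
    year, month, dat = fields
    try:
        value = float(dat)
    except ValueError:
        # A string where a float should be.
        return None
    return year, month, value


lines = ["1983 May 2.780009889", "1983 Jun missing", "1983 Jul"]
parsed = [parse_line(line) for line in lines]
```

Note that float('nan') is accepted by float(), so a file that writes nan for missing months still parses and the nan can be dealt with later.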
>
>
> Have fun!
>
> Andre
>
>
>
>
> On May 8, 2012, at 5:48 PM, questions anon wrote:
>
> > On Tue, May 8, 2012 at 4:41 PM, Andre' Walker-Loud <walksloud at gmail.com>
> wrote:
> > Hello anonymous questioner,
> >
> > first comment - you may want to look into hdf5 data structures
> >
> > http://www.hdfgroup.org/HDF5/
> >
> > and the python tools to play with them
> >
> > pytables - http://www.pytables.org/moin
> > h5py - http://code.google.com/p/h5py/
> >
> > I have personally used pytables more - but not for any good reason.  If
> you happen to have the Enthought python distribution - these come with the
> package, as well as an installation of hdf5
> >
> > hdf5 is a very nice file format for storing large amounts of data
> (binary) with descriptive meta-data.  Also, numpy plays very nice with
> hdf5.  Given all your questions here, I suspect you would benefit from
> learning about these and learning to play with them.
> >
> > Now to your specific question.
> >
> > > I would like to calculate summary statistics of rainfall based on year
> and month.
> > > I have the data in a text file (although could put in any format if it
> helps) extending over approx 40 years:
> > > YEAR MONTH    MeanRain
> > > 1972 Jan    12.7083199
> > > 1972 Feb    14.17007142
> > > 1972 Mar    14.5659302
> > > 1972 Apr    1.508517302
> > > 1972 May    2.780009889
> > > 1972 Jun    1.609619287
> > > 1972 Jul    0.138150181
> > > 1972 Aug    0.214346148
> > > 1972 Sep    1.322102228
> > >
> > > I would like to be able to calculate the total rain annually:
> > >
> > > YEAR   Annualrainfall
> > > 1972    400
> > > 1973    300
> > > 1974    350
> > > ....
> > > 2011     400
> > >
> > > and also the monthly mean rainfall for all years:
> > >
> > > MONTH  MonthlyMeanRain
> > > Jan      13
> > > Feb      15
> > > Mar       8
> > > .....
> > > Dec       13
> > >
> > >
> > > Is this something I can easily do?
> >
> > Yes - this should be very easy.  Imagine importing all this data into a
> numpy array
> >
> > ===
> > import numpy as np
> >
> > # skip the header line (YEAR MONTH MeanRain)
> > data = open(your_data).readlines()[1:]
> > years = []
> > for line in data:
> >        if line.split()[0] not in years:
> >                years.append(line.split()[0])
> > months = ['Jan','Feb',....,'Dec']
> >
> > rain_fall = np.zeros([len(years),len(months)])
> > for y,year in enumerate(years):
> >        for m,month in enumerate(months):
> >                rain_fall[y,m] = float(data[ y * 12 + m].split()[2])
> >
> > # to get average per year - average over months - axis=1
> > print(np.mean(rain_fall,axis=1))
> >
> > # to get average per month - average over years - axis=0
> > print(np.mean(rain_fall,axis=0))
> >
> > ===
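
Since the original question asked for the total rain per year rather than the yearly mean, the same axis argument works with np.sum. This is a small sketch with a made-up 2-year x 3-month array standing in for the full one; the numbers are not real data.

```python
import numpy as np

# Hypothetical 2-year x 3-month array in place of the full ~40 x 12 one.
years = ['1972', '1973']
months = ['Jan', 'Feb', 'Mar']
rain_fall = np.array([[12.0, 14.0, 10.0],
                      [ 9.0, 11.0, 16.0]])

# Total per year: sum over the months (axis=1).
annual_total = np.sum(rain_fall, axis=1)
# Mean per month: average over the years (axis=0).
monthly_mean = np.mean(rain_fall, axis=0)

# Print the two tables with their labels.
for year, total in zip(years, annual_total):
    print(year, total)
for month, mean in zip(months, monthly_mean):
    print(month, mean)
```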
> >
> > now you should imagine doing this by setting up dictionaries, so that
> you can request an average for year 1972 or for month March.  That is why I
> used the enumerate function before to walk the indices - so that you can
> imagine building the dictionary simultaneously.
> >
> > years = {'1972':0, '1973':1, ....}
> > months = {'Jan':0,'Feb':1,...'Dec':11}
> >
> > then you can access and store the data to the array using these
> dictionaries.
> >
> > print(rain_fall[int('%(1984)s' % years), int('%(Mar)s' % months)])
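
A sketch of how the dictionaries could be built with enumerate and then used to request the average for one year or one month. The short lists and the 2 x 3 array are made-up placeholders, and the names `year_list` and `month_list` are assumptions for illustration.

```python
import numpy as np

year_list = ['1972', '1973']
month_list = ['Jan', 'Feb', 'Mar']

# Build the dictionaries with enumerate: label -> array index.
years = {year: y for y, year in enumerate(year_list)}
months = {month: m for m, month in enumerate(month_list)}

rain_fall = np.array([[12.0, 14.0, 10.0],
                      [ 9.0, 11.0, 16.0]])

# Average over the months of 1972 (one row of the array).
mean_1972 = rain_fall[years['1972'], :].mean()
# Average over all years for March (one column of the array).
mean_mar = rain_fall[:, months['Mar']].mean()
```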
> >
> >
> > Andre
> >
> >
> >
> >
> >
> > > I have started by simply importing the text file but data is not
> represented as time so that is probably my first problem and then I am not
> sure how to group them by month/year.
> > >
> > > textfile=r"textfile.txt"
> > > f=np.genfromtxt(textfile,skip_header=1)
> > >
> > > Any feedback will be greatly appreciated.
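
On the genfromtxt attempt: with the default float dtype the month names cannot be converted and come back as nan. One possible way around this, sketched below, is to let genfromtxt choose a type per column and take the field names from the header line. The in-memory sample text is made up so that the example runs without a file.

```python
import io
import numpy as np

# Hypothetical sample in the same layout as the text file.
text = """YEAR MONTH    MeanRain
1972 Jan    12.7083199
1972 Feb    14.17007142
1973 Jan    10.0
"""

# names=True takes the field names from the header line; dtype=None lets
# genfromtxt pick a type per column (int, string, float here).
f = np.genfromtxt(io.StringIO(text), names=True, dtype=None, encoding="utf-8")

# Group by month with a boolean mask on the MONTH field.
jan_mean = f['MeanRain'][f['MONTH'] == 'Jan'].mean()
# Group by year the same way.
total_1972 = f['MeanRain'][f['YEAR'] == 1972].sum()
```

For the real file, the path would go in place of io.StringIO(text).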
> > >
> > > _______________________________________________
> > > Tutor maillist  -  Tutor at python.org
> > > To unsubscribe or change subscription options:
> > > http://mail.python.org/mailman/listinfo/tutor
> >
> >
>
>