[Tutor] print date and skip any repeats

Dave Angel davea at davea.name
Thu Oct 2 19:35:20 CEST 2014


On 09/24/2014 05:19 AM, questions anon wrote:
> Ok, I am continuing to get stuck. I think I was asking the wrong question
> so I have posted the entire script (see below).
> What I really want to do is find the daily maximum for a dataset (e.g.
> Temperature) that is contained in monthly netcdf files where the data are
> separated by hour.
> The major steps are:
> open monthly netcdf files and use the timestamp to extract the hourly data
> for the first day.
> Append the data for each hour of that day to a list, concatenate, find max
> and plot
> then loop through and do the following day.
>
> I can do some of the steps separately but I run into trouble in working out
> how to loop through each hour and separate the data into each day and then
> repeat all the steps for the following day.
>
> Any feedback will be greatly appreciated!
>
>

This is NOT the whole program.  You don't seem to define ncvariablename, 
Dataset, np, Basemap, plt, and probably others.
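
For reference, the names you use later suggest imports along these
lines; this is guesswork from the call sites, not something shown in
your post:

    import os
    from datetime import datetime as dt      # used as dt.strptime(...)

    import numpy as np
    import pytz
    from netCDF4 import Dataset
    from netcdftime import utime
    from mpl_toolkits.basemap import Basemap
    import matplotlib.pyplot as plt

    # Even with these, ncvariablename, MainFolder, OutputFolder,
    # shapefile1, LON and LAT are still undefined.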

>
> oneday=[]
> all_the_dates=[]
> onedateperday=[]
>
>
> #open folders and open relevant netcdf files that are monthly files containing hourly data across a region
> for (path, dirs, files) in os.walk(MainFolder):
>          for ncfile in files:

If there is more than one file, or more than one directory, there is 
no guarantee that this will give you the files in the order you want. 
You may need to do some sorting to make it work out. 
But after further study, you seem to assume you'll read everything in 
from all the files, and then process it.  Have you considered how big 
that might get, and whether you'll have enough memory to make it viable?
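
For instance, sorting what os.walk hands you forces a deterministic
order; sorting dirs in place also controls which subdirectories get
walked first.  A rough sketch, using your MainFolder and
ncvariablename:

    import os

    for path, dirs, files in os.walk(MainFolder):
        dirs.sort()                     # walk subdirectories in sorted order
        for ncfile in sorted(files):    # and visit files in sorted order
            if ncfile.endswith(ncvariablename + '.nc'):
                pass                    # process the file as before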

>              if ncfile.endswith(ncvariablename+'.nc'):
>                  print "dealing with ncfiles:", path+ncfile
>                  ncfile=os.path.join(path,ncfile)
>                  ncfile=Dataset(ncfile, 'r+', 'NETCDF4')
>                  variable=ncfile.variables[ncvariablename][:,:,:]
>                  TIME=ncfile.variables['time'][:]

You've given the name ncfile three different meanings in four lines: 
the bare filename, the full path, and then the open Dataset object. 
That reuse makes the code hard to follow, and if either of the reads 
above raises an exception, the close() below never runs.  Variables 
are cheap, use lots of them, and give them good names.  In this 
particular case, maybe using a with statement would make sense.
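
For example, something like this keeps each meaning under its own
name and guarantees the close even if a read blows up.
contextlib.closing only assumes the object has a close() method
(recent versions of netCDF4 can also be used directly in a with
statement, but I haven't checked which ones):

    from contextlib import closing

    nc_path = os.path.join(path, ncfile)    # ncfile keeps meaning "the name"
    with closing(Dataset(nc_path, 'r')) as dataset:   # read-only: nothing is written
        variable = dataset.variables[ncvariablename][:, :, :]
        TIME = dataset.variables['time'][:]
    # dataset is closed here even if a read raised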

>                  ncfile.close()
>
>                  #combine variable and time so I can make calculations based on hours or days
>                  for v, time in zip((variable[:,:,:]),(TIME[:])):
>
>                      cdftime=utime('seconds since 1970-01-01 00:00:00')
>                      ncfiletime=cdftime.num2date(time)
>                      timestr=str(ncfiletime)
>                      utc_dt = dt.strptime(timestr, '%Y-%m-%d %H:%M:%S')
>                      au_tz = pytz.timezone('Australia/Sydney')
>                      local_dt = utc_dt.replace(tzinfo=pytz.utc).astimezone(au_tz)
>                      strp_local=local_dt.strftime('%Y-%m-%d_%H') #strips time down to date and hour
>                      local_date=local_dt.strftime('%Y-%m-%d') #strips time down to just date
>
>                      all_the_dates.append(local_date)
>
> #make a list that strips down the dates to only have one date per day (rather than one for each hour)
> onedateperday = sorted ( list (set (all_the_dates)))
>
> #loop through each day and combine data (v) where the hours occur on the same day
> for days in onedateperday:
>      if strp_local.startswith(days):
>          oneday.append(v)
>

And just where does v come from?  You're no longer in the loop that 
says "for v, time in ...".

> big_array=np.ma.dstack(oneday) #concatenate data
> v_DailyMax=big_array.max(axis=2) # find max
>
> #then go on to plot v_DailyMax for each day
> map = Basemap(projection='merc',llcrnrlat=-40,urcrnrlat=-33,
>          llcrnrlon=139.0,urcrnrlon=151.0,lat_ts=0,resolution='i')
> map.drawcoastlines()
> map.drawstates()
> map.readshapefile(shapefile1, 'REGIONS')
> x,y=map(*np.meshgrid(LON,LAT))
> plottitle=ncvariablename+'v_DailyMax'+days
> cmap=plt.cm.jet
> CS = map.contourf(x,y,v_DailyMax, 15, cmap=cmap)
> l,b,w,h =0.1,0.1,0.8,0.8
> cax = plt.axes([l+w+0.025, b, 0.025, h])
> plt.colorbar(CS,cax=cax, drawedges=True)
> plt.savefig((os.path.join(OutputFolder, plottitle+'.png')))
> plt.show()
> plt.close()
>
>

It looks to me like you've jumbled up a bunch of different code 
fragments.  Have you learned about writing functions yet?  Or 
generators?  By making everything global, you're masking a lot of the 
mistakes you've made.  For example, v has a value, but only the data 
from the last record of the last file.
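
To show what I mean by functions and generators, something shaped
like this would keep each piece small and testable on its own.
to_local_date just wraps your own utime/pytz conversion, and
hourly_frames is a made-up name; both are sketches, not tested code:

    from contextlib import closing

    def to_local_date(seconds):
        # Your utime/pytz conversion, returning just 'YYYY-MM-DD'.
        cdftime = utime('seconds since 1970-01-01 00:00:00')
        timestr = str(cdftime.num2date(seconds))
        utc_dt = dt.strptime(timestr, '%Y-%m-%d %H:%M:%S')
        local_dt = utc_dt.replace(tzinfo=pytz.utc).astimezone(
            pytz.timezone('Australia/Sydney'))
        return local_dt.strftime('%Y-%m-%d')

    def hourly_frames(main_folder, variable_name):
        # Yield (local_date, 2-D array), one pair per hourly record, in file order.
        for path, dirs, files in os.walk(main_folder):
            dirs.sort()
            for name in sorted(files):
                if not name.endswith(variable_name + '.nc'):
                    continue
                with closing(Dataset(os.path.join(path, name), 'r')) as ds:
                    data = ds.variables[variable_name][:, :, :]
                    times = ds.variables['time'][:]
                for frame, t in zip(data, times):
                    yield to_local_date(t), frame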

I think you need to back off and design this, without worrying much at 
first about how to code it.  It's generally a bad idea to try to collect 
all data before processing it and winnowing it down.  If you know 
something about the filenaming (or directory naming), and can assume 
that all data from one day will be processed before encountering the 
next day, you can greatly simplify things.  Likewise if one hour is 
complete before the next begins.

Is there one file per day?  Is there one record per hour and are they in 
order?  If some such assumptions can be made, then you might be able to 
factor the problem down to a set of less grandiose functions.
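
For instance, if the stream of frames really does arrive day by day,
itertools.groupby over a generator like the hourly_frames sketch
above reduces the whole job to a few lines:

    from itertools import groupby
    from operator import itemgetter

    for day, group in groupby(hourly_frames(MainFolder, ncvariablename),
                              key=itemgetter(0)):
        frames = [frame for _, frame in group]
        v_DailyMax = np.ma.dstack(frames).max(axis=2)
        # ...plot v_DailyMax, labelled with this day's date...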

Whoever wrote Dataset() has produced some data from a single file that 
you can work with.  What does that data look like, and how much of it 
do you really need to keep past the opening of the next one?

Sometimes when the data is large and not sorted, it pays off to make two 
or more passes through it.  In the extreme, you might take one pass to 
make a sorted list of all the hours involved in all the files.  Then for 
each of those hours, you'd take an additional pass looking for data 
related to that particular hour.  Of course if you're talking years, 
you'll be opening each file many thousands of times.  So if you know 
ANYTHING about the sorting of the data, you can make that more efficient.
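
A two-pass sketch, again reusing the hypothetical hourly_frames,
might look something like this; note that pass 1 could be made much
cheaper by reading only each file's 'time' variable:

    # Pass 1: just collect the set of days present (cheap on memory, not on I/O).
    all_days = sorted(set(day for day, _ in
                          hourly_frames(MainFolder, ncvariablename)))

    # Pass 2: one full re-scan per day; only that day's frames live in memory.
    for wanted in all_days:
        frames = [frame for day, frame in
                  hourly_frames(MainFolder, ncvariablename)
                  if day == wanted]
        v_DailyMax = np.ma.dstack(frames).max(axis=2)
        # ...plot v_DailyMax for this day...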

I can't help you at all with numpy (presumably what np stands for), or 
the plotting stuff.



-- 
DaveA

