[Tutor] print date and skip any repeats
Dave Angel
davea at davea.name
Thu Oct 2 19:35:20 CEST 2014
On 09/24/2014 05:19 AM, questions anon wrote:
> Ok, I am continuing to get stuck. I think I was asking the wrong question
> so I have posted the entire script (see below).
> What I really want to do is find the daily maximum for a dataset (e.g.
> Temperature) that is contained in monthly netcdf files where the data are
> separated by hour.
> The major steps are:
> open monthly netcdf files and use the timestamp to extract the hourly data
> for the first day.
> Append the data for each hour of that day to a list, concatenate, find max
> and plot
> then loop through and do the following day.
>
> I can do some of the steps separately but I run into trouble in working out
> how to loop through each hour and separate the data into each day and then
> repeat all the steps for the following day.
>
> Any feedback will be greatly appreciated!
>
>
This is NOT the whole program. You don't seem to define ncvariablename,
Dataset, np, Basemap, plt, and probably others.
>
> oneday=[]
> all_the_dates=[]
> onedateperday=[]
>
>
> #open folders and open relevant netcdf files that are monthly files containing hourly data across a region
> for (path, dirs, files) in os.walk(MainFolder):
> for ncfile in files:
If there is more than one file, or more than one directory, there is no
guarantee that this will give you the files in the order you are likely
to want. You may need to do some sorting to make it work out.
But after further study, you seem to assume you'll read everything in
from all the files, and then process it. Have you studied how big that
might get, and whether you'll have enough memory to make it viable?
> if ncfile.endswith(ncvariablename+'.nc'):
> print "dealing with ncfiles:", path+ncfile
> ncfile=os.path.join(path,ncfile)
> ncfile=Dataset(ncfile, 'r+', 'NETCDF4')
> variable=ncfile.variables[ncvariablename][:,:,:]
> TIME=ncfile.variables['time'][:]
Since you've given the name ncfile three different meanings in four
lines (a filename, a path, and an open dataset), it's much harder than
it needs to be to see which object this close() applies to. Variables
are cheap, use lots of them, and give them good names. In this
particular case, maybe using a with statement would make sense.
> ncfile.close()
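Roughly like this, sketched with a stand-in class since I don't have netCDF4 here (FakeDataset and read_variable are made-up names; with a real netCDF4.Dataset the contextlib.closing pattern is the same, and newer versions support with directly):

```python
from contextlib import closing

class FakeDataset:
    """Stand-in for netCDF4.Dataset: has a variables mapping and close()."""
    def __init__(self, path):
        self.path = path
        self.variables = {"time": [0, 3600], "T_SFC": [[1.0], [2.0]]}
        self.closed = False
    def close(self):
        self.closed = True

def read_variable(nc_path, varname):
    # Distinct names: nc_path is a string, dataset is the open file object.
    # closing() guarantees dataset.close() runs even if an exception occurs.
    with closing(FakeDataset(nc_path)) as dataset:
        return dataset.variables[varname], dataset.variables["time"]
```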
>
> #combine variable and time so I can make calculations based on hours or days
> for v, time in zip((variable[:,:,:]),(TIME[:])):
>
> cdftime=utime('seconds since 1970-01-01 00:00:00')
> ncfiletime=cdftime.num2date(time)
> timestr=str(ncfiletime)
> utc_dt = dt.strptime(timestr, '%Y-%m-%d %H:%M:%S')
> au_tz = pytz.timezone('Australia/Sydney')
> local_dt = utc_dt.replace(tzinfo=pytz.utc).astimezone(au_tz)
> strp_local=local_dt.strftime('%Y-%m-%d_%H') #strips time down to date and hour
> local_date=local_dt.strftime('%Y-%m-%d') #strips time down to just date
>
> all_the_dates.append(local_date)
>
> #make a list that strips down the dates to only have one date per day (rather than one for each hour)
> onedateperday = sorted ( list (set (all_the_dates)))
>
> #loop through each day and combine data (v) where the hours occur on the same day
> for days in onedateperday:
> if strp_local.startswith(days):
> oneday.append(v)
>
And just where does v come from? You're no longer in the loop that says
"for v, time in..."
> big_array=np.ma.dstack(oneday) #concatenate data
> v_DailyMax=big_array.max(axis=2) # find max
>
> #then go on to plot v_DailyMax for each day
> map = Basemap(projection='merc',llcrnrlat=-40,urcrnrlat=-33, llcrnrlon=139.0,urcrnrlon=151.0,lat_ts=0,resolution='i')
> map.drawcoastlines()
> map.drawstates()
> map.readshapefile(shapefile1, 'REGIONS')
> x,y=map(*np.meshgrid(LON,LAT))
> plottitle=ncvariablename+'v_DailyMax'+days
> cmap=plt.cm.jet
> CS = map.contourf(x,y,v_DailyMax, 15, cmap=cmap)
> l,b,w,h =0.1,0.1,0.8,0.8
> cax = plt.axes([l+w+0.025, b, 0.025, h])
> plt.colorbar(CS,cax=cax, drawedges=True)
> plt.savefig((os.path.join(OutputFolder, plottitle+'.png')))
> plt.show()
> plt.close()
>
>
It looks to me like you've jumbled up a bunch of different code
fragments. Have you learned about writing functions yet? Or
generators? By making everything global, you're masking a lot of the
mistakes you've made. For example, v has a value, but only the data
from the last record of the last file.
I think you need to back off and design this, without worrying much at
first about how to code it. It's generally a bad idea to try to collect
all data before processing it and winnowing it down. If you know
something about the filenaming (or directory naming), and can assume
that all data from one day will be processed before encountering the
next day, you can greatly simplify things. Likewise if one hour is
complete before the next begins.
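If you can rely on the hours arriving day by day, itertools.groupby does the "one day at a time" bookkeeping for you, so you never hold more than one day's records in memory. A rough sketch (daily_max and the (date, value) record layout are my invention, not your data's actual shape):

```python
from itertools import groupby

def daily_max(hourly_records):
    """hourly_records: iterable of ('YYYY-MM-DD', value) pairs,
    assumed ordered so each day's hours are contiguous.
    Yields (day, maximum value for that day)."""
    for day, group in groupby(hourly_records, key=lambda rec: rec[0]):
        yield day, max(value for _, value in group)
```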
Is there one file per day? Is there one record per hour and are they in
order? If some such assumptions can be made, then you might be able to
factor the problem down to a set of less grandiose functions.
Whoever wrote Dataset() has produced some data from a single file that
you can work with. What does that data look like, and how much of it
are you really needing to save past the opening of the next one?
Sometimes when the data is large and not sorted, it pays off to make two
or more passes through it. In the extreme, you might take one pass to
make a sorted list of all the hours involved in all the files. Then for
each of those hours, you'd take an additional pass looking for data
related to that particular hour. Of course if you're talking years,
you'll be opening each file many thousands of times. So if you know
ANYTHING about the sorting of the data, you can make that more efficient.
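The extreme two-pass version might look like this (read_file is a placeholder for whatever pulls (hour, value) records out of one file; the names are mine):

```python
def two_pass_hourly_max(paths, read_file):
    """read_file(path) -> iterable of (hour_key, value) records.
    Pass 1: collect every hour key that appears in any file.
    Pass 2: for each hour, rescan all the files for matching records."""
    hours = sorted({hour for p in paths for hour, _ in read_file(p)})
    result = {}
    for hour in hours:
        result[hour] = max(v for p in paths
                           for h, v in read_file(p) if h == hour)
    return result
```

Note that every file gets re-read once per hour, which is only tolerable for small datasets; hence the advice to exploit any known ordering instead.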
I can't help you at all with numpy (presumably what np stands for), or
the plotting stuff.
--
DaveA