
I need some help understanding how to loop through many arrays to calculate the 95th percentile. I can easily do this by using numpy.concatenate to make one big array and then finding the 95th percentile with numpy.percentile, but this causes a memory error when I want to run it on hundreds of netCDF files (see code below). Any alternative methods will be greatly appreciated.

    import os
    import numpy as N
    from netCDF4 import Dataset

    all_TSFC = []
    # MainFolder is the top-level directory containing the netCDF files
    for (path, dirs, files) in os.walk(MainFolder):
        for dir in dirs:
            print dir
        path = path + '/'
        for ncfile in files:
            if ncfile[-3:] == '.nc':
                print "dealing with ncfiles:", ncfile
                ncfile = os.path.join(path, ncfile)
                ncfile = Dataset(ncfile, 'r')   # read-only; the data is only read here
                TSFC = ncfile.variables['T_SFC'][:]
                ncfile.close()
                all_TSFC.append(TSFC)

    # this is where the memory error occurs: everything is held in memory at once
    big_array = N.ma.concatenate(all_TSFC)
    Percentile95th = N.percentile(big_array, 95, axis=0)

This is probably not the best way to do it, but I think it would work: you could take two passes through your data, first calculating and storing the median of each file and the number of elements in each file. From those data you can get a lower bound on the 95th percentile of the combined dataset. For example, if all the files are the same size and you've got 100 of them, then the 95th percentile of the full dataset is at least as large as the 90th percentile of the individual files' medians (the 10 files whose medians exceed that cut-off each contribute at least half their elements above it, which is already 5% of the total).

Once you've got that cut-off value, go back through your files and pull out only the values larger than the cut-off. Then you just need to figure out which percentile of this smaller subset corresponds to the 95th percentile of the full dataset.

HTH,
Marc
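A rough sketch of this two-pass idea (not Marc's code; a minimal illustration assuming roughly equal file sizes and a hypothetical load_file() helper that returns one file's T_SFC values as a flat, unmasked numpy array in place of the netCDF loop above):

    import numpy as np

    def percentile_two_pass(filenames, load_file, q=95.0):
        """Two-pass percentile: a per-file-median cut-off, then an exact
        percentile over only the values above the cut-off."""
        # Pass 1: per-file medians and sizes.
        medians, sizes = [], []
        for fname in filenames:
            data = load_file(fname)            # flat numpy array for one file
            medians.append(np.median(data))
            sizes.append(data.size)
        total = float(sum(sizes))

        # Lower bound on the q-th percentile of the combined data:
        # the (2*q - 100)-th percentile of the medians, e.g. 90 for q = 95.
        # This bound follows Marc's argument and assumes comparable file sizes.
        cutoff = np.percentile(medians, 2 * q - 100)

        # Pass 2: keep only the values above the cut-off; this subset is small.
        kept = []
        for fname in filenames:
            data = load_file(fname)
            kept.append(data[data > cutoff])
        kept = np.concatenate(kept)

        # We want the value with (1 - q/100) * total elements above it;
        # convert that rank into a percentile within the kept subset.
        n_above_target = (1.0 - q / 100.0) * total
        q_within_kept = 100.0 * (1.0 - n_above_target / kept.size)
        return np.percentile(kept, q_within_kept)

The correction at the end just translates "this many values above the cut-off in the whole dataset" into the equivalent percentile of the kept subset, so only the small subset ever needs to be in memory at once.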

If the range of your data is known and limited (i.e., you have a comparatively small number of possible values, but many repeats of each value), then you could do this by keeping a running cumulative distribution function as you go through your files. For each file, calculate a cumulative distribution function --- at each possible value, record the fraction of that file's population strictly less than that value --- and then it's straightforward to combine the cumulative distribution functions from two separate files:

    cumdist_both = (cumdist1 * N1 + cumdist2 * N2) / (N1 + N2)

Once you've gone through all the files, look for the value where the combined cumulative distribution function equals 0.95.

If your data isn't structured with repeated values, though, this won't work, because the cumulative distribution function will become too big to hold in memory. In that case, what I would probably do is take an iterative approach: build an approximation to the exact function by keeping only a limited set of the possible values, which will bracket the percentile you want within a limited range, then walk through the files again calculating the function more exactly within that range, repeating until you have the value to the desired precision.

~Brett
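One way to realize the iterative idea is with a fixed-size histogram whose range is narrowed on each pass. This is a minimal sketch, not code from the thread; it assumes a hypothetical iter_arrays() generator function that yields one file's values at a time as a flat, unmasked numpy array (e.g. the masked array's .compressed()):

    import numpy as np

    def percentile_by_refinement(iter_arrays, q=95.0, n_bins=1000,
                                 tol=1e-3, max_passes=20):
        """Approximate the q-th percentile with repeated passes over the
        files; memory use is O(n_bins), independent of the data size."""
        # Pass 0: overall range and total count.
        lo, hi, total = np.inf, -np.inf, 0
        for data in iter_arrays():
            lo = min(lo, data.min())
            hi = max(hi, data.max())
            total += data.size
        target = q / 100.0 * total            # rank of the desired value

        for _ in range(max_passes):
            edges = np.linspace(lo, hi, n_bins + 1)
            counts = np.zeros(n_bins, dtype=np.int64)
            below = 0                         # values below the current range
            for data in iter_arrays():
                below += np.count_nonzero(data < lo)
                counts += np.histogram(data, bins=edges)[0]
            # Find the bin where the cumulative count crosses the target rank,
            # then shrink the search range to that single bin.
            cum = below + np.cumsum(counts)
            i = min(int(np.searchsorted(cum, target)), n_bins - 1)
            lo, hi = edges[i], edges[i + 1]
            if hi - lo < tol:
                break
        return 0.5 * (lo + hi)

With hundreds of files this rereads the data up to max_passes times, but it never holds more than one file's array plus the n_bins-sized histogram in memory, and the files do not need to be the same size.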

Thanks for your responses. Because of the size of the dataset I will still end up with the memory error if I calculate the median for each file, and additionally the files are not all the same size. I believe the memory problem will also arise with the cumulative distribution calculation, and I'm not sure I understand how to write the second suggestion about the iterative approach, but I will have a go. Thanks again.

Note that if you are OK with an approximate solution, and you can assume your data is somewhat shuffled, there is a simple online algorithm that uses essentially no memory:

- choose a small step size delta
- initialize your percentile estimate p to a more or less arbitrary value (a meaningful guess is better, though)
- iterate through your samples, updating p after each sample: p += 19 * delta if sample > p, and p -= delta otherwise

The idea is that the 95th percentile is the value such that 5% of the data is higher and 95% (19 times more) is lower, so if p is equal to that value it should, on average, remain constant through the online update. You may do multiple passes if you are not confident in your initial value, possibly reducing delta over time to improve accuracy.

-=- Olivier
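A minimal sketch of this online update (again assuming a hypothetical iter_samples() generator function yielding one file's values at a time as a plain numpy array; the 19:1 ratio generalizes to q/(100 - q) for other percentiles):

    import numpy as np

    def online_percentile(iter_samples, q=95.0, delta=0.1, p0=0.0, n_passes=3):
        """Stochastic approximation of the q-th percentile: a single scalar
        of state, updated sample by sample."""
        up = delta * q / (100.0 - q)     # 19 * delta for the 95th percentile
        down = delta
        p = p0
        for _ in range(n_passes):
            for data in iter_samples():          # one file's values at a time
                for sample in np.asarray(data).ravel():
                    if sample > p:
                        p += up                  # ~5% of samples push p up, hard
                    else:
                        p -= down                # ~95% push it down, gently
            # Shrink the steps between passes to improve accuracy.
            up *= 0.5
            down *= 0.5
        return p

As Olivier notes, this relies on the data being reasonably well shuffled; if the files follow a strong temporal or spatial trend, p will track that trend rather than settle, so several passes (ideally over a shuffled file order) with a shrinking delta are advisable.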
participants (4)
- Brett Olsen
- Marc Shivers
- Olivier Delalleau
- questions anon