Note that if you are OK with an approximate solution, and you can assume your data is somewhat shuffled, a simple online algorithm that needs almost no memory (just the current estimate) consists of:<br>- choosing a small step size delta<br>- initializing your percentile estimate p to a more or less arbitrary value (a meaningful guess is better though)<br>

- iterating through your samples, updating p after each sample: p += 19 * delta if sample > p, and p -= delta otherwise<br><br>The idea is that the 95th percentile is such that 5% of the data is higher and 95% (19 times more) is lower, so if p is equal to this value, the expected update is 0.05 * 19 * delta - 0.95 * delta = 0, and on average p should remain constant through the online update.<br>
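<br>The update rule above can be sketched in a few lines of plain Python. This is only a minimal illustration: the function name online_percentile, its parameters, and the synthetic uniform test data are assumptions made for the example, not part of any library.<br>

```python
import random

def online_percentile(samples, p0, delta, up_weight=19.0):
    """Approximate a percentile in one pass, storing only the estimate p.

    up_weight is the ratio (fraction below) / (fraction above);
    19 = 0.95 / 0.05 targets the 95th percentile.
    """
    p = p0
    for x in samples:
        if x > p:
            p += up_weight * delta  # sample above the estimate: move up
        else:
            p -= delta              # sample at or below: move down a little
    return p

# Hypothetical test data: uniform on [0, 1), so the true 95th percentile is 0.95.
random.seed(0)
data = [random.random() for _ in range(200000)]
estimate = online_percentile(data, p0=0.5, delta=1e-5)
```

With real data you would feed the flattened values of each file in turn and carry p over from one file to the next. A smaller delta gives a less noisy estimate but needs more samples to move away from a poor initial guess.<br>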

You may do multiple passes if you are not confident in your initial value, possibly reducing delta over time to improve accuracy.<br><br>-=- Olivier<br><br><div class="gmail_quote">2012/1/24 questions anon <span dir="ltr"><<a href="mailto:questions.anon@gmail.com">questions.anon@gmail.com</a>></span><br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Thanks for your responses.<br>Because of the size of the dataset, I will still end up with the memory error if I calculate the median for each file; additionally, the files are not all the same size. I believe this memory problem will still arise with the cumulative distribution calculation, and I am not sure I understand how to write the second suggestion about the iterative approach, but I will have a go.<br>


Thanks again<div class="HOEnZb"><div class="h5"><br><br><div class="gmail_quote">On Wed, Jan 25, 2012 at 1:26 PM, Brett Olsen <span dir="ltr"><<a href="mailto:brett.olsen@gmail.com" target="_blank">brett.olsen@gmail.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0pt 0pt 0pt 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<div>On Tue, Jan 24, 2012 at 6:22 PM, questions anon<br>

<<a href="mailto:questions.anon@gmail.com" target="_blank">questions.anon@gmail.com</a>> wrote:<br>

</div><div><div>> I need some help understanding how to loop through many arrays to calculate<br>

> the 95th percentile.<br>

> I can easily do this by using numpy.concatenate to make one big array and<br>

> then finding the 95th percentile using numpy.percentile but this causes a<br>

> memory error when I want to run this on 100's of netcdf files (see code<br>

> below).<br>

> Any alternative methods will be greatly appreciated.<br>

><br>

><br>

> all_TSFC=[]<br>

> for (path, dirs, files) in os.walk(MainFolder):<br>

>     for dir in dirs:<br>

>         print dir<br>

>     path=path+'/'<br>

>     for ncfile in files:<br>

>         if ncfile[-3:]=='.nc':<br>

>             print "dealing with ncfiles:", ncfile<br>

>             ncfile=os.path.join(path,ncfile)<br>

>             ncfile=Dataset(ncfile, 'r+', 'NETCDF4')<br>

>             TSFC=ncfile.variables['T_SFC'][:]<br>

>             ncfile.close()<br>

>             all_TSFC.append(TSFC)<br>

><br>

> big_array=N.ma.concatenate(all_TSFC)<br>

> Percentile95th=N.percentile(big_array, 95, axis=0)<br>

<br>

</div></div>If the range of your data is known and limited (i.e., you have a<br>

comparatively small number of possible values, but a number of repeats<br>

of each value) then you could do this by keeping a running cumulative<br>

distribution function as you go through each of your files.  For each<br>

file, calculate a cumulative distribution function --- at each<br>

possible value, record the fraction of that population strictly less<br>

than that value --- and then it's straightforward to combine the<br>

cumulative distribution functions from two separate files:<br>

cumdist_both = (cumdist1 * N1 + cumdist2 * N2) / (N1 + N2)<br>

<br>

Then once you've gone through all the files, look for the value where<br>

your cumulative distribution function first reaches 0.95.  If your data<br>

isn't structured with repeated values, though, this won't work,<br>

because your cumulative distribution function will become too big to<br>

hold in memory.  In that case, what I would probably do would be an<br>

iterative approach:  make an approximation to the exact function by<br>

removing some fraction of the possible values, which will provide a<br>

limited range for the exact percentile you want, and then walk through<br>

the files again calculating the function more exactly within the<br>

limited range, repeating until you have the value to the desired<br>

precision.<br>
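<br>
For the first case (few distinct values, many repeats), a minimal sketch could look like the following. It keeps per-value counts rather than normalized cumulative fractions, which amounts to the same N-weighted combination, and it uses a "less than or equal" cumulative count with a nearest-rank rule, so the result may differ slightly from an interpolated percentile. The function name percentile_from_counts and the toy chunks are assumptions made for the example.<br>

```python
from collections import Counter

def percentile_from_counts(chunks, q=95):
    """Nearest-rank percentile over many chunks, keeping only value counts.

    chunks: iterable of flat sequences of hashable values (one per file).
    q: integer percentile in (0, 100].
    """
    counts = Counter()
    for chunk in chunks:        # only one chunk needs to be in memory at a time
        counts.update(chunk)    # adding counts merges the distributions
    total = sum(counts.values())
    running = 0
    for value in sorted(counts):
        running += counts[value]
        if running * 100 >= q * total:  # cumulative fraction has reached q%
            return value
    return None                 # no data at all

# Two "files" of unequal size holding the values 0..99 once each.
result = percentile_from_counts([list(range(0, 30)), list(range(30, 100))])
```

With NumPy arrays, each chunk would first need to be flattened into a plain sequence of values, and rounding the data to a fixed precision may be needed to keep the number of distinct values small.<br>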

<span><font color="#888888"><br>

~Brett<br>

</font></span><div><div>_______________________________________________<br>

NumPy-Discussion mailing list<br>

<a href="mailto:NumPy-Discussion@scipy.org" target="_blank">NumPy-Discussion@scipy.org</a><br>

<a href="http://mail.scipy.org/mailman/listinfo/numpy-discussion" target="_blank">http://mail.scipy.org/mailman/listinfo/numpy-discussion</a><br>

</div></div></blockquote></div><br>

</div></div><br>

<br></blockquote></div><br>