Using matplotlib's prctile on masked arrays
Hello,

Consider this sample of two columns of data:

999999.9999 999999.9999
999999.9999 999999.9999
999999.9999 999999.9999
999999.9999 1693.9069
999999.9999 1676.1059
999999.9999 1621.5875
651.8040 1542.1373
691.0138 1650.4214
678.5558 1710.7311
621.5777 999999.9999
644.8341 999999.9999
696.2080 999999.9999

Putting this data into a file, say "sample.data", and loading it with:

a,b = np.loadtxt('sample.data', dtype="float").T

I[16]: a
O[16]:
array([ 1.00000000e+06, 1.00000000e+06, 1.00000000e+06, 1.00000000e+06,
        1.00000000e+06, 1.00000000e+06, 6.51804000e+02, 6.91013800e+02,
        6.78555800e+02, 6.21577700e+02, 6.44834100e+02, 6.96208000e+02])

I[17]: b
O[17]:
array([ 999999.9999, 999999.9999, 999999.9999, 1693.9069, 1676.1059,
        1621.5875, 1542.1373, 1650.4214, 1710.7311, 999999.9999,
        999999.9999, 999999.9999])

### Interestingly, the second column is displayed as it is, but the values of a are reformatted a little. Why could this be happening? Any idea?

Anyway, back to masked arrays:

I[24]: am = ma.masked_values(a, value=999999.9999)

I[25]: am
O[25]:
masked_array(data = [-- -- -- -- -- -- 651.804 691.0138 678.5558 621.5777 644.8341 696.208],
             mask = [ True True True True True True False False False False False False],
             fill_value = 999999.9999)

I[30]: bm = ma.masked_values(b, value=999999.9999)

I[31]: am
O[31]:
masked_array(data = [-- -- -- -- -- -- 651.804 691.0138 678.5558 621.5777 644.8341 696.208],
             mask = [ True True True True True True False False False False False False],
             fill_value = 999999.9999)

So far so good. A few basic checks:

I[33]: am/bm
O[33]:
masked_array(data = [-- -- -- -- -- -- 0.422662755126 0.418689311712 0.39664667346 -- -- --],
             mask = [ True True True True True True False False False True True True],
             fill_value = 999999.9999)

I[34]: mean(am/bm)
O[34]: 0.41266624676580849

Unfortunately, matplotlib.mlab's prctile cannot handle this division:

I[54]: prctile(am/bm, p=[5,25,50,75,95])
O[54]:
array([ 3.96646673e-01, 6.21577700e+02, 1.00000000e+06,
        1.00000000e+06, 1.00000000e+06])

This also results in wrong-looking box-and-whisker plots.

Testing further with scipy.stats functions yields expected correct results:

I[55]: stats.scoreatpercentile(am/bm, per=5)
O[55]: 0.40877012449846228

I[49]: stats.scoreatpercentile(am/bm, per=25)
O[49]:
masked_array(data = --, mask = True, fill_value = 1e+20)

I[56]: stats.scoreatpercentile(am/bm, per=95)
O[56]:
masked_array(data = --, mask = True, fill_value = 1e+20)

Any confirmation?

-- Gökhan
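Editorial aside on the display question above, which nobody in the thread picks up. As far as I can tell, both columns are loaded identically and only the printed form differs: NumPy's array printing appears to switch to scientific notation when the ratio of the largest to the smallest absolute value in the array exceeds about 1000. The sketch below rebuilds the two columns by hand (the name MISSING is introduced here for illustration only) and checks that assumption rather than asserting it as documented behaviour.

```python
import numpy as np

MISSING = 999999.9999  # sentinel used in sample.data (name is illustrative)

# The two columns from the post, typed in by hand instead of using loadtxt.
a = np.array([MISSING] * 6
             + [651.8040, 691.0138, 678.5558, 621.5777, 644.8341, 696.2080])
b = np.array([MISSING] * 3
             + [1693.9069, 1676.1059, 1621.5875, 1542.1373, 1650.4214, 1710.7311]
             + [MISSING] * 3)

# The stored sentinel is the same number in both columns.
same_sentinel = bool(a[0] == b[0] == MISSING)

# Ratio of largest to smallest magnitude in each column.
ratio_a = np.abs(a).max() / np.abs(a).min()
ratio_b = np.abs(b).max() / np.abs(b).min()

# String form under the default setting and with small-value suppression on.
with np.printoptions(suppress=False):
    default_a = np.array2string(a)
    default_b = np.array2string(b)
with np.printoptions(suppress=True):
    fixed_a = np.array2string(a)

print(same_sentinel, ratio_a, ratio_b)
print(default_a)
print(fixed_a)
```

If the scientific notation is distracting, np.set_printoptions(suppress=True) changes the display for the whole session without touching the data.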
On Tue, Oct 27, 2009 at 7:56 AM, Gökhan Sever <gokhansever@gmail.com> wrote:
> Testing further with scipy.stats functions yields expected correct results:

These should not be the correct results if you use scipy.stats.scoreatpercentile: it doesn't handle missing values correctly. It treats NaNs or masked/fill values as regular numbers sorted to the end.

stats.mstats.scoreatpercentile is the corresponding function for masked arrays.

(BTW, I wasn't able to quickly copy and paste your example because MaskedArrays don't seem to have a constructive __repr__, i.e. no commas.)

I don't know anything about the matplotlib story.

Josef
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Tue, Oct 27, 2009 at 8:25 AM, <josef.pktd@gmail.com> wrote:
> This should not be the correct results if you use scipy.stats.scoreatpercentile, it doesn't have correct missing value handling, it treats nans or mask/fill values as regular numbers sorted to the end.
> stats.mstats.scoreatpercentile is the corresponding function for masked arrays.

Thanks for the suggestion. I forgot that such a module existed. It yields better results.

I[14]: st.mstats.scoreatpercentile(r, per=25)
O[14]:
masked_array(data = 0.401055201111, mask = False, fill_value = 1e+20)

I[17]: st.scoreatpercentile(r, per=25)
O[17]:
masked_array(data = --, mask = True, fill_value = 1e+20)

I usually fall into traps when using masked arrays. Hopefully I will figure these out before I make funnier mistakes in my analysis.

Besides, it would be nice if the "per" argument accepted a sequence instead of a single item, like matplotlib's prctile, so it could be used as ...(array, per=[5,25,50,75,95]) in one call.

> (BTW I wasn't able to quickly copy and past your example because MaskedArrays don't seem to have a constructive __repr__, i.e. no commas)

You can copy and paste the sample data from this link. When I copied it from a text file into Gmail, it somehow distorted the original look of the data.

http://code.google.com/p/ccnworks/source/browse/trunk/sample.data
-- Gökhan
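Editorial aside on the wish for a sequence-valued "per". One possible route, sketched here rather than offered as the thread's answer, is scipy.stats.mstats.mquantiles, which takes a sequence of probabilities in [0, 1] and ignores masked entries. The variable r is assumed to be am/bm from the first message, and MISSING is an illustrative name. With mquantiles' default alphap=0.4 and betap=0.4, the 25% value should agree with the st.mstats.scoreatpercentile result quoted above.

```python
import numpy as np
import numpy.ma as ma
from scipy.stats import mstats

MISSING = 999999.9999  # sentinel used in sample.data (name is illustrative)

a = np.array([MISSING] * 6
             + [651.8040, 691.0138, 678.5558, 621.5777, 644.8341, 696.2080])
b = np.array([MISSING] * 3
             + [1693.9069, 1676.1059, 1621.5875, 1542.1373, 1650.4214, 1710.7311]
             + [MISSING] * 3)

# Assumed definition of r: the masked ratio from the original post.
r = ma.masked_values(a, MISSING) / ma.masked_values(b, MISSING)

# Percentiles as in the prctile call; mquantiles expects fractions, not percents.
per = [5, 25, 50, 75, 95]
q = mstats.mquantiles(r, prob=[p / 100.0 for p in per])

for p, value in zip(per, q):
    print(p, float(value))
```

Only the three unmasked ratios contribute here, so the extreme percentiles collapse onto the smallest and largest valid values.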
On Oct 27, 2009, at 7:56 AM, Gökhan Sever wrote:
> Unfortunately, matplotlib.mlab's prctile cannot handle this division:

Actually, the division's OK, it's mlab.prctile which is borked. It uses the length of the input array instead of its count to compute the number of valid data points. The easiest workaround in your case is probably to use:

prctile((am/bm).compressed(), p=[5,25,50,75,95])

HIH
P.
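Editorial aside: the length-versus-count distinction Pierre describes can be seen directly on the sample data. The sketch below uses np.percentile as a stand-in for mlab.prctile (an assumption on my side, since mlab.prctile may not be available in recent matplotlib releases), and MISSING is an illustrative name. Note that np.percentile interpolates linearly by default, so its values need not match those from scipy.stats.mstats.

```python
import numpy as np
import numpy.ma as ma

MISSING = 999999.9999  # sentinel used in sample.data (name is illustrative)

a = np.array([MISSING] * 6
             + [651.8040, 691.0138, 678.5558, 621.5777, 644.8341, 696.2080])
b = np.array([MISSING] * 3
             + [1693.9069, 1676.1059, 1621.5875, 1542.1373, 1650.4214, 1710.7311]
             + [MISSING] * 3)

am = ma.masked_values(a, MISSING)
bm = ma.masked_values(b, MISSING)
r = am / bm

n_total = len(r)     # every slot, masked or not
n_valid = r.count()  # unmasked entries only

# compressed() returns a plain 1d ndarray holding just the valid values,
# so any mask-unaware function can consume it safely.
valid = r.compressed()
pcts = np.percentile(valid, [5, 25, 50, 75, 95])

print(n_total, n_valid)
print(pcts)
```

A function that sizes its computation from len(r) works with 12 slots here although only 3 hold valid ratios, which is consistent with the sentinel values leaking into the prctile output in the first message.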
On Tue, Oct 27, 2009 at 12:23 PM, Pierre GM <pgmdevlist@gmail.com> wrote:
> prctile((am/bm).compressed(), p=[5,25,50,75,95])
> HIH
> P.
Great. Exact solution. I should have asked this last week :)

One simple method solves the whole riddle. I had manually masked the MVCs using NaNs.

My guess is that, using compressed(), masked arrays can be used with any of the regular numpy and scipy functions, right?

Thanks for the tip.
-- Gökhan
On Wed, Oct 28, 2009 at 9:52 AM, Gökhan Sever <gokhansever@gmail.com> wrote:
> My guess is using compressed() masked arrays could be used with any of regularly defined numpy and scipy functions, right?

Yes, however it only works for 1d arrays or with ravel(). You cannot compress a 2d array and preserve a rectangular shape (with unequal numbers of missing values). In some cases, removing rows or columns with missing values might be more appropriate, or finding a "neutral" fill value.

Josef
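Editorial aside: a small sketch of the options Josef lists, on a made-up 3x3 array (the data and the fill value 0.0 are illustrative, not from the thread). It uses numpy.ma's compressed(), compress_rows(), compress_cols() and filled(), plus a mask-aware reduction for comparison.

```python
import numpy as np
import numpy.ma as ma

# Hypothetical 2d data with a single missing value marked by the sentinel.
x = ma.masked_values(
    np.array([[1.0, 2.0, 3.0],
              [4.0, 999999.9999, 6.0],
              [7.0, 8.0, 9.0]]),
    999999.9999,
)

# compressed() flattens: the result is 1d and the 2d layout is lost.
flat = x.compressed()

# Dropping whole rows or columns that contain a masked value keeps a rectangle.
rows_kept = ma.compress_rows(x)
cols_kept = ma.compress_cols(x)

# Substituting a fill value keeps the shape; whether 0.0 is "neutral"
# depends entirely on the computation that follows.
filled = x.filled(0.0)

# Mask-aware reductions skip the missing entry without reshaping anything.
col_means = x.mean(axis=0)

print(flat.shape, rows_kept.shape, cols_kept.shape, filled.shape)
print(col_means)
```

Which option fits depends on the analysis: dropping rows discards valid neighbours of the missing value, while a fill value silently enters every later computation.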
participants (3)
- Gökhan Sever
- josef.pktd@gmail.com
- Pierre GM