Memory usage of scipy.io.loadmat
Hi everyone,

I have stumbled on some interesting behavior of scipy.io.loadmat. The short of it: it looks like loadmat is gobbling up memory in some unjustified manner and releasing it under some strange circumstances.

Here's the long story. This is all happening on Mac OS 10.5.7, running EPD 4.0.30001 (Python 2.5.2), but with a relatively new version of scipy (see below). I start ipython -pylab in one terminal and run 'top' in another, in order to monitor the memory usage. Here's what I get initially:

PhysMem: 417M wired, 449M active, 183M inactive, 1055M used, 3041M free

Then:

In [1]: import scipy
In [2]: scipy.__version__
Out[2]: '0.8.0.dev5606'
In [3]: import scipy.io as sio

Here's what it looks like now:

PhysMem: 418M wired, 450M active, 183M inactive, 1058M used, 3038M free

So far, so good. I read in a large matfile with tons of data in it:

In [4]: a = sio.loadmat('/Users/arokem/Projects/SchizoSpread/Scans/SMR033109_MC/Gray/Original/TSeries/Scan1/tSeries1.mat')

PhysMem: 419M wired, 1024M active, 183M inactive, 1632M used, 2464M free

So about 600 MB of memory is taken up by this new variable. Now to the weirdness:

In [5]: b
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/Users/arokem/<ipython console> in <module>()
NameError: name 'b' is not defined

Of course - 'b' doesn't exist! But now the memory usage has dramatically gone down:

PhysMem: 420M wired, 740M active, 183M inactive, 1350M used, 2746M free

So just invoking an error in the ipython command line has freed up 300 MB. Where did they come from? I tried different things: assigning other variables doesn't seem to free up this memory, and neither do calls to other functions - except "plot()", which does seem to do the trick for some reason. Interestingly, when I run all this in a plain Python interactive session (not ipython), I get similar memory usage initially.
Calling a non-existent variable does not free up the memory there, but other things do. For example, importing matplotlib.pylab into the namespace did the trick. Does anyone have any idea what is going on?

Thanks,

Ariel

--
Ariel Rokem
Helen Wills Neuroscience Institute
University of California, Berkeley
http://argentum.ucbso.berkeley.edu/ariel
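The effect Ariel describes can be watched from inside the interpreter itself, without running top alongside. A minimal sketch in modern Python 3 syntax (stdlib only, Linux/macOS; the big list is just a hypothetical stand-in for the loadmat result):

```python
# Watch this process's own memory while creating and dropping a large
# object, mimicking the loadmat session above.
import gc
import resource

def peak_rss():
    # Peak resident set size: kilobytes on Linux, bytes on Mac OS X.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

start = peak_rss()
big = [bytearray(1024) for _ in range(100_000)]  # roughly 100 MB
grown = peak_rss()
del big          # drop the only reference
gc.collect()     # and give the collector a chance to run
print(grown > start)  # True: the allocation shows up in the process stats
```

Note that ru_maxrss is a high-water mark, so it shows the growth but not the release; top (as used in the thread) is still the way to see memory actually handed back.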
So - just invoking an error in the ipython command line has freed up 300 MB. Where did they come from? I tried different things - assigning other variables doesn't seem to free up this memory. Neither do calls to other functions. Except "plot()", which does seem to do the trick for some reason. Interestingly, when I run all this in a python interactive session (and not ipython), I get a similar memory usage initially. Calling a non-existent variable does not free up the memory, but other things do. For example, import matplotlib.pylab into the namespace did the trick. Does anyone have any idea what is going on?
It is probably just the garbage collector being invoked. If you invoke it manually, does it always free the memory? e.g.:

import gc
gc.collect()

Xavier Saint-Mleux
Yes - that does seem to free up the memory. While running this:

In [11]: for i in range(10):
             a = sio.loadmat('/Users/arokem/Projects/SchizoSpread/Scans/SMR033109_MC/Gray/Original/TSeries/Scan1/tSeries1.mat')

causes a memory error, running this:

In [14]: for i in range(10):
             a = sio.loadmat('/Users/arokem/Projects/SchizoSpread/Scans/SMR033109_MC/Gray/Original/TSeries/Scan1/tSeries1.mat')
             gc.collect()

seems like it could go on forever (looking at the memory usage on a memory monitor, it just goes up and down to the same point, without net accumulation).

Thanks a lot!

Ariel

On Tue, Jun 30, 2009 at 3:33 PM, Xavier Saint-Mleux <saintmlx@apstat.com> wrote:
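Xavier's diagnosis fits the classic pattern: objects tied up in reference cycles are reclaimed only when the cyclic collector runs, not by reference counting alone. A hedged Python 3 sketch - the Node cycle is a hypothetical stand-in; that loadmat's result contains cyclic structures is an assumption, not something the thread establishes:

```python
# Objects caught in reference cycles are freed by the cyclic collector,
# not by reference counting.
import gc

class Node:
    pass

def make_cycle():
    a, b = Node(), Node()
    a.other, b.other = b, a          # a reference cycle
    a.payload = bytearray(10 ** 6)   # ~1 MB trapped inside the cycle

gc.disable()                         # mimic the collector not having run yet
for _ in range(100):
    make_cycle()                     # refcounting alone cannot free these
unreachable = gc.collect()           # manual collection reclaims them all
gc.enable()
print(unreachable >= 200)            # True: at least the 200 Node objects
```

This would also explain why unrelated activity (a traceback, a plot() call, an import) appears to free memory: any of it can allocate enough objects to trip the collector's automatic thresholds.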
So - just invoking an error in the ipython command line has freed up 300 MB. Where did they come from? I tried different things - assigning other variables doesn't seem to free up this memory. Neither do calls to other functions. Except "plot()", which does seem to do the trick for some reason. Interestingly, when I run all this in a python interactive session (and not ipython), I get a similar memory usage initially. Calling a non-existent variable does not free up the memory, but other things do. For example, import matplotlib.pylab into the namespace did the trick. Does anyone have any idea what is going on?
It is probably just the garbage collector being invoked. If you invoke it manually, does it always free memory? e.g.:
import gc
gc.collect()
Xavier Saint-Mleux
--
Ariel Rokem
Helen Wills Neuroscience Institute
University of California, Berkeley
http://argentum.ucbso.berkeley.edu/ariel
If I don't want to be searching through the source code - where should I be looking for up-to-date documentation?

Searching via Google for the documentation on the scipy.stats Mann-Whitney U test I found:

http://www.scipy.org/doc/api_docs/SciPy.stats.stats.html#mannwhitneyu
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mstats.mannwhitneyu.html

Both seem outdated compared to what I found here:

http://svn.scipy.org/svn/scipy/trunk/scipy/stats/stats.py
http://projects.scipy.org/scipy/browser/tags/0.7.1/scipy/stats/stats.py

Thank you,
Elias
On Thu, Jul 2, 2009 at 1:18 AM, Elias Pampalk<elias.pampalk@gmail.com> wrote:
If I don't want to be searching through the source code - where should I be looking for up-to-date documentation?
The most up-to-date documentation is almost always the docstring. If you use an advanced interpreter such as ipython, you can read the docstring without looking at the sources:

import numpy as np
help(np.mean)

There is a current effort to bring the official documentation on par with the docstrings, but I don't think it has been done completely for scipy.stats yet.

David
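David's point can be checked programmatically: the docstring shipped with the installed package is current for that exact version, unlike a possibly stale web page. A small Python 3 sketch:

```python
# Read the docstring straight off the installed function.
import numpy as np

doc = np.mean.__doc__
print('mean' in doc.lower())   # True: the documentation is right there
# Equivalent interactive forms: help(np.mean), np.info(np.mean),
# and in ipython: np.mean?  (np.mean?? also shows the source).
```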
Why does stats.py contain both mannwhitneyu and ranksums?

See also: http://en.wikipedia.org/wiki/Mann-Whitney-Wilcoxon_test
"Mann-Whitney U test (also called the Mann-Whitney-Wilcoxon (MWW), Wilcoxon rank-sum test, or Wilcoxon-Mann-Whitney test)"

http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/ranksum.html
"ranksum [...] The test is equivalent to a Mann-Whitney U-test."

Thanks,
Elias

PS. @David, @Pauli: Thank you for answering my question wrt documentation yesterday!
There's also wilcoxon, which is related but subtly different. There's an open ticket for clarifying these docs: http://projects.scipy.org/scipy/ticket/901 - the discussion there might illuminate things.

David

On 2-Jul-09, at 1:43 PM, Elias Pampalk wrote:
Why does stats.py contain both mannwhitneyu and ranksums?
See also: http://en.wikipedia.org/wiki/Mann-Whitney-Wilcoxon_test
"Mann-Whitney U test (also called the Mann-Whitney-Wilcoxon (MWW), Wilcoxon rank-sum test, or Wilcoxon-Mann-Whitney test)"
http://www.mathworks.com/access/helpdesk/help/toolbox/stats/index.html?/access/helpdesk/help/toolbox/stats/ranksum.html
"ranksum [...] The test is equivalent to a Mann-Whitney U-test."
Thanks, Elias
PS. @David, @Pauli: Thank you for answering my question wrt documentation yesterday!
Thanks David! I did a quick comparison between Matlab/stats (R14SP3), R (2.8.1), and Python/SciPy (0.7). Maybe this is useful for others too. (I'm intentionally violating the continuous-distribution assumptions.)

Samples:
- A1 <-> B: not paired, with ties
- A2 <-> B: not paired, without ties
- A1 <-> C: paired, with zeros
- A2 <-> C: paired, without zeros

Matlab:
    A1 = 0:19
    A2 = A1 + (1:20)./100
    B = 0:39
    C = [0:14,16:20]

R:
    A1 <- 0:19
    A2 <- A1 + 1:20/100
    B <- 0:39
    C <- c(0:14,16:20)

SciPy:
    A1 = numpy.arange(20)
    A2 = A1 + numpy.arange(1,21)/100.0
    B = numpy.arange(40)
    C = numpy.array(range(15) + range(16,21))

2 Samples, Not Paired
=====================
(from scipy.stats import stats)

Kruskal-Wallis Test
-------------------
Same p-values for all.

Samples contain ties:
- Matlab: kruskalwallis([A1,B],[A1*0,B*0+1]) = 0.00170615101265
- R: kruskal.test(list(A1,B)) = 0.00170615101265
- R: wilcox.test(A1,B, correct=FALSE) = 0.00170615101265 (+warning: ties)
- SciPy: stats.kruskal(A1,B) = 0.00170615101265

(R: kruskal = wilcox without correction for continuity)

Samples without ties:
- Matlab: kruskalwallis([A2,B], [A2*0,B*0+1]) = 0.00288777919292
- R: kruskal.test(list(A2,B)) = 0.00288777919292
- SciPy: stats.kruskal(A2,B) = 0.00288777919292

Wilcoxon Rank Sum (aka Mann-Whitney U) Test
-------------------------------------------
Matlab and R are identical (but have different defaults wrt exact/approximate); SciPy computes approximate results and does not correct for continuity (changed in version 7.1 for stats.mannwhitneyu?).
Samples contain ties:
- Matlab: ranksum(A1,B) = 0.00175235702866
- R: wilcox.test(A1,B) = 0.00175235702866 (+warning: ties)
- R: wilcox.test(A1,B, correct=FALSE) = 0.001706151012654 (+warning: ties)
- SciPy: stats.mannwhitneyu(A1,B)[1]*2 = 0.0017086895586986284
- SciPy: stats.ranksums(A1,B) = 0.0017112312247389294

Samples without ties:
- Matlab: ranksum(A2,B) = 0.00296255173431
- R: wilcox.test(A2,B, exact=FALSE) = 0.00296255173431
- Matlab: ranksum(A2,B,'method','exact') = 0.00246078580826
- R: wilcox.test(A2,B) = 0.00246078580826
- R: wilcox.test(A2,B, exact=FALSE, correct=FALSE) = 0.00288777919292
- SciPy: stats.mannwhitneyu(A2,B)[1]*2 = 0.00288777919292
- SciPy: stats.ranksums(A2,B) = 0.00288777919292

(SciPy: mannwhitneyu = ranksums = kruskal if no ties)

2 Samples, Paired, Wilcoxon Sign Rank Test
==========================================
(from scipy.stats import wilcoxon)

Matlab and SciPy do not correct for continuity; R does. Matlab and R have different defaults for exact/approximate. Matlab computes exact results even if ties/zeros exist.

With zeros:
- Matlab: signrank(A1,C,'method','approximate') = 0.02534731867747
- R: wilcox.test(A1 - C, correct=FALSE) = 0.02534731867747 (+warnings: ties + zeros)
- Matlab: signrank(A1,C) = 0.06250000000000
- R: wilcox.test(A1 - C) = 0.0368884257070 (+warnings: ties + zeros)
- SciPy: wilcoxon(A1,C) = nan (+error: sample size too small)

Without zeros:
- Matlab: signrank(A2,C,'method','exact') = 0.59581947326660
- R: wilcox.test(A2 - C) = 0.59581947326660
- Matlab: signrank(A2,C) = 0.57548622813650
- R: wilcox.test(A2 - C, exact=FALSE, correct=FALSE) = 0.57548622813650
- SciPy: wilcoxon(A2,C) = 0.57548622813650
- R: wilcox.test(A2 - C, exact=FALSE) = 0.5882844808893

Elias
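The "mannwhitneyu = ranksums = kruskal if no ties" observation can be spot-checked with the same samples. A minimal sketch in Python 3 syntax; it sticks to kruskal and ranksums, since the mannwhitneyu defaults have changed across SciPy versions:

```python
# Verify that the rank-sum and Kruskal-Wallis approximations agree for
# tie-free two-sample data: with two groups, H equals z squared, so the
# chi-square and two-sided normal p-values coincide.
import numpy as np
from scipy import stats

A2 = np.arange(20) + np.arange(1, 21) / 100.0   # no ties against B
B = np.arange(40)

_, p_kruskal = stats.kruskal(A2, B)
_, p_ranksums = stats.ranksums(A2, B)
print(abs(p_kruskal - p_ranksums) < 1e-10)      # True
```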
On Fri, Jul 3, 2009 at 12:06 PM, Elias Pampalk<elias.pampalk@gmail.com> wrote:
I did a quick comparison between Matlab/stats (R14SP3), R (2.8.1), and Python/SciPy (0.7). Maybe this is somehow useful for others too.
...
Elias
Thanks for doing this - it is very helpful. I attached the comparison to ticket:901. Except for one case, the numbers look pretty good. However, the documentation is still weak, and some of the code duplication could be removed.

Josef

(I didn't look at it closely before, because I was on vacation at that time.)
On Fri, Aug 14, 2009 at 10:13 AM, <josef.pktd@gmail.com> wrote:
On Fri, Jul 3, 2009 at 12:06 PM, Elias Pampalk<elias.pampalk@gmail.com> wrote:
I did a quick comparison between Matlab/stats (R14SP3), R (2.8.1), and Python/SciPy (0.7). Maybe this is somehow useful for others too.
...
Elias
Thanks for doing this, this is very helpful. I attached the comparison to ticket:901
Except for one case, the numbers look pretty good. However, documentation is still weak, and some of the code duplication could be removed.
Josef
(I didn't look at it closely before, because I was on vacation at that time.)
I wish we would get more contributions like this; there are still some unverified functions in stats (and in the rest of scipy). Skipper and I (mostly Skipper) spent a lot of time this summer comparing the reworked models code with R, Stata and SAS, and it can be very time-consuming to find out whether the differences are bugs, are because of the use of different options, or whether the functions are based on different definitions. The good result is that we have almost all the code verified against at least one other statistical package.

Josef
On Fri, Aug 14, 2009 at 11:49:37AM -0400, josef.pktd@gmail.com wrote:
Skipper and I (mostly Skipper) spend a lot of time this summer comparing the reworked models code with R, Stata or SAS, and it can be very time consuming to find out whether the differences are bugs, are because of the use of different options or whether the functions are based on different definitions. The good result is that we have almost all the code verified against at least one other statistical package.
Thank you so much guys. I often cite your work to colleagues as an example of why you would want to open source some code: some random guys might come later and actually verify it (and find that you've done a lot of bad things). Gaël
Josef,

I believe we had a discussion about various versions of Wilcoxon and Mann-Whitney some months ago. I found a discussion of this from February. We (or I alone?) also wrote Cython versions of the test, which I cannot find now. I should have filed a ticket then. :-(

Regards, Sturla Molden

josef.pktd@gmail.com skrev:
On Fri, Jul 3, 2009 at 12:06 PM, Elias Pampalk<elias.pampalk@gmail.com> wrote:
I did a quick comparison between Matlab/stats (R14SP3), R (2.8.1), and Python/SciPy (0.7). Maybe this is somehow useful for others too.
...
Elias
Thanks for doing this, this is very helpful. I attached the comparison to ticket:901
Except for one case, the numbers look pretty good. However, documentation is still weak, and some of the code duplication could be removed.
Josef
(I didn't look at it closely before, because I was on vacation at that time.)
My memory serves me badly - that was Kendall's tau. It is still pending review, though.

http://projects.scipy.org/scipy/ticket/893

Sturla

Sturla Molden skrev:
Josef,
I believe we had an discussion about various versions of Wilcoxon and Mann-Whitney some months ago. I find a discussion of this from february. We (or I alone?) also wrote Cython versions of the test, which I cannot find now. I should have filed a ticket then. :-(
Regards, Sturla Molden
josef.pktd@gmail.com skrev:
On Fri, Jul 3, 2009 at 12:06 PM, Elias Pampalk<elias.pampalk@gmail.com> wrote:
I did a quick comparison between Matlab/stats (R14SP3), R (2.8.1), and Python/SciPy (0.7). Maybe this is somehow useful for others too.
...
Elias
Thanks for doing this, this is very helpful. I attached the comparison to ticket:901
Except for one case, the numbers look pretty good. However, documentation is still weak, and some of the code duplication could be removed.
Josef
(I didn't look at it closely before, because I was on vacation at that time.)
On Sun, Aug 16, 2009 at 8:54 PM, Sturla Molden<sturla@molden.no> wrote:
My memory serves me badly, that was Kendall's tau. It is still pending review though.
http://projects.scipy.org/scipy/ticket/893
Sturla
I'm aware of the waiting list for stats enhancement tickets, but I'm only slowly getting used to assigning correct labels. I should have moved ticket 893 to the "needs work" status. The main part that prevents it from quick inclusion is the missing variance calculation needed to compute the p-value. I looked at it at the end of our long discussion, but I only left my comments in the thread. I guess I was too tired from several days of struggling with kendalltau, mannwhitneyu and friends, so that once the confusion was cleared up and the bugs fixed, I wasn't in the "mood" to struggle with programming in cython and getting a cython-based enhancement into svn (a first for me), and I needed to get busy with other things.

I started this summer to go through (some of) the stats tickets. But with the work on stats.models and other things that I'm interested in, e.g. some extensions to the distributions, I don't have much time to finish up the missing work in "needs work" enhancement tickets. Bug fixes still have top priority, and filling in missing tests is second. Any help is very welcome.

Josef
Sturla Molden skrev:
Josef,
I believe we had an discussion about various versions of Wilcoxon and Mann-Whitney some months ago. I find a discussion of this from february. We (or I alone?) also wrote Cython versions of the test, which I cannot find now. I should have filed a ticket then. :-(
Regards, Sturla Molden
josef.pktd@gmail.com skrev:
On Fri, Jul 3, 2009 at 12:06 PM, Elias Pampalk<elias.pampalk@gmail.com> wrote:
I did a quick comparison between Matlab/stats (R14SP3), R (2.8.1), and Python/SciPy (0.7). Maybe this is somehow useful for others too.
...
Elias
Thanks for doing this, this is very helpful. I attached the comparison to ticket:901
Except for one case, the numbers look pretty good. However, documentation is still weak, and some of the code duplication could be removed.
Josef
(I didn't look at it closely before, because I was on vacation at that time.)
On 2009-07-01, Elias Pampalk <elias.pampalk@gmail.com> wrote:
If I don't want to be searching through the source code - where should I be looking for up-to-date documentation?
Searching via Google for the documentation on scipy.stats Mann Whitney U test I found:
http://www.scipy.org/doc/api_docs/SciPy.stats.stats.html#mannwhitneyu http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mstats.mannw...
The second one is up-to-date, but it is for scipy.stats.mstats.mannwhitneyu, not scipy.stats.mannwhitneyu. Apparently the main stats.mannwhitneyu function was not included in the documentation. Fixed; it should appear tomorrow.

-- Pauli Virtanen
participants (9)

- Ariel Rokem
- David Cournapeau
- David Warde-Farley
- Elias Pampalk
- Gael Varoquaux
- josef.pktd@gmail.com
- Pauli Virtanen
- Sturla Molden
- Xavier Saint-Mleux