Dear all,

I get the following memory error while running my program:

    Traceback (most recent call last):
      File "/home/nistl/Software/Netzbetreiber/FLOW/src/MemoryError_Debug.py", line 9, in <module>
        correlation = corrcoef(data_records)
      File "/usr/lib/python2.7/dist-packages/numpy/lib/function_base.py", line 1992, in corrcoef
        c = cov(x, y, rowvar, bias)
      File "/usr/lib/python2.7/dist-packages/numpy/lib/function_base.py", line 1973, in cov
        return (dot(X, X.T.conj()) / fact).squeeze()
    MemoryError

Here is a simple example that reproduces the error:

    #!/usr/bin/env python2.7

    from pybinsel import open
    from numpy import *

    if __name__ == '__main__':
        data_records = random.random((459375, 24))
        correlation = corrcoef(data_records)

My real data has the same dimensions. Is this a problem with the size of the array, or did I simply make a mistake in how I applied corrcoef?

I hope that you can help me! Thanks!

Best regards,
Nicole Stoffels

--
Dipl.-Met. Nicole Stoffels
Wind Power Forecasting and Simulation
ForWind - Center for Wind Energy Research
Institute of Physics
Carl von Ossietzky University Oldenburg
Ammerländer Heerstr. 136
D-26129 Oldenburg
Tel: +49(0)441 798 - 5079
Fax: +49(0)441 798 - 5099
Web: www.ForWind.de
Email: nicole.stoffels@forwind.de
Hi Nicole,

On 27/03/2012 11:12, Nicole Stoffels wrote:
> if __name__ == '__main__':
>     data_records = random.random((459375, 24))
>     correlation = corrcoef(data_records)
May I assume that your data_records array consists of 24 different variables, with 459375 observations of each? If so, and if you expect corrcoef to return a 24*24 matrix, you need to either transpose data_records:

    correlation = corrcoef(data_records.T)

or use the rowvar=0 argument (see the np.corrcoef or np.cov docstring):

    correlation = corrcoef(data_records, rowvar=0)

Both work on my computer, while your example indeed leads to a MemoryError (because a matrix of shape 459375*459375 would be decently big...).

I don't know if this applies to you, but for those used to the Matlab (and textbook) convention of storing variables in columns, the default behaviour of numpy's covariance function is a bit surprising. I guess historical reasons are involved in this choice. Just a matter of getting used to it!

Best,
Pierre
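To make the size argument concrete, here is a rough sketch (not part of the original messages) of how rowvar changes the output shape, together with back-of-the-envelope arithmetic for the full matrix. It assumes float64 storage at 8 bytes per element, and it uses a small stand-in array (n_obs = 1000 is an arbitrary illustrative value) instead of the real data.

```python
import numpy as np

n_obs, n_vars = 1000, 24          # small stand-in for the (459375, 24) array
data_records = np.random.random((n_obs, n_vars))

# Default (rowvar=True): each ROW is treated as a variable,
# so the result has shape (n_obs, n_obs).
shape_default = np.corrcoef(data_records).shape

# rowvar=False (or transposing): each COLUMN is a variable,
# so the result has shape (n_vars, n_vars).
shape_columns = np.corrcoef(data_records, rowvar=False).shape
shape_transposed = np.corrcoef(data_records.T).shape

# Memory needed to hold a full 459375 x 459375 result in float64.
n_full = 459375
bytes_full = n_full * n_full * 8
tib_full = bytes_full / 2.0 ** 40

print(shape_default, shape_columns, shape_transposed)
print("full matrix: %d bytes (%.2f TiB)" % (bytes_full, tib_full))
```

Under these assumptions the full matrix would need about 1.7 TB, which is why the default call fails while the 24*24 variants succeed.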
Hi Pierre,

thanks for the fast answer!

I actually have time series of 24 hours for 459375 grid points in Europe. The time series of every grid point is stored in a column. That's why in my real program I already transposed the data, so that the correlation is computed column by column. What I finally need is the correlation of each grid point with every other grid point. I'm afraid that this results in a 459375*459375 matrix.

The correlation is actually just an interim result. So I'm currently trying to loop over every grid point to get single correlations, which will then be processed further. Is this the right approach?

    for column in range(len(data_records)):
        for columnnumber in range(len(data_records)):
            correlation = corrcoef(data_records[column], data_records[columnnumber])

Best wishes,
Nicole
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
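One detail of a pairwise loop like this is worth spelling out: when called with two 1-D arrays, np.corrcoef(x, y) returns a 2x2 matrix, not a scalar, and the coefficient between x and y sits in the off-diagonal entry. A minimal sketch with made-up data follows; the array name series and its tiny size are stand-ins chosen for illustration, not Nicole's data.

```python
import numpy as np

rng = np.random.RandomState(0)
# 5 hypothetical grid points (rows), 24 hourly values each
series = rng.random_sample((5, 24))

# Two 1-D inputs give a 2x2 matrix: [[1, r], [r, 1]]
pair = np.corrcoef(series[0], series[1])
r_01 = pair[0, 1]                 # the correlation of interest

# The same value appears in the full 5x5 matrix computed in one call.
full = np.corrcoef(series)

print(pair.shape)
print(r_01, full[0, 1])
```

So inside such a loop the value to keep would be correlation[0, 1], assuming a single coefficient per pair is what the later processing needs.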
On 27 March 2012 06:04, Nicole Stoffels <nicole.stoffels@forwind.de> wrote:
> The correlation is actually just an interim result. So I'm currently trying to loop over every gridpoint to get single correlations which will then be processed further. Is this the right approach?
>
>     for column in range(len(data_records)):
>         for columnnumber in range(len(data_records)):
>             correlation = corrcoef(data_records[column], data_records[columnnumber])

It may be painfully slow... You should make sure you don't compute each off-diagonal element twice. Also, if all your computations can be vectorized, you'll probably get a significant performance boost by computing your matrix in blocks instead of element by element. Make the blocks as big as will fit in memory.

-=- Olivier
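The block-wise idea could be sketched roughly as follows. This is only one possible reading of the suggestion, not code from the thread: the function names standardize_rows and iter_corr_blocks, the block size, and the test data are all hypothetical. It relies on the fact that the correlation of two rows equals the dot product of the rows after each has been centered and scaled to unit norm, and it assumes no row is constant (a constant row would cause a division by zero).

```python
import numpy as np

def standardize_rows(data):
    """Center each row and scale it to unit Euclidean norm."""
    centered = data - data.mean(axis=1, keepdims=True)
    norms = np.sqrt((centered ** 2).sum(axis=1, keepdims=True))
    return centered / norms

def iter_corr_blocks(data, block_size):
    """Yield (i, j, block) for the upper-triangular blocks only.

    block[a, b] is the correlation between rows i + a and j + b.
    Blocks below the diagonal are skipped because the matrix is symmetric.
    """
    z = standardize_rows(data)
    n = z.shape[0]
    for i in range(0, n, block_size):
        zi = z[i:i + block_size]
        for j in range(i, n, block_size):
            yield i, j, np.dot(zi, z[j:j + block_size].T)

if __name__ == "__main__":
    rng = np.random.RandomState(0)
    data = rng.random_sample((50, 24))   # 50 hypothetical grid points
    reference = np.corrcoef(data)
    for i, j, block in iter_corr_blocks(data, block_size=16):
        expected = reference[i:i + block.shape[0], j:j + block.shape[1]]
        assert np.allclose(block, expected)
    print("all blocks match np.corrcoef")
```

Each block is processed and then discarded, so peak memory is set by block_size; for instance, a 10000 x 10000 float64 block takes 800 MB. For 459375 grid points there are still 459375 * 459374 / 2 = 105,512,465,625 distinct pairs to visit, so the run time remains substantial even when the work is vectorized.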
Hi,

On Tue, Mar 27, 2012 at 12:12 PM, Nicole Stoffels <nicole.stoffels@forwind.de> wrote:
> My real data has the same dimension. Is this a size problem of the array or did I simply make a mistake in the application of corrcoef?
>
> I hope that you can help me! Thanks!
As others have explained, this approach yields an enormous matrix. However, if I have understood your problem correctly, you could implement a helper class to iterate over all of your observations. Something along these lines (although it may take hours with your data size) would iterate over all correlations:

    """A helper class for correlations between observations."""
    import numpy as np

    class Correlations(object):
        def __init__(self, data):
            self.m = data.shape[0]
            # compatible with corrcoef
            self.scale = self.m - 1
            self.data = data - np.mean(data, 1)[:, None]
            # but you may actually need to scale and translate data
            # in a more application-specific manner
            self.var = (self.data ** 2.).sum(1) / self.scale

        def obs_kth(self, k):
            c = np.dot(self.data, self.data[k]) / self.scale
            return c / (self.var[k] * self.var) ** .5

        def obs_iterate(self):
            for k in xrange(self.m):
                yield self.obs_kth(k)

    if __name__ == '__main__':
        data = np.random.randn(5, 3)
        print np.corrcoef(data).round(3)
        print
        c = Correlations(data)
        print np.array([p for p in c.obs_iterate()]).round(3)

My 2 cents,
-eat
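How the rows produced by such an iterator would be consumed is not shown in the thread. As a sketch only, the following processes one row of correlations at a time and reduces it to a single number per grid point, so nothing larger than one row is ever held in memory. The reduction used here (the index of the most strongly correlated other row) and the names corr_rows and best_partner are hypothetical, chosen purely to illustrate the pattern; the actual "further processing" is not described in the messages.

```python
import numpy as np

def corr_rows(data):
    """Yield, for each row k, its correlations with all rows of data."""
    centered = data - data.mean(axis=1)[:, None]
    sumsq = (centered ** 2).sum(axis=1)
    for k in range(data.shape[0]):
        # Fresh array on every iteration, safe for the caller to modify.
        yield np.dot(centered, centered[k]) / np.sqrt(sumsq[k] * sumsq)

def best_partner(data):
    """For each row, the index of the other row it correlates with most.

    Hypothetical reduction, used only to show row-at-a-time processing.
    """
    partners = np.empty(data.shape[0], dtype=int)
    for k, row in enumerate(corr_rows(data)):
        row[k] = -np.inf            # ignore the trivial self-correlation
        partners[k] = np.argmax(row)
    return partners

if __name__ == "__main__":
    rng = np.random.RandomState(0)
    data = rng.randn(6, 24)         # 6 hypothetical grid points
    print(best_partner(data))
```

The per-row results can be written to disk or aggregated as they arrive, which keeps memory use proportional to the number of grid points instead of its square.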
participants (5)
- eat
- Nicole Stoffels
- Olivier Delalleau
- Pierre Haessig
- Richard Hattersley