[Numpy-discussion] autocorrelation computation performance : use of np.correlate
Pierre Haessig
pierre.haessig at crans.org
Wed Feb 1 13:31:02 EST 2012
Hi,
[I'm not sure whether this discussion belongs to numpy-discussion or
scipy-dev]
In day-to-day time series analysis I regularly need to look at the data
autocorrelation ("acorr" or "acf", depending on the software package).
The straightforward function available to me is matplotlib.pyplot.acorr.
However, for a moderately long time series (say of length 10**5), it
takes a huge amount of time just to display the autocorrelation values
within a small range of time lags.
The main reason is that it relies on np.correlate(x, x, mode=2) while
only a few lags are needed.
(I guess mode=2 is an (old-fashioned?) way to set mode='full'.)
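To illustrate the cost: mode='full' computes every one of the 2*N-1
possible lags, even when only a dozen are displayed. A minimal
demonstration (the array size here is mine, just for illustration):

import numpy as np

x = np.random.randn(1000)

# mode=2 is the legacy integer spelling of the mode argument
# (0 -> 'valid', 1 -> 'same', 2 -> 'full' in old NumPy code).
# mode='full' computes every lag from -(N-1) to +(N-1):
c = np.correlate(x, x, mode='full')
print(c.shape)  # (1999,), i.e. 2*1000 - 1 lags: an O(N**2) computation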
I know that the np.correlate performance issue has been discussed
already, and there is a *somewhat* related ticket
(http://projects.scipy.org/numpy/ticket/1260). I noticed in the ticket's
change number 2 the following comment by Josef: "Maybe a truncated
convolution/correlation would be good". I'll come back to this soon.
I made an example script, "acf_timing.py", to support my point with
some timing data:
In IPython:
>>> run acf_timing.py # imports statsmodels' acf, defines two other
acf implementations, and creates example data 10**5 samples long
%time l,c = mpl_acf(a, 10)
CPU times: user 8.69 s, sys: 0.00 s, total: 8.69 s
Wall time: 11.18 s # pretty long...
%time c = sm_acf(a, 10)
CPU times: user 8.76 s, sys: 0.01 s, total: 8.78 s
Wall time: 10.79 s # long as well; statsmodels has a similar underlying
# implementation:
# http://statsmodels.sourceforge.net/generated/scikits.statsmodels.tsa.stattools.acf.html#scikits.statsmodels.tsa.stattools.acf
# Now, a better option: use the FFT convolution
%time c=sm_acf(a, 10,fft=True)
CPU times: user 0.03 s, sys: 0.01 s, total: 0.04 s
Wall time: 0.07 s
# Fast, but I'm not sure about the memory implications of using the FFT, though.
# The naive option: just compute the acf lags that are needed
%time l,c = naive_acf(a, 10)
CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
Wall time: 0.01 s
# Iterative computation. Pretty silly, but very fast.
# (Of course, this naive implementation won't scale nicely to a large
# number of lags.)
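For reference, the two fast variants boil down to something like the
sketches below. These are rough reconstructions with my own function
names and normalization choices, not the exact code (that is in the
attached acf_timing.py):

import numpy as np

def fft_acf(x, nlags):
    # O(N log N) autocorrelation via the Wiener-Khinchin route:
    # autocovariance = inverse FFT of the power spectrum.
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Zero-pad to >= 2*n so the circular correlation does not wrap around.
    # This padding is also where the memory cost lies: a few O(nfft) work arrays.
    nfft = 2 ** int(np.ceil(np.log2(2 * n)))
    f = np.fft.rfft(x, nfft)
    acov = np.fft.irfft(f * np.conj(f))[:nlags + 1]
    return acov / acov[0]  # normalize so that lag 0 == 1

def naive_acf(x, nlags):
    # O(N * nlags): one dot product per requested lag, nothing more.
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    acov = np.array([np.dot(x[:n - k], x[k:]) for k in range(nlags + 1)])
    return np.arange(nlags + 1), acov / acov[0]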
Now comes (at last) the question: what should be done about this
performance issue?
- should there be a truncated np.convolve/np.correlate function, as
Josef suggested? (see the sketch after this list)
- or should people in need of autocorrelation find workarounds, because
this use case is not big enough to call for a change in np.convolve?
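For concreteness, here is a rough sketch of what such a truncated
correlate could look like. The name, signature, and lag convention
below are purely hypothetical (this is not an existing NumPy function),
and a real implementation would presumably live in C rather than slice
in a Python loop:

import numpy as np

def correlate_truncated(a, v, maxlag):
    # Hypothetical API: like np.correlate(a, v, mode='full'), but only
    # for lags -maxlag..+maxlag instead of all 2*N-1 of them.
    a = np.asarray(a, dtype=float)
    v = np.asarray(v, dtype=float)
    n = len(a)  # assume len(a) == len(v) for simplicity
    lags = np.arange(-maxlag, maxlag + 1)
    out = np.empty(len(lags))
    for i, k in enumerate(lags):
        if k >= 0:
            out[i] = np.dot(a[k:], v[:n - k])
        else:
            out[i] = np.dot(a[:n + k], v[-k:])
    return lags, out

The autocorrelation use case above would then simply be
lags, c = correlate_truncated(a, a, 10).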
I really feel this question is about *where* a change should be
implemented (numpy, scipy.signal, matplotlib?) so that it makes sense
while not breaking 10^10 lines of numpy-related code...
Best,
Pierre
-------------- next part --------------
A non-text attachment was scrubbed...
Name: acf_timing.py
Type: text/x-python
Size: 1780 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20120201/d803de7e/attachment.py>