[Numpy-discussion] another discussion on numpy correlate (and convolution)

Thu Feb 21 13:00:42 EST 2013

Hi everybody,

(just coming from a discussion on the performance of Matplotlib's
(x)corr function which uses np.correlate)

There have been already many discussions on how to compute
(cross-)correlations of time-series in Python (like
http://stackoverflow.com/questions/6991471/computing-cross-correlation-function).
The discussion is spread between the various stakeholders (just to name
some I've in mind : scipy, statsmodel, matplotlib, ...).

There are basically 2 implementations : time-domain and frequency-domain
(using fft + multiplication). My discussion is only on time-domain
implementation. The key usecase which I feel is not well adressed today
is when computing the (cross)correlation of two long timeseries on only
a few lagpoints.

For time-domain, one can either write its own implementation or rely on
numpy.correlate. The latter rely on the fast C implementation
_pyarray_correlate() in multiarraymodule.c
(https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/multiarraymodule.c#L1177)

Now, the signature of this function is meant to accept one of three
*computation modes* ('valid', 'same', 'full'). Thoses modes make a lot
of sense when using this function in the context of convoluting two
signals (say an "input array" and a "impulse response array"). In the
context of computing the (cross)correlation of two long timeseries on
only a few lagpoints, those computation modes are unadapted and
potentially leads to huge computational and memory wastes
(for example :
https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/axes.py#L4332).

For some time, I was thinking the solution was to write a dedicated
function in one of the stakeholder modules (but which ? statsmodel ?)
but now I came to think numpy is the best place to put a change and this
would quickly benefit to all stackeholders downstream.

Indeed, I looked more carefully at the C _pyarray_correlate() function,
and I've come to the conclusion that these three computation modes are
an unnecessary restriction because the actual computation relies on a
triple for loop
(https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/multiarraymodule.c#L1177)
which boundaries are governed by two integers `n_left` and `n_right`
instead of the three modes.

Maybe I've misunderstood the inner-working of this function, but I've
the feeling that the ability to set these two integers directly instead
of just the mode would open up the possibility to compute the
correlation on only a few lagpoints.

I'm fully aware that changing the signature of such a core numpy is
out-of-question but I've got the feeling that a reasonably small some
code refactor might lift the longgoing problem of (cross-)correlation
computation. The Python np.correlate would require two additional
keywords -`n_left` and `n_right`)which would override the `mode`
keyword. Only the C function would need some more care to keep good
backward compatibility in case it's used externally.

What do other people think ?

Best,
Pierre

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 900 bytes
Desc: OpenPGP digital signature
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20130221/18da4418/attachment.sig>