another discussion on numpy correlate (and convolution)
Hi everybody,
(just coming from a discussion on the performance of Matplotlib's (x)corr function, which uses np.correlate)
There have already been many discussions on how to compute (cross-)correlations of time series in Python (like http://stackoverflow.com/questions/6991471/computing-cross-correlation-funct...). The discussion is spread between the various stakeholders (just to name some I've in mind: scipy, statsmodels, matplotlib, ...).
There are basically two implementations: time-domain and frequency-domain (using FFT + multiplication). My discussion is only about the time-domain implementation. The key use case which I feel is not well addressed today is computing the (cross-)correlation of two long time series at only a few lag points.
For the time domain, one can either write one's own implementation or rely on numpy.correlate. The latter relies on the fast C implementation _pyarray_correlate() in multiarraymodule.c (https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/multiar...)
Now, the signature of this function accepts one of three *computation modes* ('valid', 'same', 'full'). Those modes make a lot of sense when using this function to convolve two signals (say an "input array" and an "impulse response array"). In the context of computing the (cross-)correlation of two long time series at only a few lag points, those computation modes are ill-suited and can lead to huge computational and memory waste (for example: https://github.com/matplotlib/matplotlib/blob/master/lib/matplotlib/axes.py#...).
For some time, I was thinking the solution was to write a dedicated function in one of the stakeholder modules (but which? statsmodels?), but I've now come to think numpy is the best place to make a change, and this would quickly benefit all stakeholders downstream.
Indeed, I looked more carefully at the C _pyarray_correlate() function, and I've come to the conclusion that these three computation modes are an unnecessary restriction, because the actual computation relies on a triple for loop (https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/multiar...) whose boundaries are governed by two integers `n_left` and `n_right` instead of the three modes.
Maybe I've misunderstood the inner workings of this function, but I've the feeling that the ability to set these two integers directly, instead of just the mode, would open up the possibility of computing the correlation at only a few lag points.
I'm fully aware that changing the signature of such a core numpy function is out of the question, but I've got the feeling that a reasonably small code refactor might lift the long-standing problem of (cross-)correlation computation. The Python np.correlate would require two additional keywords (`n_left` and `n_right`) which would override the `mode` keyword. Only the C function would need some more care to keep good backward compatibility in case it's used externally.
What do other people think?
Best, Pierre
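[To illustrate the three modes Pierre describes, here is a small example of standard np.correlate behavior; note that every mode's output size is tied to the input lengths, with no way to ask for just a few lags:]

```python
import numpy as np

x = np.arange(5.0)
y = np.ones(5)

# The three existing computation modes of np.correlate:
full = np.correlate(x, y, mode='full')    # length len(x) + len(y) - 1 = 9
same = np.correlate(x, y, mode='same')    # length max(len(x), len(y)) = 5
valid = np.correlate(x, y, mode='valid')  # length 1 for equal-length inputs

# For two series of length N, 'full' computes all 2*N - 1 lags
# even when only a handful of them are actually needed.
print(full.shape, same.shape, valid.shape)
```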
On Thu, Feb 21, 2013 at 1:00 PM, Pierre Haessig
What do other people think ?
I think it's a good idea. I didn't look at the implementation details. In statsmodels we also use explicit loops, (x[lag:] * y[:-lag]).sum(0), in places where we expect that in most cases we only need a few correlations (< 10 or < 20). (The fft and correlate versions are mainly used where we expect or need a large number, or the full set, of valid correlations.) Josef
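[A minimal, self-contained sketch of that explicit-loop approach; the helper name is made up for illustration, and it assumes equal-length inputs and non-negative lags:]

```python
import numpy as np

def xcorr_few_lags(x, y, maxlag):
    """Cross-correlation of equal-length x and y at lags 0..maxlag only,
    via explicit slicing.  (Hypothetical helper name, for illustration.)"""
    n = len(x)
    out = np.empty(maxlag + 1)
    for lag in range(maxlag + 1):
        # At lag 0 the slices are the full arrays: y[:n - 0] == y.
        out[lag] = np.dot(x[lag:], y[:n - lag])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
# Autocorrelation at 6 lags costs O(6 * n) operations, instead of the
# O(n**2) that mode='full' would spend on all 2*n - 1 lags.
r = xcorr_few_lags(x, x, 5)
```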
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
Hi,
On Thu, Feb 21, 2013 at 10:00 AM, Pierre Haessig
From complete ignorance, do you think it is an option to allow a (n_left, n_right) tuple as a value for 'mode'?
Cheers, Matthew
We don't actually want to remove sensitive data, but this tutorial should still allow us to remove a file totally and completely from git history. It doesn't look that hard:
https://help.github.com/articles/remove-sensitive-data
It will require everyone to rebase, so if you want to do this it may be a good idea to schedule it for a specific day, maybe 2-4 weeks in the future, and warn people on the mailing list so nobody tries to commit anything while the process is underway.
On Feb 22, 2013 5:40 PM, "Matthew Brett"
From complete ignorance, do you think it is an option to allow a (n_left, n_right) tuple as a value for 'mode'?
Cheers,
Matthew
Hi, On 22/02/2013 17:40, Matthew Brett wrote:
From complete ignorance, do you think it is an option to allow a (n_left, n_right) tuple as a value for 'mode'?
That may be an option. Another one would be to add some kind of `bounds` option, set to None by default, which would accept the (n_left, n_right) tuple and override `mode`. I don't know which one is better.
The only thing I've in mind which may cause confusion is that `mode` normally receives a string ('valid', 'same', 'full'), but it also accepts the corresponding integer codes (undocumented, but I guess they correspond to an old signature of that function). So I wonder if there could be confusion between:
* mode = n
* mode = (n_left, n_right)
What do you think?
Best, Pierre
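[To make the proposal concrete, here is a pure-Python sketch of what lag-bounded correlation semantics could look like; `correlate_bounds` and the `bounds` tuple are hypothetical names for illustration, not an existing NumPy API, and equal-length inputs are assumed:]

```python
import numpy as np

def correlate_bounds(x, y, bounds):
    """Hypothetical sketch: correlate equal-length x and y at lags
    n_left..n_right only, instead of the 'valid'/'same'/'full' modes.
    Not an actual NumPy API -- illustration of the proposed semantics."""
    n_left, n_right = bounds
    n = len(x)
    out = np.empty(n_right - n_left + 1)
    for i, lag in enumerate(range(n_left, n_right + 1)):
        if lag >= 0:
            # Positive lag: x shifted left relative to y.
            out[i] = np.dot(x[lag:], y[:n - lag])
        else:
            # Negative lag: x shifted right relative to y.
            out[i] = np.dot(x[:n + lag], y[-lag:])
    return out

x = np.arange(8.0)
y = np.ones(8)
# Equivalent to slicing the 'full' result symmetrically around zero lag,
# without ever computing the other 2*n - 6 lags.
few = correlate_bounds(x, y, (-2, 2))
```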
Hi,
On Tue, Feb 26, 2013 at 12:20 AM, Pierre Haessig
Hi,
On 22/02/2013 17:40, Matthew Brett wrote:
From complete ignorance, do you think it is an option to allow a (n_left, n_right) tuple as a value for 'mode'?
That may be an option. Another one would be to add some kind of `bounds` option which would be set to None by default but would accept the (n_left, n_right) tuple and would override `mode`.
I don't know which one is better.
Personally I try to avoid mutually incompatible arguments. I guess you'd have to raise an error if the bounds were defined as well as the mode?
The only thing I've in mind which may cause confusion is that `mode` normally receives a string ('valid', 'same', 'full') but it also accept corresponding integer codes (undocumented, but I guess it corresponds to a old signature of that function). So I wonder if there could be a confusion between : * mode = n * mode = (n_left, n_right)
Maybe deprecate the integer arguments? It doesn't seem too bad to me that the numbers in the tuple are different in meaning from a scalar, particularly if the scalar argument is undocumented and will soon go away. Cheers, Matthew
participants (4)
- josef.pktd@gmail.com
- Matthew Brett
- Pierre Haessig
- Todd