Question on lstsq and correlation coeff

Hi, all,
It is probably a newbie question.
I am trying to use scipy/numpy in a financial context. I want to compute the correlation coefficient of two series (returns vs. index returns). I tried two approaches.
Firstly,
    from scipy.linalg import lstsq
    coeffs, a, b, c = lstsq(matrix, returns)  # matrix contains index returns
then I tried,
    import numpy as np
    cov = np.cov(idx1, returns)
    print cov.tolist()
    stddev_x = np.std(returns, ddof=1)
    stddev_y = np.std(idx1, ddof=1)
    print "cor = %s" % (cov.tolist()[:-1] / (stddev_x * stddev_y))
They differ from each other.
As you can see from the numpy example, I am trying to find the correlation coefficient for a sample (ddof=1).
So, my question is: is the discrepancy caused by the fact that I am trying to use lstsq() on a 'sample population' (i.e. I am not regressing a full return series)? Is it correct to use lstsq() this way?
Cheers, Anthony
NOTICE This e-mail and any attachments are confidential and may contain copyright material of Macquarie Group Limited or third parties. If you are not the intended recipient of this email you should not read, print, re-transmit, store or act in reliance on this e-mail or any attachments, and should destroy all copies of them. Macquarie Group Limited does not guarantee the integrity of any emails or any attached files. The views or opinions expressed are the author's own and may not reflect the views or opinions of Macquarie Group Limited.

On Wed, Feb 25, 2009 at 6:21 PM, Anthony Kong <Anthony.Kong@macquarie.com> wrote:
> I want to compute the correlation coefficient of two series (returns vs. index returns). I tried two approaches.
> [...]
> So, my question is: is the discrepancy caused by the fact that I am trying to use lstsq() on a 'sample population' (i.e. I am not regressing a full return series)? Is it correct to use lstsq() this way?
The most direct way is to calculate the correlation matrix and use index [0,1] to get the coefficient:
    numpy.corrcoef(x, y=None, rowvar=1, bias=0)
np.cov, which you used, uses the biased estimator (denominator = N) by default, but for std you used N-1.
Josef
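As a rough illustration of the point about matching denominators, the sketch below compares np.corrcoef with a hand-computed ratio in which np.cov and np.std are both given ddof=1 explicitly. The arrays idx1 and returns are synthetic stand-ins (an assumption, not data from this thread). Note that recent NumPy releases document N - 1 as the default denominator for np.cov, so passing ddof explicitly avoids depending on either function's default.

```python
import numpy as np

# Synthetic stand-ins for the index and stock return series (assumed data).
rng = np.random.default_rng(0)
idx1 = rng.normal(0.0, 0.01, size=250)
returns = 0.8 * idx1 + rng.normal(0.0, 0.005, size=250)

# Direct route: off-diagonal entry of the 2x2 correlation matrix.
r_direct = np.corrcoef(idx1, returns)[0, 1]

# Manual route: covariance over the product of standard deviations,
# using the same N - 1 denominator (ddof=1) everywhere.
cov_xy = np.cov(idx1, returns, ddof=1)[0, 1]
r_manual = cov_xy / (np.std(idx1, ddof=1) * np.std(returns, ddof=1))

print(r_direct, r_manual)
```

The two values agree as long as the covariance and both standard deviations share one ddof; mixing ddof=0 and ddof=1 scales the result by a factor of N/(N-1) or its inverse.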

Hi, Josef,
Thanks very much for the quick and helpful response. Could you also comment on the use of lstsq(): why does it lead to an inconsistent result?
Cheers, Anthony
-----Original Message-----
From: numpy-discussion-bounces@scipy.org [mailto:numpy-discussion-bounces@scipy.org] On Behalf Of josef.pktd@gmail.com
Sent: Thursday, 26 February 2009 11:09 AM
To: Discussion of Numerical Python
Subject: Re: [Numpy-discussion] Question on lstsq and correlation coeff
[...]

On Wed, Feb 25, 2009 at 3:21 PM, Anthony Kong <Anthony.Kong@macquarie.com> wrote:
> I want to compute the correlation coefficient of two series (returns vs. index returns). I tried two approaches.
> Firstly,
>     from scipy.linalg import lstsq
>     coeffs, a, b, c = lstsq(matrix, returns)  # matrix contains index returns
> [...]
coeffs in
    coeffs, a, b, c = lstsq(matrix, returns)  # matrix contains index returns
is the beta of the stock with respect to the index, not the correlation.
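To make the beta-versus-correlation distinction concrete, here is a minimal sketch on synthetic data. It uses np.linalg.lstsq rather than scipy.linalg.lstsq to stay NumPy-only, and it assumes the design matrix holds the index returns plus a column of ones for the intercept (the thread does not say how matrix was built). Under that assumption the fitted slope equals cov(x, y) / var(x), and rescaling it by std(x) / std(y) recovers the correlation coefficient.

```python
import numpy as np

# Synthetic stand-ins for the index and stock return series (assumed data).
rng = np.random.default_rng(1)
idx1 = rng.normal(0.0, 0.01, size=250)
returns = 1.3 * idx1 + rng.normal(0.0, 0.004, size=250)

# Design matrix: index returns plus an intercept column (an assumption).
matrix = np.column_stack([idx1, np.ones_like(idx1)])
coeffs, residuals, rank, sv = np.linalg.lstsq(matrix, returns, rcond=None)
beta, alpha = coeffs

# The regression slope (beta) is covariance over the variance of the index.
beta_from_cov = np.cov(idx1, returns, ddof=1)[0, 1] / np.var(idx1, ddof=1)

# Rescaling beta by the ratio of standard deviations gives the correlation.
r_from_beta = beta * np.std(idx1, ddof=1) / np.std(returns, ddof=1)
r_direct = np.corrcoef(idx1, returns)[0, 1]

print(beta, beta_from_cov)
print(r_from_beta, r_direct)
```

Beta and the correlation coincide only when both series have the same standard deviation, which is why the two approaches in the original post give different numbers.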

intersect1d and setmember1d don't give the expected results when there are duplicate values in either array, because they work by sorting the data and subtracting the previous value. Is there an alternative in numpy to get the indices of intersected values?
In [31]: p nonzero(setmember1d(v1.Id, v2.Id))[0]
[ 0  1  2  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25   <-------------- index 2 shouldn't be here; look at the data below.
 26 27 28 29]
In [32]: p v1.Id[:10]
[ 232.  232.  233.  233.  234.  234.  235.  235.  237.  237.]
In [33]: p v2.Id[:10]
[ 232.  232.  234.  234.  235.  235.  236.  236.  237.  237.]
In [34]: p setmember1d(v1.Id, v2.Id)
[ True  True  True False  True  True  True  True  True  True  True  True   <-------------- index 2 shouldn't be True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True]
In [35]: p setmember1d(v1.Id[:10], v2.Id[:10])
[ True  True  True False  True  True  True  True  True  True]

Hi,
On Feb 26, 2009, at 12:48 PM, mudit sharma wrote:
> intersect1d and setmember1d don't give the expected results when there are duplicate values in either array, because they work by sorting the data and subtracting the previous value. Is there an alternative in numpy to get the indices of intersected values?
From the docstring for setmember1d (and other set operations), you are only supposed to pass it arrays with unique values (i.e. arrays that represent sets in the mathematical sense):
    print numpy.setmember1d.__doc__
    Return a boolean array set True where first element is in second array.
    Boolean array is the shape of `ar1` containing True where the elements
    of `ar1` are in `ar2` and False otherwise.
    Use unique1d() to generate arrays with only unique elements to use as
    inputs to this function.
    [...]
As stated, use unique1d to generate set-arrays from your input.
On the other hand, intersect1d is supposed to work with repeated elements:
    print numpy.intersect1d.__doc__
    Intersection returning repeated or unique elements common to both arrays.
    Parameters
    ----------
    ar1,ar2 : array_like
        Input arrays.
    Returns
    -------
    out : ndarray, shape(N,)
        Sorted 1D array of common elements with repeating elements.
    See Also
    --------
    intersect1d_nu : Returns only unique common elements.
    [...]
Do you have an example of intersect1d not working right? If so, what version of numpy are you using (print numpy.version.version)?
Zach

Zachary Pincus wrote:
> From the docstring for setmember1d (and other set operations), you are only supposed to pass it arrays with unique values (i.e. arrays that represent sets in the mathematical sense).
> [...]
> On the other hand, intersect1d is supposed to work with repeated elements.
> [...]
> Do you have an example of intersect1d not working right? If so, what version of numpy are you using (print numpy.version.version)?
Hi, yes, many functions in arraysetops.py ('intersect1d', 'setxor1d', 'setmember1d', 'union1d', 'setdiff1d') were originally meant to work with arrays of unique elements as inputs. I have just noticed that the docstring of intersect1d says that it works for non-unique arrays and contains the following example:
    np.intersect1d([1,3,3],[3,1,1])
    array([1, 1, 3, 3])
I am not sure if this is a useful behaviour - does anybody use this "feature" (or better, side-effect)? I would like to change the example to the usual use case:
    In [9]: np.intersect1d([1,2,4,3],[3,1,5])
    Out[9]: array([1, 3])
For arrays with non-unique elements, there is:
    In [11]: np.intersect1d_nu([1,3,3],[3,1,1])
    Out[11]: array([1, 3])
which just calls unique1d() for its arguments.
cheers, r.
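For readers on a newer NumPy: as far as I know, unique1d and intersect1d_nu were later deprecated in favour of np.unique and an np.intersect1d that deduplicates its inputs unless told otherwise. The minimal sketch below is explicit about the deduplication step, so it should not depend on which intersect1d behaviour a given release has. The variable names a, b and common are illustrative only.

```python
import numpy as np

a = np.array([1, 3, 3])
b = np.array([3, 1, 1])

# Reduce both inputs to proper sets first, then intersect them.
# This mirrors what the message above describes intersect1d_nu as doing
# (calling unique1d() on its arguments).
common = np.intersect1d(np.unique(a), np.unique(b))

print(common)  # [1 3]
```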

mudit sharma <mudit_19a <at> yahoo.com> writes:
> intersect1d and setmember1d don't give the expected results when there are duplicate values in either array, because they work by sorting the data and subtracting the previous value. Is there an alternative in numpy to get the indices of intersected values?
> [...]
As far as I know there isn't an obvious way to get the functionality of setmember1d working on non-unique inputs. However, I've needed this operation quite a lot, so here's a function I wrote that does it. It's only a few times slower than numpy's setmember1d. You're welcome to use it.
    import numpy as np
    def ismember(a1, a2):
        """ Test whether items from a1 are in a2.
        This does the same thing as np.setmember1d, but works on
        non-unique arrays.
        Only a few (2-4) times slower than np.setmember1d, and a lot
        faster than [i in a2 for i in a1].
        An example that np.setmember1d gets wrong:
        >>> a1 = np.array([5,4,5,3,4,4,3,4,3,5,2,1,5,5])
        >>> a2 = [2,3,4]
        >>> mask = ismember(a1,a2)
        >>> a1[mask]
        array([4, 3, 4, 4, 3, 4, 3, 2])
        """
        a2 = set(a2)
        a1 = np.asarray(a1)
        ind = a1.argsort()
        a1 = a1[ind]
        mask = []
        # need this bit because prev is not defined for first item
        item = a1[0]
        if item in a2:
            mask.append(True)
            a2.remove(item)
        else:
            mask.append(False)
        prev = item
        # main loop
        for item in a1[1:]:
            if item == prev:
                mask.append(mask[-1])
            elif item in a2:
                mask.append(True)
                prev = item
                a2.remove(item)
            else:
                mask.append(False)
                prev = item
        # restore mask to original ordering of a1 and return
        mask = np.array(mask)
        return mask[ind.argsort()]
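In more recent NumPy releases, np.isin (available since 1.13, as far as I know) tests membership element by element and does not require unique inputs, so it can stand in for the function above. The sketch below applies it to the first ten Id values shown earlier in the thread; v1 and v2 here are plain arrays standing in for v1.Id[:10] and v2.Id[:10].

```python
import numpy as np

# First ten Id values from the thread, copied as plain arrays.
v1 = np.array([232., 232., 233., 233., 234., 234., 235., 235., 237., 237.])
v2 = np.array([232., 232., 234., 234., 235., 235., 236., 236., 237., 237.])

# Boolean mask over v1: True where the element also occurs in v2.
mask = np.isin(v1, v2)

# Indices into v1 of the shared values; both copies of 233 are excluded.
indices = np.nonzero(mask)[0]

print(mask)
print(indices)  # [0 1 4 5 6 7 8 9]
```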

Neil wrote:
> As far as I know there isn't an obvious way to get the functionality of setmember1d working on non-unique inputs. However, I've needed this operation quite a lot, so here's a function I wrote that does it. It's only a few times slower than numpy's setmember1d. You're welcome to use it.
Hi Neil!
I would like to add your function to arraysetops.py - is it ok? Just the name would be changed to setmember1d_nu, to follow the naming in the module (like intersect1d_nu).
Thank you, r.

Robert Cimrman <cimrman3 <at> ntc.zcu.cz> writes:
> Hi Neil!
> I would like to add your function to arraysetops.py - is it ok? Just the name would be changed to setmember1d_nu, to follow the naming in the module (like intersect1d_nu).
> Thank you, r.
That's fine! There's no licence attached, it's in the public domain.
Neil

On Mon, Mar 2, 2009 at 03:39, Neil Crighton <neilcrighton@gmail.com> wrote:
> That's fine! There's no licence attached, it's in the public domain.
Do you mind if we just add you to the THANKS.txt file, and consider you as a "NumPy Developer" per the LICENSE.txt as having released that code under the numpy license? If we're dotting our i's and crossing our t's legally, that's a bit more straightforward (oddly enough).
-- Robert Kern
"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco

Robert Kern <robert.kern <at> gmail.com> writes:
> Do you mind if we just add you to the THANKS.txt file, and consider you as a "NumPy Developer" per the LICENSE.txt as having released that code under the numpy license? If we're dotting our i's and crossing our t's legally, that's a bit more straightforward (oddly enough).
No, I don't mind having it released under the numpy licence.
Neil

Neil Crighton wrote:
> No, I don't mind having it released under the numpy licence.
OK, I will take care of including it - how should I proceed now? Has the workflow discussion settled somehow?
r.

Robert Cimrman wrote:
> OK, I will take care of including it - how should I proceed now? Has the workflow discussion settled somehow?
I have created http://projects.scipy.org/numpy/ticket/1036 - the patch will go there.
r.
participants (9)
- Anthony Kong
- josef.pktd@gmail.com
- Keith Goodman
- mudit sharma
- Neil
- Neil Crighton
- Robert Cimrman
- Robert Kern
- Zachary Pincus