[Chicago] Need advice on this project.
Sunhwan Jo
sunhwanj at gmail.com
Tue Nov 10 09:52:58 EST 2015
1. Your “correlation” function takes most of the execution time.
> def Correlation(p, q):
>     global PQ_Ratings
>     sum1 = 0
>     sum2 = 0
>     numeratorProduct = 1
>     denominatorProduct1 = 1
>     denominatorProduct2 = 1
>     for key in filter(lambda x: x[0] == p or x[0] == q, PQ_Ratings.keys()):
>         if key[0] == p:
>             sum1 += PQ_Ratings[key] - AverageRatingsOfItems[key[1]]
>         else:
>             sum2 += PQ_Ratings[key] - AverageRatingsOfItems[key[1]]
>     numeratorProduct += sum1*sum2
>     denominatorProduct1 += sum1**2
>     denominatorProduct2 += sum2**2
>     return numeratorProduct/(math.sqrt(denominatorProduct1)*math.sqrt(denominatorProduct2))
Changing sum1 and sum2 to list comprehensions speeds execution up by roughly 10x (a rough estimate using your code). In addition, the denominator is also wrong: it should be the *sum of squared differences*, not the *square of the sum of differences*, but I'm not concerned with that yet.
> def Correlation(p, q):
>     global PQ_Ratings
>     numeratorProduct = 1
>     denominatorProduct1 = 1
>     denominatorProduct2 = 1
>     keys = [key for key in PQ_Ratings.keys() if key[0] == p or key[0] == q]
>     sum1 = sum([PQ_Ratings[key] - AverageRatingsOfItems[key[1]] for key in keys if key[0] == p])
>     sum2 = sum([PQ_Ratings[key] - AverageRatingsOfItems[key[1]] for key in keys if key[0] == q])
>     numeratorProduct += sum1*sum2
>     denominatorProduct1 += sum1**2
>     denominatorProduct2 += sum2**2
>     return numeratorProduct/(math.sqrt(denominatorProduct1)*math.sqrt(denominatorProduct2))
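For reference, here is a rough sketch of what the corrected math would look like (deviations multiplied item-by-item in the numerator, sums of *squared* deviations in the denominator, restricted to co-rated items, which is the usual convention). The data and the `correlation` helper are made up for illustration, in the same shape as PQ_Ratings and AverageRatingsOfItems:

```python
import math

# Made-up data in the same shape as PQ_Ratings ((user, item) -> rating)
# and AverageRatingsOfItems (item -> average rating).
ratings = {(1, 1): 5.0, (1, 2): 3.0, (2, 1): 4.0, (2, 2): 2.0}
item_avg = {1: 4.5, 2: 2.5}

def correlation(p, q, ratings, item_avg):
    # Per-user deviations from each item's average rating.
    dev_p = {item: r - item_avg[item] for (u, item), r in ratings.items() if u == p}
    dev_q = {item: r - item_avg[item] for (u, item), r in ratings.items() if u == q}
    common = dev_p.keys() & dev_q.keys()   # items rated by both users
    num = sum(dev_p[i] * dev_q[i] for i in common)
    den = (math.sqrt(sum(dev_p[i] ** 2 for i in common))
           * math.sqrt(sum(dev_q[i] ** 2 for i in common)))
    return num / den if den else 0.0

print(correlation(1, 2, ratings, item_avg))  # the two users deviate in opposite directions -> -1.0
```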
2. You don’t have to re-calculate sum1 each time. sum1 only depends on p, so you can compute it once in the outer loop and reuse it.
> keys = PQ_Ratings.keys()
> for i in range(1, len(SimilarityMatrix)):
>     sum1 = sum([PQ_Ratings[key] - AverageRatingsOfItems[key[1]] for key in keys if key[0] == i])
>
>     for j in range(i + 1, len(SimilarityMatrix)):
>         sum2 = sum([PQ_Ratings[key] - AverageRatingsOfItems[key[1]] for key in keys if key[0] == j])
>         numeratorProduct = sum1*sum2 + 1
>         denominatorProduct1 = sum1**2 + 1
>         denominatorProduct2 = sum2**2 + 1
>         SimilarityMatrix[i][j] = numeratorProduct/(math.sqrt(denominatorProduct1)*math.sqrt(denominatorProduct2))
This speeds things up again, but the total execution time is still about 200 minutes with 900+ users.
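To see why hoisting matters, here is a toy count of how often the per-user sum gets evaluated; `per_user_sum` is a hypothetical stand-in for your list-comprehension sum:

```python
# Toy count of why hoisting helps: with the per-user sum inside the inner loop
# it would run once per (i, j) pair; hoisted, sum1 runs once per outer iteration.
calls = 0

def per_user_sum(u):
    # Hypothetical stand-in for the real sum over one user's ratings.
    global calls
    calls += 1
    return u

n = 10  # pretend SimilarityMatrix has 10 rows
for i in range(1, n):
    sum1 = per_user_sum(i)          # hoisted: evaluated n-1 = 9 times in total
    for j in range(i + 1, n):
        sum2 = per_user_sum(j)      # once per pair: (n-1)(n-2)/2 = 36 times

print(calls)  # 9 hoisted calls + 36 pair calls = 45; un-hoisted it would be 72
```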
3. Is there any reason not to use a NumPy array? Using NumPy it finishes in a fraction of a minute. Notice I also fixed the bug in the numerator and the denominator.
> import numpy as np
>
> nitems = max(AverageRatingsOfItems.keys())
> nusers = max([key[0] for key in PQ_Ratings.keys()])
> avg_rating = np.zeros(nitems)
> pq_rating = np.zeros((nusers, nitems))
> for key in PQ_Ratings.keys():
>     pq_rating[key[0]-1, key[1]-1] = PQ_Ratings[key]
> for key in AverageRatingsOfItems.keys():
>     avg_rating[key-1] = AverageRatingsOfItems[key]
>
> startTime = time.time()
>
> #### Let's finish building up our similarity matrix for this problem.
> for i in range(1, len(SimilarityMatrix)):
>     # Keep the deviations as a vector; summing them here would reintroduce the bug.
>     diff1 = pq_rating[i-1] - avg_rating
>
>     for j in range(i + 1, len(SimilarityMatrix)):
>         diff2 = pq_rating[j-1] - avg_rating
>         numeratorProduct = np.sum(diff1*diff2)
>         denominatorProduct1 = np.sum(diff1**2)
>         denominatorProduct2 = np.sum(diff2**2)
>         SimilarityMatrix[i][j] = numeratorProduct/(math.sqrt(denominatorProduct1)*math.sqrt(denominatorProduct2))
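If you ever want to go further, NumPy can eliminate the double loop entirely with a single matrix product: put each user's deviation vector in a row of D, and D @ D.T scaled by the row norms gives every pairwise similarity at once. A sketch with a made-up 3-user by 4-item matrix (the names pq and avg are placeholders for pq_rating and avg_rating, and the item averages are computed from the demo matrix itself):

```python
import numpy as np

# Made-up 3-user x 4-item rating matrix.
pq = np.array([[5., 3., 4., 4.],
               [4., 2., 5., 3.],
               [1., 5., 2., 4.]])
avg = pq.mean(axis=0)                      # per-item average rating

D = pq - avg                               # deviations, one row per user
norms = np.sqrt((D ** 2).sum(axis=1))      # per-user deviation-vector norms
sim = (D @ D.T) / np.outer(norms, norms)   # every user pair in one shot

print(np.allclose(np.diag(sim), 1.0))      # each user correlates perfectly with itself
```

The off-diagonal entries match what the double loop above produces pair by pair, so this is purely a speed-up, not a change in the math.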
> On Nov 9, 2015, at 7:44 PM, Lewit, Douglas <d-lewit at neiu.edu> wrote:
>
> Hey guys,
>
> I need some advice on this one. I'm attaching the homework assignment so that you understand what I'm trying to do. I went as far as the construction of the Similarity Matrix, which is a matrix of Pearson correlation coefficients.
>
> My problem is this. u1.base (which is also attached) contains Users (first column), Items (second column), Ratings (third column) and finally the time stamp in the 4th and final column. (Just discard the 4th column. We're not using it for anything. )
>
> It's taking HOURS for Python to build the similarity matrix. So what I did was:
>
> head -n 5000 u1.base > practice.base
>
> and I also downloaded the PyPy interpreter for Python 3. Then using PyPy (or pypy or whatever) I ran my program on the first ten thousand lines of data from u1.base stored in the new text file, practice.base. Not a problem!!! I still had to wait a couple minutes, but not a couple hours!!!
>
> Is there a way to make this program work for such a large set of data? I know my program successfully constructs the Similarity Matrix (i.e. similarity between users) for 5,000, 10,000, 20,000 and even 25,000 lines of data. But for 80,000 lines of data the program becomes very slow and overtaxes my CPU. (The fan turns on and the bottom of my laptop starts to get very hot.... a bad sign! )
>
> Does anyone have any recommendations? ( I'm supposed to meet with my prof on Tuesday. I may just explain the problem to him and request a smaller data set to work with. And unfortunately he knows very little about Python. He's primarily a C++ and Java programmer. )
>
> I appreciate the feedback. Thank you!!!
>
> Best,
>
> Douglas Lewit
>
>
> <Homework3_Revision2.py><u1.base><practice2.base><HW3.pdf>
> _______________________________________________
> Chicago mailing list
> Chicago at python.org
> https://mail.python.org/mailman/listinfo/chicago