[Chicago] Need advice on this project.

Mark Graves mgraves87 at gmail.com
Tue Nov 10 10:37:36 EST 2015


I think I must have screwed this up; can someone point out my errors?

I worked from Doug's code, then attempted to dictify the results to
minimize lookup times in that filter function.

Full disclosure: I was only going by the absence of errors, with no
knowledge of the algorithm's implementation.

code:

https://gist.github.com/gravesmedical/58a6b665b553c1294b56
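
Roughly, the dictify step has this shape (made-up names; the gist has
the real version):

    from collections import defaultdict

    def dictify(rows):
        # rows: an iterable of (user, item, rating) triples
        ratings = defaultdict(dict)
        for user, item, rating in rows:
            ratings[user][item] = rating
        return ratings

    # A filter like "items rated by both users" then becomes a fast key
    # intersection instead of repeated list scans:
    def common_items(ratings, u, v):
        return ratings[u].keys() & ratings[v].keys()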

On Tue, Nov 10, 2015 at 8:57 AM, Ross Heflin <heflin.rosst at gmail.com> wrote:

> Might be time to profile.
> Run your similarity matrix builder with the large dataset against cProfile
> (or whatever works on PyPy) for a while (30 minutes, say) and see where
> it's spending the majority of its time.
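>
> Something like the following, assuming the builder lives in a script
> called build_matrix.py (a made-up name), writes a report sorted by
> cumulative time, so the hot spots float to the top:
>
>     python3 -m cProfile -s cumtime build_matrix.py > profile.txt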
>
> -Ross
>
> On Mon, Nov 9, 2015 at 7:44 PM, Lewit, Douglas <d-lewit at neiu.edu> wrote:
>
>> Hey guys,
>>
>> I need some advice on this one.  I'm attaching the homework assignment so
>> that you understand what I'm trying to do.  I got as far as constructing
>> the Similarity Matrix, which is a matrix of Pearson correlation
>> coefficients.
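>>
>> (For concreteness, a bare-bones per-pair Pearson looks something like
>> the sketch below, computed only over the items both users rated; the
>> attached code may differ in its details.)
>>
>>     from math import sqrt
>>
>>     def pearson(ratings_u, ratings_v):
>>         # ratings_u, ratings_v: dicts mapping item -> rating
>>         common = ratings_u.keys() & ratings_v.keys()
>>         n = len(common)
>>         if n < 2:
>>             return 0.0
>>         xs = [ratings_u[i] for i in common]
>>         ys = [ratings_v[i] for i in common]
>>         mx, my = sum(xs) / n, sum(ys) / n
>>         cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
>>         sx = sqrt(sum((x - mx) ** 2 for x in xs))
>>         sy = sqrt(sum((y - my) ** 2 for y in ys))
>>         if sx == 0.0 or sy == 0.0:
>>             return 0.0  # a constant rater has no defined correlation
>>         return cov / (sx * sy)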
>>
>> My problem is this.  u1.base (which is also attached) contains Users
>> (first column), Items (second column), Ratings (third column), and
>> Timestamps (fourth and final column).  (Just discard the fourth column;
>> we're not using it for anything.)
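>>
>> (A sketch of loading that file, assuming whitespace-separated columns
>> as described, with the timestamp column thrown away:)
>>
>>     from collections import defaultdict
>>
>>     def load_ratings(path):
>>         # returns {user: {item: rating}}
>>         ratings = defaultdict(dict)
>>         with open(path) as f:
>>             for line in f:
>>                 fields = line.split()
>>                 if not fields:
>>                     continue  # skip blank lines
>>                 user, item, rating, _timestamp = fields
>>                 ratings[int(user)][int(item)] = float(rating)
>>         return ratings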
>>
>> It's taking HOURS for Python to build the similarity matrix.  So what I
>> did was:
>>
>> *head -n 5000 u1.base > practice.base*
>>
>> and I also downloaded the PyPy interpreter for Python 3.  Then using PyPy
>> (or pypy or whatever) I ran my program on the first five thousand lines of
>> data from u1.base stored in the new text file, practice.base.  Not a
>> problem!!!  I still had to wait a couple of minutes, but not a couple of
>> hours!!!
>>
>>
>> Is there a way to make this program work for such a large set of data?  I
>> know my program successfully constructs the Similarity Matrix (i.e., the
>> similarity between users) for 5,000, 10,000, 20,000 and even 25,000 lines
>> of data.  But for 80,000 lines of data the program becomes very slow and
>> overtaxes my CPU.  (The fan turns on and the bottom of my laptop starts to
>> get very hot... a bad sign!)
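>>
>> (For scale: a pairwise loop over U users computes U*(U-1)/2
>> correlations, each rescanning both users' ratings, so the work grows
>> quadratically as the data brings in more users.  A vectorized sketch,
>> assuming numpy and enough memory for a dense users-by-items array,
>> computes the whole matrix in one call; note that np.corrcoef
>> correlates over all items, with missing ratings as fill values, which
>> is not quite the same as correlating only over co-rated items:)
>>
>>     import numpy as np
>>
>>     def similarity_matrix(dense):
>>         # dense: 2-D array, one row per user, one column per item;
>>         # row i of the result is user i's correlation with every user
>>         return np.corrcoef(dense)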
>>
>> Does anyone have any recommendations?  (I'm supposed to meet with my
>> prof on Tuesday.  I may just explain the problem to him and request a
>> smaller data set to work with.  Unfortunately, he knows very little
>> about Python; he's primarily a C++ and Java programmer.)
>>
>> I appreciate the feedback.  Thank you!!!
>>
>> Best,
>>
>> Douglas Lewit
>>
>>
>>
>
>
> --
> From the "desk" of Ross Heflin
> phone number: (847) <23,504,826th decimal place of pi>
>

