<meta http-equiv="content-type" content="text/html; charset=utf-8"><span class="Apple-style-span" style="border-collapse: collapse; font-family: arial, sans-serif; font-size: 13px; ">Hello,<br><br> I am having trouble with performance when trying to create a cross </span><div>
<span class="Apple-style-span" style="border-collapse: collapse; font-family: arial, sans-serif; font-size: 13px; ">tabulation using numpy. Ideally, I would calculate each cell in the<br>cross tabulation separately because this gives me the greatest amount<br>
of flexibility. I have included some sample code as a reference and<br>am really looking for better approaches to the simpleLoop method. So<br>far the histogram2d and histogramdd methods seem to outperform any<br>code I write by a factor of at least 100. I chalk this up to not<br>
understanding enough about numpy yet. Any help would be<br>appreciated.<br><br>Here is the test code:<br>import numpy as np<br>import time<br>import random<br><br><br># Create a simple loop and count up the number of matching cases<br>
# Basic cross tabulation or histogram of the data<br># This approach is preferred because of the need to customize the<br># calculation potentially for each cell.<br>def simpleLoop(c):<br><br> # number of distinct values in each column<br>
a_cnt = len(np.unique(c[:,0]))<br> b_cnt = len(np.unique(c[:,1]))<br> idx = 0<br> result = np.zeros(b_cnt * a_cnt)<br> for i in np.unique(c[:,0]):<br> for j in np.unique(c[:,1]):<br>
result[idx] = np.sum(1*(c[:,0] == i) & (c[:,1] == j))<br> idx += 1<br><br> result.resize(len(result)/b_cnt,b_cnt)<br> return result<br><br><br># Use numpy's histogram method to calculate the matrix of combinations<br>
# and the number of cases in each one.<br>def simpleHistogram(c):<br><br> # bin edges: unique values of column 0, and 1..10 for column 1<br> return np.histogramdd((c[:,0],c[:,1]), bins=[np.unique(c[:,<br>0]),range(1,11)])<br><br><br># Variation1 of simple histogram<br>
def simpleHistogram1(c):<br><br> # one histogram of column 0 per distinct value of column 1<br> results = []<br> for i in np.unique(c[:,1]):<br> results.append(np.histogramdd((c[:,0][c[:,1]==i]),<br>bins=[np.unique(c[:,0])]) or 0)<br>
<br> return np.column_stack([result[0] for result in results])<br><br>if __name__ == '__main__':<br> a = np.random.randint(1,900,200000)<br> b = np.random.randint(1,10,200000)<br> c = np.column_stack((a,b))<br>
<br> print '---- Simple Loop ----'<br> start = time.time()<br> results = simpleLoop(c)<br> print results[0]<br> print time.time() - start<br><br> print '---- Histogram dd no looping ----'<br>
start = time.time()<br> results = simpleHistogram(c)<br> print results[0][0]<br> print time.time() - start<br><br> print '---- Histogram run 1 time for each item in column 1 (10 times) ----'<br>
 start = time.time()<br> results = simpleHistogram1(c)<br> print results[0]<br> print time.time() - start</span></div>
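For comparison with simpleLoop, here is a minimal sketch of one possible vectorized way to get the same counts without the nested Python loops. It is only an illustration, not a definitive answer: the function name crosstab_counts is made up for this example, and it assumes that plain counts per cell are all that is needed. It maps each column to integer codes with np.unique(..., return_inverse=True), combines the two codes into a single cell index, and counts the indices with np.bincount.

```python
import numpy as np


def crosstab_counts(c):
    """Count rows of c for every (column 0 value, column 1 value) pair.

    Hypothetical helper for illustration; not part of numpy.
    """
    # Map each value to its position among the sorted unique values.
    a_vals, a_idx = np.unique(c[:, 0], return_inverse=True)
    b_vals, b_idx = np.unique(c[:, 1], return_inverse=True)

    # Combine the two positions into one flat cell index per row.
    flat = a_idx * len(b_vals) + b_idx

    # Count how many rows fall in each cell; minlength keeps empty cells.
    counts = np.bincount(flat, minlength=len(a_vals) * len(b_vals))

    # Rows follow the unique values of column 0, columns those of column 1.
    return a_vals, b_vals, counts.reshape(len(a_vals), len(b_vals))


if __name__ == '__main__':
    a = np.random.randint(1, 900, 200000)
    b = np.random.randint(1, 10, 200000)
    c = np.column_stack((a, b))

    a_vals, b_vals, table = crosstab_counts(c)
    print(table[0])
```

The row and column order should match simpleLoop, since both iterate over the sorted unique values. If a cell needs a custom calculation rather than a count, the same flat index could probably be reused, for example by passing weights to np.bincount to get per-cell sums, but that depends on what the calculation is.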