<div dir="ltr"><div class="gmail_extra">Hi everyone,</div><div class="gmail_extra"><br></div><div class="gmail_extra">To put it succinctly, here's the BM25 equation:</div><div class="gmail_extra"><br></div><div class="gmail_extra">f(w,D) * (k+1) / (k*B + f(w,D))</div><div class="gmail_extra"><br>where w is the word and D is the document (corresponding to the rows and columns of f, respectively). f is a sparse matrix because only a small fraction of the whole vocabulary appears in any single document. </div><div class="gmail_extra"><br></div><div class="gmail_extra">B depends only on the document, but that doesn't matter here; you can treat it as a constant if you want. </div><div class="gmail_extra"><br></div><div class="gmail_extra">The problem is that, since f(w,D) is almost always zero, I only need to do the calculation (i.e. multiply by (k+1), then divide by (k*B + f(w,D))) where f(w,D) is not zero. Is there a clever way to do this with masks? </div><div class="gmail_extra"><br></div><div class="gmail_extra">You can rearrange the above equation to get this:</div><div class="gmail_extra"><br></div><div class="gmail_extra">(k+1)/(k*B/f(w,D) + 1), but alas we still have f(w,D) appearing in a denominator, which is bad (it means dividing by zero wherever f(w,D) is zero). 
</div><div class="gmail_extra"><br></div><div class="gmail_extra">So anyway, currently I am converting to a coo_matrix and iterating over the non-zero values like this:</div><div class="gmail_extra"><br></div><div class="gmail_extra"><pre style="margin-top:0px;margin-bottom:1em;padding:5px;border:0px;font-size:13px;width:auto;max-height:600px;overflow:auto;font-family:Consolas,Menlo,Monaco,"Lucida Console","Liberation Mono","DejaVu Sans Mono","Bitstream Vera Sans Mono","Courier New",monospace,sans-serif;color:rgb(57,51,24);word-wrap:normal;background-color:rgb(239,240,241)"><code style="margin:0px;padding:0px;border:0px;font-family:Consolas,Menlo,Monaco,"Lucida Console","Liberation Mono","DejaVu Sans Mono","Bitstream Vera Sans Mono","Courier New",monospace,sans-serif;white-space:inherit"><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)">    cx </span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)">=</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)"> x</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)">.</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)">tocoo</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)">()</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)">    
    </span><span style="margin:0px;padding:0px;border:0px;color:rgb(16,16,148)">for</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)"> i</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)">,</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)">j</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)">,</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)">v </span><span style="margin:0px;padding:0px;border:0px;color:rgb(16,16,148)">in</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)"> zip</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)">(</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)">cx</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)">.</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)">row</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)">,</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)"> cx</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)">.</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)">col</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)">,</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)"> cx</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)">.</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)">data</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)">):</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)">
        </span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)">(</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)">i</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)">,</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)">j</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)">,</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)">v</span><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)">)</span></code></pre><pre style="margin-top:0px;margin-bottom:1em;padding:5px;border:0px;font-size:13px;width:auto;max-height:600px;overflow:auto;font-family:Consolas,Menlo,Monaco,"Lucida Console","Liberation Mono","DejaVu Sans Mono","Bitstream Vera Sans Mono","Courier New",monospace,sans-serif;color:rgb(57,51,24);word-wrap:normal;background-color:rgb(239,240,241)"><code style="margin:0px;padding:0px;border:0px;font-family:Consolas,Menlo,Monaco,"Lucida Console","Liberation Mono","DejaVu Sans Mono","Bitstream Vera Sans Mono","Courier New",monospace,sans-serif;white-space:inherit"><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)"><br></span></code></pre><pre style="margin-top:0px;margin-bottom:1em;padding:5px;border:0px;font-size:13px;width:auto;max-height:600px;overflow:auto;font-family:Consolas,Menlo,Monaco,"Lucida Console","Liberation Mono","DejaVu Sans Mono","Bitstream Vera Sans Mono","Courier New",monospace,sans-serif;color:rgb(57,51,24);word-wrap:normal"><code style="margin:0px;padding:0px;border:0px;font-family:Consolas,Menlo,Monaco,"Lucida Console","Liberation Mono","DejaVu Sans Mono","Bitstream Vera Sans Mono","Courier New",monospace,sans-serif;white-space:inherit"><span style="margin:0px;padding:0px;border:0px;color:rgb(48,51,54)"><span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:small;white-space:normal;background-color:rgb(255,255,255)">That iteration is incredibly fast, but unfortunately 
coo_matrix does not support item assignment. So I create a new matrix, either a dok_matrix or a regular NumPy array, and assign the results to that. </span></span></code></pre><pre style="margin-top:0px;margin-bottom:1em;padding:5px;border:0px;width:auto;max-height:600px;overflow:auto;word-wrap:normal"><font face="arial, sans-serif"><span style="white-space:normal">I could also work directly with the .data, .indptr, and .indices attributes of csr_matrix and see whether it's possible to create a copy of the .data attribute and update its values accordingly. I was hoping somebody had encountered this type of issue before.</span></font></pre><pre style="margin-top:0px;margin-bottom:1em;padding:5px;border:0px;width:auto;max-height:600px;overflow:auto;word-wrap:normal"><font face="arial, sans-serif"><span style="white-space:normal">Sincerely,</span></font></pre><pre style="margin-top:0px;margin-bottom:1em;padding:5px;border:0px;width:auto;max-height:600px;overflow:auto;word-wrap:normal"><font face="arial, sans-serif"><span style="white-space:normal">Basil Beirouti</span></font></pre></div></div>
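<div class="gmail_extra">For reference, the idea of updating only the stored values might look roughly like the sketch below. It is only a sketch under stated assumptions: the function name bm25_transform and the default k=1.2 are hypothetical, B is assumed to be either a scalar or a 1-D array with one value per document (column), and k*B is assumed to be nonzero. Because the formula maps a zero count to zero, only the stored entries need to be touched, and no mask is required.</div>

```python
import numpy as np
from scipy import sparse


def bm25_transform(f, B, k=1.2):
    """Apply f*(k+1) / (k*B + f) to the stored entries of a sparse matrix.

    f : scipy sparse matrix of counts (words as rows, documents as columns).
    B : scalar, or 1-D array with one value per document (assumption).
    k : BM25 parameter; the default here is a hypothetical choice.
    """
    # Work on a COO copy so every stored value comes with its row and column.
    coo = f.tocoo(copy=True)
    # Merge repeated (row, col) entries first; the formula is not linear.
    coo.sum_duplicates()

    data = coo.data.astype(np.float64)
    # Look up the B value of the document (column) of each stored entry.
    b_per_doc = np.broadcast_to(np.asarray(B, dtype=np.float64), (coo.shape[1],))
    b = b_per_doc[coo.col]

    # Vectorised update of the stored values only; zeros stay implicit.
    new_data = data * (k + 1.0) / (k * b + data)

    # Build a new matrix instead of assigning element by element.
    return sparse.csr_matrix((new_data, (coo.row, coo.col)), shape=coo.shape)


# Tiny example: 4 words by 3 documents, with a per-document B.
counts = sparse.csr_matrix(np.array([[2, 0, 0],
                                     [0, 1, 0],
                                     [0, 0, 0],
                                     [3, 0, 4]]))
weights = bm25_transform(counts, B=np.array([0.5, 1.0, 2.0]), k=1.5)
```

<div class="gmail_extra">The result has the same sparsity pattern as the input, and the work is proportional to the number of stored entries rather than to the full matrix size.</div>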