<div dir="ltr">Hi Joel,<div><br></div><div>I'm not by my dev computer right now so I can't show you the code, but the problem is that the term frequency -  f(q,D) in that wiki article - appears in both the numerator and the denominator. Also, in the denominator, you must add a scalar quantity to f(q,D), which is unsupported if f(q,D) is coming from a sparse matrix.</div><div><br></div><div> You can factor the equation in different ways but you can't get around the main issue that the sparse matrix must appear in the numerator and denominator. So instead of doing any actual matrix multiplication I just loop on the non-zero elements in the sparse matrix (term-frequency matrix) and fill in the new BM25matrix, element by element. </div><div><br></div><div>Any suggestions that use actual sparse matrix operations would be appreciated. I know I can also loop on the non-zero elements and construct a sparse csr_matrix from that using the.indptr attribute etc. but I'm hoping there's a way to use matrix operations. </div><div><br></div><div>Sincerely,</div><div>Basil Beirouti</div><div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Jul 1, 2016 at 11:00 AM,  <span dir="ltr"><<a href="mailto:scikit-learn-request@python.org" target="_blank">scikit-learn-request@python.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Send scikit-learn mailing list submissions to<br>

        <a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a><br>

<br>

To subscribe or unsubscribe via the World Wide Web, visit<br>

        <a href="https://mail.python.org/mailman/listinfo/scikit-learn" rel="noreferrer" target="_blank">https://mail.python.org/mailman/listinfo/scikit-learn</a><br>

or, via email, send a message with subject or body 'help' to<br>

        <a href="mailto:scikit-learn-request@python.org">scikit-learn-request@python.org</a><br>

<br>

You can reach the person managing the list at<br>

        <a href="mailto:scikit-learn-owner@python.org">scikit-learn-owner@python.org</a><br>

<br>

When replying, please edit your Subject line so it is more specific<br>

than "Re: Contents of scikit-learn digest..."<br>

<br>

<br>

Today's Topics:<br>

<br>

   1. Adding BM25 to sklearn.feature_extraction.text (Update)<br>

      (Basil Beirouti)<br>

   2. Re: Adding BM25 to sklearn.feature_extraction.text (Update)<br>

      (Joel Nothman)<br>

   3. Re: Adding BM25 to sklearn.feature_extraction.text (Update)<br>

      (Sebastian Raschka)<br>

   4. Re: partial_fit implementation for IsolationForest<br>

      (<a href="mailto:donkey-hotei@cryptolab.net">donkey-hotei@cryptolab.net</a>)<br>

<br>

<br>

----------------------------------------------------------------------<br>

<br>

Message: 1<br>

Date: Thu, 30 Jun 2016 17:23:18 -0500<br>

From: Basil Beirouti <<a href="mailto:basilbeirouti@gmail.com">basilbeirouti@gmail.com</a>><br>

To: <a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a><br>

Subject: [scikit-learn] Adding BM25 to sklearn.feature_extraction.text<br>

        (Update)<br>

Message-ID:<br>

        <<a href="mailto:CAB4mTg8tMwoA0NwsfXmVtYWqS547F2NOmP5vj3LTCaNqXjqeWQ@mail.gmail.com">CAB4mTg8tMwoA0NwsfXmVtYWqS547F2NOmP5vj3LTCaNqXjqeWQ@mail.gmail.com</a>><br>

Content-Type: text/plain; charset="utf-8"<br>

<br>

Hello everyone,<br>

<br>

I have successfully created a few versions of the BM25Transformer. I looked<br>

at TFIDFTransformer for guidance and I noticed that it outputs a sparse<br>

matrix when given a sparse termcount matrix as an input.<br>

<br>

Unfortunately, the fastest implementation of BM25Transformer that I have<br>

been able to come up with does NOT output a sparse matrix, it will return a<br>

regular numpy matrix.<br>

<br>

Benchmarked against the entire 20newsgroups corpus, here is how they<br>

perform (assuming input is csr_matrix for all):<br>

<br>

1.) finishes in 4 seconds, outputs a regular numpy matrix<br>

2.) finishes in 30 seconds, outputs a dok_matrix<br>

3.) finishes in 130 seconds, outputs a regular numpy matrix<br>

<br>

It's worth noting that using algorithm 1 and converting the output to a<br>

sparse matrix still takes less time than 3, and takes about as long as 2.<br>

<br>

So my question is, how important is it that my BM25Transformer outputs a<br>

sparse matrix?<br>

<br>

I'm going to try another implementation which looks directly at the data,<br>

indices, and indptr attributes of the inputted csr_matrix. I just wanted to<br>

check in and see what people thought.<br>

<br>

Sincerely,<br>

Basil Beirouti<br>

-------------- next part --------------<br>

An HTML attachment was scrubbed...<br>

URL: <<a href="http://mail.python.org/pipermail/scikit-learn/attachments/20160630/80852326/attachment-0001.html" rel="noreferrer" target="_blank">http://mail.python.org/pipermail/scikit-learn/attachments/20160630/80852326/attachment-0001.html</a>><br>

<br>

------------------------------<br>

<br>

Message: 2<br>

Date: Fri, 1 Jul 2016 08:38:15 +1000<br>

From: Joel Nothman <<a href="mailto:joel.nothman@gmail.com">joel.nothman@gmail.com</a>><br>

To: Scikit-learn user and developer mailing list<br>

        <<a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a>><br>

Subject: Re: [scikit-learn] Adding BM25 to<br>

        sklearn.feature_extraction.text (Update)<br>

Message-ID:<br>

        <<a href="mailto:CAAkaFLUB%2B4gu5cHuYYyc8pqBK4Ews4mkXyBKAvMCVENNPUv98Q@mail.gmail.com">CAAkaFLUB+4gu5cHuYYyc8pqBK4Ews4mkXyBKAvMCVENNPUv98Q@mail.gmail.com</a>><br>

Content-Type: text/plain; charset="utf-8"<br>

<br>

I don't see what about BM25, at least as presented at<br>

<a href="https://en.wikipedia.org/wiki/Okapi_BM25" rel="noreferrer" target="_blank">https://en.wikipedia.org/wiki/Okapi_BM25</a>, should prevent using CSR<br>

operations efficiently. Show us your code.<br>

<br>

On 1 July 2016 at 08:23, Basil Beirouti <<a href="mailto:basilbeirouti@gmail.com">basilbeirouti@gmail.com</a>> wrote:<br>

<br>

> Hello everyone,<br>

><br>

> I have successfully created a few versions of the BM25Transformer. I<br>

> looked at TFIDFTransformer for guidance and I noticed that it outputs a<br>

> sparse matrix when given a sparse termcount matrix as an input.<br>

><br>

> Unfortunately, the fastest implementation of BM25Transformer that I have<br>

> been able to come up with does NOT output a sparse matrix, it will return a<br>

> regular numpy matrix.<br>

><br>

> Benchmarked against the entire 20newsgroups corpus, here is how they<br>

> perform (assuming input is csr_matrix for all):<br>

><br>

> 1.) finishes in 4 seconds, outputs a regular numpy matrix<br>

> 2.) finishes in 30 seconds, outputs a dok_matrix<br>

> 3.) finishes in 130 seconds, outputs a regular numpy matrix<br>

><br>

> It's worth noting that using algorithm 1 and converting the output to a<br>

> sparse matrix still takes less time than 3, and takes about as long as 2.<br>

><br>

> So my question is, how important is it that my BM25Transformer outputs a<br>

> sparse matrix?<br>

><br>

> I'm going to try another implementation which looks directly at the data,<br>

> indices, and indptr attributes of the inputted csr_matrix. I just wanted to<br>

> check in and see what people thought.<br>

><br>

> Sincerely,<br>

> Basil Beirouti<br>

><br>

> _______________________________________________<br>

> scikit-learn mailing list<br>

> <a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a><br>

> <a href="https://mail.python.org/mailman/listinfo/scikit-learn" rel="noreferrer" target="_blank">https://mail.python.org/mailman/listinfo/scikit-learn</a><br>

><br>

><br>

-------------- next part --------------<br>

An HTML attachment was scrubbed...<br>

URL: <<a href="http://mail.python.org/pipermail/scikit-learn/attachments/20160701/de28d786/attachment-0001.html" rel="noreferrer" target="_blank">http://mail.python.org/pipermail/scikit-learn/attachments/20160701/de28d786/attachment-0001.html</a>><br>

<br>

------------------------------<br>

<br>

Message: 3<br>

Date: Thu, 30 Jun 2016 18:33:49 -0400<br>

From: Sebastian Raschka <<a href="mailto:mail@sebastianraschka.com">mail@sebastianraschka.com</a>><br>

To: Scikit-learn user and developer mailing list<br>

        <<a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a>><br>

Subject: Re: [scikit-learn] Adding BM25 to<br>

        sklearn.feature_extraction.text (Update)<br>

Message-ID:<br>

        <<a href="mailto:6411ECB7-BD7C-4960-B847-B3D633DD848A@sebastianraschka.com">6411ECB7-BD7C-4960-B847-B3D633DD848A@sebastianraschka.com</a>><br>

Content-Type: text/plain; charset=utf-8<br>

<br>

Hi, Basil,<br>

<br>

I?d say runtime may not be the main concern regarding sparse vs. dense. In my opinion, the main reason to use sparse arrays would be memory useage. I.e., text data is typically rather large (esp. high-dimensional, sparse feature vector). So one limitation with scikit-learn is typically memory capacity, especially if you are using multiprocessing via the cv param.<br>

<br>

PS:<br>

<br>

> regular numpy matrix<br>

<br>

I think you mean "numpy array?? (Since there?s a numpy matrix datastruct in numpy as well, however, almost no one uses it)<br>

<br>

Best,<br>

Sebastian<br>

<br>

> On Jun 30, 2016, at 6:23 PM, Basil Beirouti <<a href="mailto:basilbeirouti@gmail.com">basilbeirouti@gmail.com</a>> wrote:<br>

><br>

> Hello everyone,<br>

><br>

> I have successfully created a few versions of the BM25Transformer. I looked at TFIDFTransformer for guidance and I noticed that it outputs a sparse matrix when given a sparse termcount matrix as an input.<br>

><br>

> Unfortunately, the fastest implementation of BM25Transformer that I have been able to come up with does NOT output a sparse matrix, it will return a regular numpy matrix.<br>

><br>

> Benchmarked against the entire 20newsgroups corpus, here is how they perform (assuming input is csr_matrix for all):<br>

><br>

> 1.) finishes in 4 seconds, outputs a regular numpy matrix<br>

> 2.) finishes in 30 seconds, outputs a dok_matrix<br>

> 3.) finishes in 130 seconds, outputs a regular numpy matrix<br>

><br>

> It's worth noting that using algorithm 1 and converting the output to a sparse matrix still takes less time than 3, and takes about as long as 2.<br>

><br>

> So my question is, how important is it that my BM25Transformer outputs a sparse matrix?<br>

><br>

> I'm going to try another implementation which looks directly at the data, indices, and indptr attributes of the inputted csr_matrix. I just wanted to check in and see what people thought.<br>

><br>

> Sincerely,<br>

> Basil Beirouti<br>

> _______________________________________________<br>

> scikit-learn mailing list<br>

> <a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a><br>

> <a href="https://mail.python.org/mailman/listinfo/scikit-learn" rel="noreferrer" target="_blank">https://mail.python.org/mailman/listinfo/scikit-learn</a><br>

<br>

<br>

<br>

------------------------------<br>

<br>

Message: 4<br>

Date: Fri, 01 Jul 2016 09:48:55 +0200<br>

From: <a href="mailto:donkey-hotei@cryptolab.net">donkey-hotei@cryptolab.net</a><br>

To: Scikit-learn user and developer mailing list<br>

        <<a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a>><br>

Subject: Re: [scikit-learn] partial_fit implementation for<br>

        IsolationForest<br>

Message-ID: <<a href="mailto:89984291c81434d8be46cdc4c4527b44@cryptolab.net">89984291c81434d8be46cdc4c4527b44@cryptolab.net</a>><br>

Content-Type: text/plain; charset=US-ASCII; format=flowed<br>

<br>

hi Olivier,<br>

<br>

thanks for your response.<br>

<br>

> What you describe is quite different from what sklearn models<br>

> typically do with partial_fit. partial_fit is more about out-of-core /<br>

> streaming fitting rather than true online learning with explicit<br>

> forgetting.<br>

><br>

> In particular what you suggest would not accept calling partial_fit<br>

> with very small chunks (e.g. from tens to a hundred samples at a time)<br>

> because that would not be enough to develop deep isolation trees and<br>

> would harm the performance of the resulting isolation forest.<br>

<br>

I see, suppose I should check to see how the depth of these trees<br>

changes when fitting on small chunks as opposed to large chunks -.<br>

either way, refreshing on at least 1000 samples has proven to work O.K<br>

here in the face of concept drift<br>

<br>

> If the problem is true online learning (tracking a stream of training<br>

> data with expected shifts in its distribution) I think it's better to<br>

> devise a dedicated API that does not try to mimic the scikit-learn API<br>

> (for this specific part). There will typically have to be an<br>

> additional hyperparameter to control how much the model should<br>

> remember about old samples.<br>

<br>

ok, i've been using a parameter called 'n_more_estimators' that decides<br>

how many trees are dropped/added. maybe it is not the best way<br>

<br>

> If the problem is more about out-of-core, then partial_fit is suitable<br>

> but the trees should grow and get reorganized progressively (as<br>

> pointed by others in previous comments).<br>

<br>

maybe a name like "online_fit" would be more appropriate? it would be<br>

nice to know what exactly is meant by "reorganized" , so far ive been<br>

merely dropping the oldest trees<br>

<br>

> BTW,  I would be curious to know more about the kind of anomaly<br>

> detection problem where you found IsolationForests to work well.<br>

<br>

The problem is intrusion detection at the application layer, features<br>

are parsed from http audit logs<br>

<br>

ty<br>

<br>

<br>

------------------------------<br>

<br>

Subject: Digest Footer<br>

<br>

_______________________________________________<br>

scikit-learn mailing list<br>

<a href="mailto:scikit-learn@python.org">scikit-learn@python.org</a><br>

<a href="https://mail.python.org/mailman/listinfo/scikit-learn" rel="noreferrer" target="_blank">https://mail.python.org/mailman/listinfo/scikit-learn</a><br>

<br>

<br>

------------------------------<br>

<br>

End of scikit-learn Digest, Vol 4, Issue 1<br>

******************************************<br>

</blockquote></div><br></div></div></div>