# [scikit-learn] Adding BM25 to sklearn.feature_extraction.text

Joel Nothman joel.nothman at gmail.com
Sat Jul 2 06:06:21 EDT 2016

```Indeed, the best way to do this with CSR will exploit CSR's internals so
that you only need to deal with the matrix elements that are nonzero. Say
you have the tf matrix in CSR:

doc_len = tf.sum(axis=0)
doc_len_term = # compute me

bm25 = tf  # will operate in-place
bm25.data /= (bm25.data + np.repeat(doc_len_term, np.diff(bm25.indptr)))
bm25.data *= (k1 + 1)

On 2 July 2016 at 03:01, Basil Beirouti <basilbeirouti at gmail.com> wrote:

> Hi Joel,
>
> I'm not by my dev computer right now so I can't show you the code, but the
> problem is that the term frequency -  f(q,D) in that wiki article - appears
> in both the numerator and the denominator. Also, in the denominator, you
> must add a scalar quantity to f(q,D), which is unsupported if f(q,D) is
> coming from a sparse matrix.
>
>  You can factor the equation in different ways but you can't get around
> the main issue that the sparse matrix must appear in the numerator and
> denominator. So instead of doing any actual matrix multiplication I just
> loop on the non-zero elements in the sparse matrix (term-frequency matrix)
> and fill in the new BM25matrix, element by element.
>
> Any suggestions that use actual sparse matrix operations would be
> appreciated. I know I can also loop on the non-zero elements and construct
> a sparse csr_matrix from that using the.indptr attribute etc. but I'm
> hoping there's a way to use matrix operations.
>
> Sincerely,
> Basil Beirouti
>
> On Fri, Jul 1, 2016 at 11:00 AM, <scikit-learn-request at python.org> wrote:
>
>> Send scikit-learn mailing list submissions to
>>         scikit-learn at python.org
>>
>> To subscribe or unsubscribe via the World Wide Web, visit
>>         https://mail.python.org/mailman/listinfo/scikit-learn
>> or, via email, send a message with subject or body 'help' to
>>         scikit-learn-request at python.org
>>
>> You can reach the person managing the list at
>>         scikit-learn-owner at python.org
>>
>> than "Re: Contents of scikit-learn digest..."
>>
>>
>> Today's Topics:
>>
>>    1. Adding BM25 to sklearn.feature_extraction.text (Update)
>>       (Basil Beirouti)
>>    2. Re: Adding BM25 to sklearn.feature_extraction.text (Update)
>>       (Joel Nothman)
>>    3. Re: Adding BM25 to sklearn.feature_extraction.text (Update)
>>       (Sebastian Raschka)
>>    4. Re: partial_fit implementation for IsolationForest
>>       (donkey-hotei at cryptolab.net)
>>
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Thu, 30 Jun 2016 17:23:18 -0500
>> From: Basil Beirouti <basilbeirouti at gmail.com>
>> To: scikit-learn at python.org
>> Subject: [scikit-learn] Adding BM25 to sklearn.feature_extraction.text
>>         (Update)
>> Message-ID:
>>         <
>> CAB4mTg8tMwoA0NwsfXmVtYWqS547F2NOmP5vj3LTCaNqXjqeWQ at mail.gmail.com>
>> Content-Type: text/plain; charset="utf-8"
>>
>> Hello everyone,
>>
>> I have successfully created a few versions of the BM25Transformer. I
>> looked
>> at TFIDFTransformer for guidance and I noticed that it outputs a sparse
>> matrix when given a sparse termcount matrix as an input.
>>
>> Unfortunately, the fastest implementation of BM25Transformer that I have
>> been able to come up with does NOT output a sparse matrix, it will return
>> a
>> regular numpy matrix.
>>
>> Benchmarked against the entire 20newsgroups corpus, here is how they
>> perform (assuming input is csr_matrix for all):
>>
>> 1.) finishes in 4 seconds, outputs a regular numpy matrix
>> 2.) finishes in 30 seconds, outputs a dok_matrix
>> 3.) finishes in 130 seconds, outputs a regular numpy matrix
>>
>> It's worth noting that using algorithm 1 and converting the output to a
>> sparse matrix still takes less time than 3, and takes about as long as 2.
>>
>> So my question is, how important is it that my BM25Transformer outputs a
>> sparse matrix?
>>
>> I'm going to try another implementation which looks directly at the data,
>> indices, and indptr attributes of the inputted csr_matrix. I just wanted
>> to
>> check in and see what people thought.
>>
>> Sincerely,
>> Basil Beirouti
>> -------------- next part --------------
>> An HTML attachment was scrubbed...
>> URL: <
>> http://mail.python.org/pipermail/scikit-learn/attachments/20160630/80852326/attachment-0001.html
>> >
>>
>> ------------------------------
>>
>> Message: 2
>> Date: Fri, 1 Jul 2016 08:38:15 +1000
>> From: Joel Nothman <joel.nothman at gmail.com>
>> To: Scikit-learn user and developer mailing list
>>         <scikit-learn at python.org>
>> Subject: Re: [scikit-learn] Adding BM25 to
>>         sklearn.feature_extraction.text (Update)
>> Message-ID:
>>         <
>> CAAkaFLUB+4gu5cHuYYyc8pqBK4Ews4mkXyBKAvMCVENNPUv98Q at mail.gmail.com>
>> Content-Type: text/plain; charset="utf-8"
>>
>> I don't see what about BM25, at least as presented at
>> https://en.wikipedia.org/wiki/Okapi_BM25, should prevent using CSR
>> operations efficiently. Show us your code.
>>
>> On 1 July 2016 at 08:23, Basil Beirouti <basilbeirouti at gmail.com> wrote:
>>
>> > Hello everyone,
>> >
>> > I have successfully created a few versions of the BM25Transformer. I
>> > looked at TFIDFTransformer for guidance and I noticed that it outputs a
>> > sparse matrix when given a sparse termcount matrix as an input.
>> >
>> > Unfortunately, the fastest implementation of BM25Transformer that I have
>> > been able to come up with does NOT output a sparse matrix, it will
>> return a
>> > regular numpy matrix.
>> >
>> > Benchmarked against the entire 20newsgroups corpus, here is how they
>> > perform (assuming input is csr_matrix for all):
>> >
>> > 1.) finishes in 4 seconds, outputs a regular numpy matrix
>> > 2.) finishes in 30 seconds, outputs a dok_matrix
>> > 3.) finishes in 130 seconds, outputs a regular numpy matrix
>> >
>> > It's worth noting that using algorithm 1 and converting the output to a
>> > sparse matrix still takes less time than 3, and takes about as long as
>> 2.
>> >
>> > So my question is, how important is it that my BM25Transformer outputs a
>> > sparse matrix?
>> >
>> > I'm going to try another implementation which looks directly at the
>> data,
>> > indices, and indptr attributes of the inputted csr_matrix. I just
>> wanted to
>> > check in and see what people thought.
>> >
>> > Sincerely,
>> > Basil Beirouti
>> >
>> > _______________________________________________
>> > scikit-learn mailing list
>> > scikit-learn at python.org
>> > https://mail.python.org/mailman/listinfo/scikit-learn
>> >
>> >
>> -------------- next part --------------
>> An HTML attachment was scrubbed...
>> URL: <
>> http://mail.python.org/pipermail/scikit-learn/attachments/20160701/de28d786/attachment-0001.html
>> >
>>
>> ------------------------------
>>
>> Message: 3
>> Date: Thu, 30 Jun 2016 18:33:49 -0400
>> From: Sebastian Raschka <mail at sebastianraschka.com>
>> To: Scikit-learn user and developer mailing list
>>         <scikit-learn at python.org>
>> Subject: Re: [scikit-learn] Adding BM25 to
>>         sklearn.feature_extraction.text (Update)
>> Message-ID:
>>         <6411ECB7-BD7C-4960-B847-B3D633DD848A at sebastianraschka.com>
>> Content-Type: text/plain; charset=utf-8
>>
>> Hi, Basil,
>>
>> I?d say runtime may not be the main concern regarding sparse vs. dense.
>> In my opinion, the main reason to use sparse arrays would be memory useage.
>> I.e., text data is typically rather large (esp. high-dimensional, sparse
>> feature vector). So one limitation with scikit-learn is typically memory
>> capacity, especially if you are using multiprocessing via the cv param.
>>
>> PS:
>>
>> > regular numpy matrix
>>
>> I think you mean "numpy array?? (Since there?s a numpy matrix datastruct
>> in numpy as well, however, almost no one uses it)
>>
>> Best,
>> Sebastian
>>
>> > On Jun 30, 2016, at 6:23 PM, Basil Beirouti <basilbeirouti at gmail.com>
>> wrote:
>> >
>> > Hello everyone,
>> >
>> > I have successfully created a few versions of the BM25Transformer. I
>> looked at TFIDFTransformer for guidance and I noticed that it outputs a
>> sparse matrix when given a sparse termcount matrix as an input.
>> >
>> > Unfortunately, the fastest implementation of BM25Transformer that I
>> have been able to come up with does NOT output a sparse matrix, it will
>> return a regular numpy matrix.
>> >
>> > Benchmarked against the entire 20newsgroups corpus, here is how they
>> perform (assuming input is csr_matrix for all):
>> >
>> > 1.) finishes in 4 seconds, outputs a regular numpy matrix
>> > 2.) finishes in 30 seconds, outputs a dok_matrix
>> > 3.) finishes in 130 seconds, outputs a regular numpy matrix
>> >
>> > It's worth noting that using algorithm 1 and converting the output to a
>> sparse matrix still takes less time than 3, and takes about as long as 2.
>> >
>> > So my question is, how important is it that my BM25Transformer outputs
>> a sparse matrix?
>> >
>> > I'm going to try another implementation which looks directly at the
>> data, indices, and indptr attributes of the inputted csr_matrix. I just
>> wanted to check in and see what people thought.
>> >
>> > Sincerely,
>> > Basil Beirouti
>> > _______________________________________________
>> > scikit-learn mailing list
>> > scikit-learn at python.org
>> > https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>>
>> ------------------------------
>>
>> Message: 4
>> Date: Fri, 01 Jul 2016 09:48:55 +0200
>> From: donkey-hotei at cryptolab.net
>> To: Scikit-learn user and developer mailing list
>>         <scikit-learn at python.org>
>> Subject: Re: [scikit-learn] partial_fit implementation for
>>         IsolationForest
>> Message-ID: <89984291c81434d8be46cdc4c4527b44 at cryptolab.net>
>> Content-Type: text/plain; charset=US-ASCII; format=flowed
>>
>> hi Olivier,
>>
>>
>> > What you describe is quite different from what sklearn models
>> > typically do with partial_fit. partial_fit is more about out-of-core /
>> > streaming fitting rather than true online learning with explicit
>> > forgetting.
>> >
>> > In particular what you suggest would not accept calling partial_fit
>> > with very small chunks (e.g. from tens to a hundred samples at a time)
>> > because that would not be enough to develop deep isolation trees and
>> > would harm the performance of the resulting isolation forest.
>>
>> I see, suppose I should check to see how the depth of these trees
>> changes when fitting on small chunks as opposed to large chunks -.
>> either way, refreshing on at least 1000 samples has proven to work O.K
>> here in the face of concept drift
>>
>> > If the problem is true online learning (tracking a stream of training
>> > data with expected shifts in its distribution) I think it's better to
>> > devise a dedicated API that does not try to mimic the scikit-learn API
>> > (for this specific part). There will typically have to be an
>> > additional hyperparameter to control how much the model should
>> > remember about old samples.
>>
>> ok, i've been using a parameter called 'n_more_estimators' that decides
>> how many trees are dropped/added. maybe it is not the best way
>>
>> > If the problem is more about out-of-core, then partial_fit is suitable
>> > but the trees should grow and get reorganized progressively (as
>> > pointed by others in previous comments).
>>
>> maybe a name like "online_fit" would be more appropriate? it would be
>> nice to know what exactly is meant by "reorganized" , so far ive been
>> merely dropping the oldest trees
>>
>> > BTW,  I would be curious to know more about the kind of anomaly
>> > detection problem where you found IsolationForests to work well.
>>
>> The problem is intrusion detection at the application layer, features
>> are parsed from http audit logs
>>
>> ty
>>
>>
>> ------------------------------
>>
>> Subject: Digest Footer
>>
>> _______________________________________________
>> scikit-learn mailing list
>> scikit-learn at python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>> ------------------------------
>>
>> End of scikit-learn Digest, Vol 4, Issue 1
>> ******************************************
>>
>
>
> _______________________________________________
> scikit-learn mailing list
> scikit-learn at python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20160702/c1a7b4aa/attachment-0001.html>
```