[scikit-learn] adding BM25 relevance function

Wed Jun 15 01:01:48 EDT 2016

Hello Pavel and Joel,

I forked the repository and cloned it on my machine. I'm using PyCharm on a
Mac, and while looking at text.py, I'm getting an unresolved reference for
"xrange" at line 28:

from ..externals.six.moves import xrange

PyCharm says 'six.py' is too large to analyze, so I'm not sure whether
this error is related to that. I decided to try building the code as a
sanity check, but I can't find any reliable instructions for how to do
that. Naively, I opened a terminal, changed to the directory above the
"scikit-learn" folder (where I had cloned my fork), and tried to run:

$ python3 setup.py install

That didn't work. I got this error:

ImportError: No module named 'sklearn'

Can someone point me in the right direction? And how can the code try
to import sklearn if it doesn't exist yet? Note I haven't installed
the release version of scikit-learn using pip or any other tool, but I
should be able to bootstrap it from the source code, right?
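
One guess on my part, which I have not confirmed: Python puts the directory
of the script being run at the front of sys.path, so an "import sklearn"
inside a setup.py could only find the source package when the top-level
setup.py is started from the folder that contains "sklearn". The sketch below
illustrates that effect with a throwaway package; the names
demo_pkg_for_path_check, setup_demo.py, can_import_from and demo are made up
for the illustration and are not part of scikit-learn.

```python
import os
import subprocess
import sys
import tempfile

PKG_NAME = "demo_pkg_for_path_check"  # hypothetical package name


def can_import_from(script_dir):
    """Write a tiny script into script_dir and run it from there.

    Returns True if the script managed to import PKG_NAME.
    """
    script = os.path.join(script_dir, "setup_demo.py")
    with open(script, "w") as fh:
        fh.write("import %s\n" % PKG_NAME)
    proc = subprocess.run(
        [sys.executable, script],
        cwd=script_dir,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return proc.returncode == 0


def demo():
    """Compare running the script next to the package vs. inside it."""
    with tempfile.TemporaryDirectory() as root:
        pkg_dir = os.path.join(root, PKG_NAME)
        os.mkdir(pkg_dir)
        # An empty __init__.py is enough to make the folder a package.
        open(os.path.join(pkg_dir, "__init__.py"), "w").close()
        from_root = can_import_from(root)       # script sits beside the package
        from_inside = can_import_from(pkg_dir)  # script sits inside the package
    return from_root, from_inside


if __name__ == "__main__":
    print(demo())
```

If that guess is right, it would also explain how the build can import
sklearn before anything is installed: it would be importing the plain source
folder rather than an installed copy.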

Here's the full error message if it helps. Forgive me if it's a silly
mistake, but I haven't found any reliable guidelines online.

  File "setup.py", line 84, in <module>
    from numpy.distutils.core import setup
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/distutils/core.py", line 26, in <module>
    from numpy.distutils.command import config, config_compiler, \
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/distutils/command/build_ext.py", line 18, in <module>
    from numpy.distutils.system_info import combine_paths
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/distutils/system_info.py", line 232, in <module>
    triplet = str(p.communicate()[0].decode().strip())
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 791, in communicate
    stdout = _eintr_retry_call(self.stdout.read)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 476, in _eintr_retry_call
    return func(*args)
KeyboardInterrupt

Basils-MacBook-Pro:sklearn basilbeirouti$ python3 setup.py install
non-existing path in '__check_build': '_check_build.c'
Appending sklearn.__check_build configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.__check_build')
Appending sklearn._build_utils configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn._build_utils')
Appending sklearn.covariance configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.covariance')
Appending sklearn.covariance/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.covariance/tests')
Appending sklearn.cross_decomposition configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.cross_decomposition')
Appending sklearn.cross_decomposition/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.cross_decomposition/tests')
Appending sklearn.feature_selection configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.feature_selection')
Appending sklearn.feature_selection/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.feature_selection/tests')
Appending sklearn.gaussian_process configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.gaussian_process')
Appending sklearn.gaussian_process/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.gaussian_process/tests')
Appending sklearn.mixture configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.mixture')
Appending sklearn.mixture/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.mixture/tests')
Appending sklearn.model_selection configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.model_selection')
Appending sklearn.model_selection/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.model_selection/tests')
Appending sklearn.neural_network configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.neural_network')
Appending sklearn.neural_network/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.neural_network/tests')
Appending sklearn.preprocessing configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.preprocessing')
Appending sklearn.preprocessing/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.preprocessing/tests')
Appending sklearn.semi_supervised configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.semi_supervised')
Appending sklearn.semi_supervised/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.semi_supervised/tests')

Warning: Assuming default configuration (./_build_utils/{setup__build_utils,setup}.py was not found)
Warning: Assuming default configuration (./covariance/{setup_covariance,setup}.py was not found)
Warning: Assuming default configuration (./covariance/tests/setup_covariance/{setup_covariance/tests,setup}.py was not found)
Warning: Assuming default configuration (./cross_decomposition/{setup_cross_decomposition,setup}.py was not found)
Warning: Assuming default configuration (./cross_decomposition/tests/setup_cross_decomposition/{setup_cross_decomposition/tests,setup}.py was not found)
Warning: Assuming default configuration (./feature_selection/{setup_feature_selection,setup}.py was not found)
Warning: Assuming default configuration (./feature_selection/tests/setup_feature_selection/{setup_feature_selection/tests,setup}.py was not found)
Warning: Assuming default configuration (./gaussian_process/{setup_gaussian_process,setup}.py was not found)
Warning: Assuming default configuration (./gaussian_process/tests/setup_gaussian_process/{setup_gaussian_process/tests,setup}.py was not found)
Warning: Assuming default configuration (./mixture/{setup_mixture,setup}.py was not found)
Warning: Assuming default configuration (./mixture/tests/setup_mixture/{setup_mixture/tests,setup}.py was not found)
Warning: Assuming default configuration (./model_selection/{setup_model_selection,setup}.py was not found)
Warning: Assuming default configuration (./model_selection/tests/setup_model_selection/{setup_model_selection/tests,setup}.py was not found)
Warning: Assuming default configuration (./neural_network/{setup_neural_network,setup}.py was not found)
Warning: Assuming default configuration (./neural_network/tests/setup_neural_network/{setup_neural_network/tests,setup}.py was not found)
Warning: Assuming default configuration (./preprocessing/{setup_preprocessing,setup}.py was not found)
Warning: Assuming default configuration (./preprocessing/tests/setup_preprocessing/{setup_preprocessing/tests,setup}.py was not found)
Warning: Assuming default configuration (./semi_supervised/{setup_semi_supervised,setup}.py was not found)
Warning: Assuming default configuration (./semi_supervised/tests/setup_semi_supervised/{setup_semi_supervised/tests,setup}.py was not found)
Traceback (most recent call last):
  File "setup.py", line 85, in <module>
    setup(**configuration(top_path='').todict())
  File "setup.py", line 44, in configuration
    config.add_subpackage('cluster')
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/numpy/distutils/misc_util.py", line 1003, in add_subpackage
    caller_level = 2)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/numpy/distutils/misc_util.py", line 972, in get_subpackage
    caller_level = caller_level + 1)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/numpy/distutils/misc_util.py", line 884, in _get_configuration_from_setup_py
    ('.py', 'U', 1))
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/imp.py", line 234, in load_module
    return load_source(name, filename, file)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/imp.py", line 172, in load_source
    module = _load(spec)
  File "<frozen importlib._bootstrap>", line 693, in _load
  File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 662, in exec_module
  File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
  File "./cluster/setup.py", line 8, in <module>
    from sklearn._build_utils import get_blas_info
ImportError: No module named 'sklearn'

On Tue, Jun 14, 2016 at 11:41 AM, <scikit-learn-request at python.org> wrote:

>
> Today's Topics:
>
>    1. Re: Adding BM25 relevance function (Pavel Soriano)
>    2. Re: The culture of commit squashing (Andreas Mueller)
>    3. Re: The culture of commit squashing (Tom DLT)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 14 Jun 2016 16:11:10 +0000
> From: Pavel Soriano <sorianopavel at gmail.com>
> To: Scikit-learn user and developer mailing list
>         <scikit-learn at python.org>
> Subject: Re: [scikit-learn] Adding BM25 relevance function
> Message-ID: <CAN0wWk93r2aw9No65CGiCW5hQG7-oFYVZaMJQpXpegTXMSqPLg at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hey,
>
> Good thing that you are trying to finish this.
>
> Well, I looked into my old notes, and the Delta tf-idf comes from the
> "Delta TFIDF: An Improved Feature Space for Sentiment Analysis"
> <http://ebiquity.umbc.edu/_file_directory_/papers/446.pdf> paper. I guess
> it is not very popular, and apparently it has a drawback: it does not take
> into account the number of times a word occurs in each document while
> calculating the distribution amongst classes. At least that is what I wrote
> in my notes...
>
> As for the delta idf... If it helps, I can look into my old code, because I
> do not know what I was talking about. I guess it has something to do with
> the paper cited before.
>
> Cheers,
>
> Pavel Soriano
>
>
>
>
> On Tue, Jun 14, 2016 at 5:49 PM Basil Beirouti <basilbeirouti at gmail.com>
> wrote:
>
> > Hi Joel,
> >
> > Thanks for your response and for digging up that archived thread; it
> > gives me a lot of clarity.
> >
> > I see your point about BM25, but I think in most cases where TFIDF makes
> > sense, BM25 makes sense as well, though it could be "overkill".
> >
> > Consider that TFIDF does not produce normalized results either
> > <http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#example-text-document-clustering-py>.
> > If BM25 requires dimensionality reduction (e.g. using LSA), so too would
> > TFIDF. The term-document matrix is the same size no matter which
> > weighting scheme is used. The only difference is that BM25 produces
> > better results when the corpus is large enough that the term frequency in
> > a document, and the document frequency in the corpus, can vary
> > considerably across a broad range of values. Maybe you could even say
> > TFIDF and BM25 are the same equation except BM25 has a few additional
> > hyperparameters (b and k).
> >
> > So is the advantage that BM25 provides for large diverse corpora worth
> > it, or is it marginal? Perhaps you can point me to some more examples
> > where TFIDF is used (in a supervised setting preferably) and I can plug in
> > BM25 in place of TFIDF and see how it compares. Here are some I found:
> >
> > http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
> > *(supervised)*
> >
> > http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#example-text-document-clustering-py
> > *(unsupervised)*
> >
> > Thank you!
> > Basil
> >
> > PS: By the way, I'm not familiar with the delta-idf transform that Pavel
> > mentions in the archive you linked; I'll have to delve deeper into that.
> > I agree with the response to Pavel that he should be putting it in a
> > separate class, not adding on to the TFIDF. I think it would take me
> > about 6-8 weeks to adapt my code to the fit/transform model and submit a
> > pull request.
> >
> > _______________________________________________
> > scikit-learn mailing list
> > scikit-learn at python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
> >
> --
> Pavel SORIANO
>
> PhD Student
> ERIC Laboratory
> Université de Lyon
>
> ------------------------------
>
> Message: 2
> Date: Tue, 14 Jun 2016 12:13:29 -0400
> From: Andreas Mueller <t3kcit at gmail.com>
> To: Scikit-learn user and developer mailing list
>         <scikit-learn at python.org>
> Subject: Re: [scikit-learn] The culture of commit squashing
> Message-ID: <57602D29.1070203 at gmail.com>
> Content-Type: text/plain; charset="windows-1252"; Format="flowed"
>
> I'm +1 for using the button when appropriate.
> I think it should be up to the merging person to make a call whether a
> squash is a better logical unit than all the commits.
> I would set like a soft limit at ~5 commits or something. If your PR has
> more than 5 separate big logical units, it's probably too big.
>
> The button is enabled in the settings but I can't see it.
> Am I being stupid?
>
> On 06/14/2016 06:58 AM, Joel Nothman wrote:
> > Sounds good to me. Thank goodness someone reads the documentation!
> >
> > On 14 June 2016 at 19:51, Alexandre Gramfort
> > <alexandre.gramfort at telecom-paristech.fr> wrote:
> >
> >     > We could stop squashing during development, and use the new
> >     > Squash-and-Merge button on GitHub.
> >     > What do you think?
> >
> >     +1
> >
> >     the reason I see for squashing during dev is to avoid killing the
> >     browser when reviewing. It really rarely happens though.
> >
> >     A
>
> ------------------------------
>
> Message: 3
> Date: Tue, 14 Jun 2016 18:40:39 +0200
> From: Tom DLT <tom.duprelatour at orange.fr>
> To: Scikit-learn user and developer mailing list
>         <scikit-learn at python.org>
> Subject: Re: [scikit-learn] The culture of commit squashing
> Message-ID: <CAGKmC=sRMbwo1Pjm=ph3R6OqsmvZUZDBMjvj09yJwkk0+Yq4EA at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> @Andreas
> It's a bit hidden: You need to click on "Merge pull-request", then do *not*
> click on "Confirm merge", but on the small arrow to the right, and select
> "Squash and merge".
>
> ------------------------------
>
>
> End of scikit-learn Digest, Vol 3, Issue 27
> *******************************************
>