Re: [scikit-learn] adding BM25 relevance function
Hello Pavel and Joel,
I forked the repository and cloned it on my machine. I'm using pycharm on a Mac, and while looking at text.py, I'm getting an unresolved reference for "xrange" at line 28:
from ..externals.six.moves import range
Pycharm says Function 'six.py' is too large to analyze, so I'm not sure if this error is somehow related to that. I decided to try to build the code as a sanity check but I can't find any reliable instructions as to how to do that. Naively, I opened terminal and cd to the directory above "scikit-learn" folder (where I had cloned my fork) and tried to run:
$ python3 setup.py install
Which didn't work. I got this error:
ImportError: No module named 'sklearn'
Can someone point me in the right direction? And how can the code try to import sklearn if it doesn't exist yet? Note I haven't installed the release version of scikit-learn using pip or any other tool, but I should be able to bootstrap it from the source code, right?
Here's the full error message if it helps. Forgive me if it's a silly mistake, but I haven't found any reliable guidelines online.
  File "setup.py", line 84, in <module>
    from numpy.distutils.core import setup
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/distutils/core.py", line 26, in <module>
    from numpy.distutils.command import config, config_compiler, \
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/distutils/command/build_ext.py", line 18, in <module>
    from numpy.distutils.system_info import combine_paths
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/distutils/system_info.py", line 232, in <module>
    triplet = str(p.communicate()[0].decode().strip())
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 791, in communicate
    stdout = _eintr_retry_call(self.stdout.read)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 476, in _eintr_retry_call
    return func(*args)
KeyboardInterrupt
Basils-MacBook-Pro:sklearn basilbeirouti$ python3 setup.py install
non-existing path in '__check_build': '_check_build.c'
Appending sklearn.__check_build configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.__check_build')
Appending sklearn._build_utils configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn._build_utils')
Appending sklearn.covariance configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.covariance')
Appending sklearn.covariance/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.covariance/tests')
Appending sklearn.cross_decomposition configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.cross_decomposition')
Appending sklearn.cross_decomposition/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.cross_decomposition/tests')
Appending sklearn.feature_selection configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.feature_selection')
Appending sklearn.feature_selection/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.feature_selection/tests')
Appending sklearn.gaussian_process configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.gaussian_process')
Appending sklearn.gaussian_process/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.gaussian_process/tests')
Appending sklearn.mixture configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.mixture')
Appending sklearn.mixture/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.mixture/tests')
Appending sklearn.model_selection configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.model_selection')
Appending sklearn.model_selection/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.model_selection/tests')
Appending sklearn.neural_network configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.neural_network')
Appending sklearn.neural_network/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.neural_network/tests')
Appending sklearn.preprocessing configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.preprocessing')
Appending sklearn.preprocessing/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.preprocessing/tests')
Appending sklearn.semi_supervised configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.semi_supervised')
Appending sklearn.semi_supervised/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.semi_supervised/tests')
Warning: Assuming default configuration (./_build_utils/{setup__build_utils,setup}.py was not found)
Warning: Assuming default configuration (./covariance/{setup_covariance,setup}.py was not found)
Warning: Assuming default configuration (./covariance/tests/setup_covariance/{setup_covariance/tests,setup}.py was not found)
Warning: Assuming default configuration (./cross_decomposition/{setup_cross_decomposition,setup}.py was not found)
Warning: Assuming default configuration (./cross_decomposition/tests/setup_cross_decomposition/{setup_cross_decomposition/tests,setup}.py was not found)
Warning: Assuming default configuration (./feature_selection/{setup_feature_selection,setup}.py was not found)
Warning: Assuming default configuration (./feature_selection/tests/setup_feature_selection/{setup_feature_selection/tests,setup}.py was not found)
Warning: Assuming default configuration (./gaussian_process/{setup_gaussian_process,setup}.py was not found)
Warning: Assuming default configuration (./gaussian_process/tests/setup_gaussian_process/{setup_gaussian_process/tests,setup}.py was not found)
Warning: Assuming default configuration (./mixture/{setup_mixture,setup}.py was not found)
Warning: Assuming default configuration (./mixture/tests/setup_mixture/{setup_mixture/tests,setup}.py was not found)
Warning: Assuming default configuration (./model_selection/{setup_model_selection,setup}.py was not found)
Warning: Assuming default configuration (./model_selection/tests/setup_model_selection/{setup_model_selection/tests,setup}.py was not found)
Warning: Assuming default configuration (./neural_network/{setup_neural_network,setup}.py was not found)
Warning: Assuming default configuration (./neural_network/tests/setup_neural_network/{setup_neural_network/tests,setup}.py was not found)
Warning: Assuming default configuration (./preprocessing/{setup_preprocessing,setup}.py was not found)
Warning: Assuming default configuration (./preprocessing/tests/setup_preprocessing/{setup_preprocessing/tests,setup}.py was not found)
Warning: Assuming default configuration (./semi_supervised/{setup_semi_supervised,setup}.py was not found)
Warning: Assuming default configuration (./semi_supervised/tests/setup_semi_supervised/{setup_semi_supervised/tests,setup}.py was not found)
Traceback (most recent call last):
  File "setup.py", line 85, in <module>
    setup(**configuration(top_path='').todict())
  File "setup.py", line 44, in configuration
    config.add_subpackage('cluster')
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/numpy/distutils/misc_util.py", line 1003, in add_subpackage
    caller_level = 2)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/numpy/distutils/misc_util.py", line 972, in get_subpackage
    caller_level = caller_level + 1)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/numpy/distutils/misc_util.py", line 884, in _get_configuration_from_setup_py
    ('.py', 'U', 1))
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/imp.py", line 234, in load_module
    return load_source(name, filename, file)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/imp.py", line 172, in load_source
    module = _load(spec)
  File "<frozen importlib._bootstrap>", line 693, in _load
  File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 662, in exec_module
  File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
  File "./cluster/setup.py", line 8, in <module>
    from sklearn._build_utils import get_blas_info
ImportError: No module named 'sklearn'
On Tue, Jun 14, 2016 at 11:41 AM, <scikit-learn-request@python.org> wrote:
Send scikit-learn mailing list submissions to scikit-learn@python.org
To subscribe or unsubscribe via the World Wide Web, visit https://mail.python.org/mailman/listinfo/scikit-learn or, via email, send a message with subject or body 'help' to scikit-learn-request@python.org
You can reach the person managing the list at scikit-learn-owner@python.org
When replying, please edit your Subject line so it is more specific than "Re: Contents of scikit-learn digest..."
Today's Topics:
1. Re: Adding BM25 relevance function (Pavel Soriano)
2. Re: The culture of commit squashing (Andreas Mueller)
3. Re: The culture of commit squashing (Tom DLT)
----------------------------------------------------------------------
Message: 1
Date: Tue, 14 Jun 2016 16:11:10 +0000
From: Pavel Soriano <sorianopavel@gmail.com>
To: Scikit-learn user and developer mailing list <scikit-learn@python.org>
Subject: Re: [scikit-learn] Adding BM25 relevance function
Message-ID: <CAN0wWk93r2aw9No65CGiCW5hQG7-oFYVZaMJQpXpegTXMSqPLg@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hey,
Good thing that you are trying to finish this.
Well, I looked into my old notes, and the Delta tf-idf comes from the "Delta TFIDF: An Improved Feature Space for Sentiment Analysis" <http://ebiquity.umbc.edu/_file_directory_/papers/446.pdf> paper. I guess it is not very popular and apparently it has a drawback: it does not take into account the number of times a word occurs in each document while calculating the distribution amongst classes. At least that is what I wrote in my notes...
As for the delta idf... If it helps, I can look into my old code cause I do not know what I was talking about. I guess it has to do somehow with the paper cited before.
Cheers,
Pavel Soriano
On Tue, Jun 14, 2016 at 5:49 PM Basil Beirouti <basilbeirouti@gmail.com> wrote:
Hi Joel,
Thanks for your response and for digging up that archived thread, it gives me a lot of clarity.
I see your point about BM25, but I think in most cases where TFIDF makes sense, BM25 makes sense as well, but it could be "overkill".
Consider that TFIDF does not produce normalized results either <http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#e...>. If BM25 requires dimensionality reduction (e.g. using LSA), so too would TFIDF. The term-document matrix is the same size no matter which weighting scheme is used. The only difference is that BM25 produces better results when the corpus is large enough that the term frequency in a document, and the document frequency in the corpus, can vary considerably across a broad range of values. Maybe you could even say TFIDF and BM25 are the same equation except BM25 has a few additional hyperparameters (b and k).
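To make the "same equation plus two hyperparameters" remark concrete, here is a minimal sketch on a dense count matrix. It is not scikit-learn code: the function names, the smoothed idf formula, and the defaults k=1.2 and b=0.75 are assumptions chosen for illustration, and the textbook Okapi BM25 uses a slightly different idf term.

```python
import numpy as np


def idf(counts):
    """Smoothed inverse document frequency (an assumed variant, shared by both schemes)."""
    counts = np.asarray(counts)
    n_docs = counts.shape[0]
    df = (counts > 0).sum(axis=0)  # number of documents containing each term
    return np.log((1.0 + n_docs) / (1.0 + df)) + 1.0


def tfidf_weights(counts):
    """Plain TF-IDF: raw term frequency times idf, no length normalization."""
    return np.asarray(counts, dtype=float) * idf(counts)


def bm25_weights(counts, k=1.2, b=0.75):
    """BM25-style weights: tf saturates with k, document length is corrected with b.

    Assumes k > 0 and b < 1 so the denominator is never zero.
    """
    counts = np.asarray(counts, dtype=float)
    doc_len = counts.sum(axis=1, keepdims=True)
    avg_len = doc_len.mean()
    length_norm = k * (1.0 - b + b * doc_len / avg_len)
    return idf(counts) * counts * (k + 1.0) / (counts + length_norm)


# Toy term-document counts: rows are documents, columns are terms.
counts = np.array([[1, 0, 2],
                   [10, 1, 0],
                   [0, 3, 1]])
print(tfidf_weights(counts))
print(bm25_weights(counts))
```

Both functions return a matrix of the same shape as the input, which matches the point that the term-document matrix has the same size under either scheme. The visible difference is that the BM25-style weight of a term is capped at idf * (k + 1), while the TF-IDF weight keeps growing linearly with the count.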
So is the advantage that BM25 provides for large diverse corpora worth it, or is it marginal? Perhaps you can point me to some more examples where TFIDF is used (in supervised setting preferably) and I can plug in BM25 in place of TFIDF and see how it compares. Here are some I found:
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_dat... (supervised)
http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#e... (unsupervised)
Thank you! Basil
PS: By the way, I'm not familiar with the delta-idf transform that Pavel mentions in the archive you linked, I'll have to delve deeper into that. I agree with the response to Pavel that he should be putting it in a separate class, not adding on to the TFIDF. I think it would take me about 6-8 weeks to adapt my code to the fit transform model and submit a pull request.
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
-- Pavel SORIANO
PhD Student ERIC Laboratory Université de Lyon
I don't see an unresolved reference to xrange, but I do see that it can't import sklearn. Did you build scikit-learn? See:
http://scikit-learn.org/dev/developers/contributing.html#retrieving-the-latest-code
Either do
make
or
python setup.py build_ext -i
or
python setup.py develop
or
pip install -e .
(which all do slightly different things)
I'd probably go with the first if you have another installation of scikit-learn on your machine and the last if you want to make that your primary installation.
Cheers, Andy
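On the question of how the build can try to import sklearn before it exists: the shell prompt in the traceback suggests that setup.py was run from inside the sklearn package directory, so the checkout root (the directory that contains the sklearn folder) was probably not on the import path when ./cluster/setup.py executed "from sklearn._build_utils import get_blas_info". The sketch below reproduces that situation with a throwaway package; the name demo_pkg and the helper can_import are hypothetical and are not part of scikit-learn.

```python
import importlib
import importlib.util
import os
import sys
import tempfile


def can_import(package, search_dir):
    """Check whether `package` can be found when `search_dir` is first on sys.path."""
    sys.path.insert(0, search_dir)
    importlib.invalidate_caches()
    try:
        return importlib.util.find_spec(package) is not None
    finally:
        sys.path.remove(search_dir)


# Build a tiny stand-in for a source checkout: <root>/demo_pkg/__init__.py
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "demo_pkg")
os.makedirs(pkg_dir)
open(os.path.join(pkg_dir, "__init__.py"), "w").close()

# Comparable to running a script from the checkout root.
print(can_import("demo_pkg", root))
# Comparable to running a script from inside the package directory.
print(can_import("demo_pkg", pkg_dir))
```

If this reading of the traceback is right, running one of the build commands from the top-level scikit-learn directory instead of from sklearn/ should avoid this particular ImportError.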
If xrange is the issue, then the branch you're getting may not have been tested for Python 3. On 16 June 2016 at 03:53, Andreas Mueller <t3kcit@gmail.com> wrote:
I don't see an unresolved reference to xrange, but I do see that it can't import sklearn. Did you built scikit-learn? See:
http://scikit-learn.org/dev/developers/contributing.html#retrieving-the-latest-code\
Either do
make or python setup.py build_ext -i or python setup.py develop or pip install . -e
(which all do slightly different things)
I'd probably go with the first if you have another installation of scikit-learn on your machine and the last if you want to make that your primary installation.
Cheers, Andy
On 06/15/2016 01:01 AM, Basil Beirouti wrote:
Hello Pavel and Joel,
I forked the repository and cloned it on my machine. I'm using pycharm on a Mac, and while looking at text.py, I'm getting an unresolved reference for "xrange" at line 28:
from ..externals.six.moves import range
Pycharm says Function 'six.py' is too large to analyze, so I'm not sure if this error is somehow related to that. I decided to try to build the code as a sanity check but I can't find any reliable instructions as to how to do that. Naively, I opened terminal and cd to the directory above "scikit-learn" folder (where I had cloned my fork) and tried to run:
$ python3 setup.py install
Which didn't work. I got this error:
ImportError: No module named 'sklearn'
Can someone point me in the right direction? And how can the code try to import sklearn if it doesn't exist yet? Note I haven't installed the release version of scikit-learn using pip or any other tool, but I should be able to bootstrap it from the source code, right?
Here's the full error message if it helps. Forgive me if it's a silly mistake, but I haven't found any reliable guidelines online.
File "setup.py", line 84, in <module>
from numpy.distutils.core import setup
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/distutils/core.py", line 26, in <module>
from numpy.distutils.command import config, config_compiler, \
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/distutils/command/build_ext.py", line 18, in <module>
from numpy.distutils.system_info import combine_paths
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/numpy/distutils/system_info.py", line 232, in <module>
triplet = str(p.communicate()[0].decode().strip())
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 791, in communicate
stdout = _eintr_retry_call(self.stdout.read)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 476, in _eintr_retry_call
return func(*args)
KeyboardInterrupt
Basils-MacBook-Pro:sklearn basilbeirouti$ python3 setup.py install
non-existing path in '__check_build': '_check_build.c'
Appending sklearn.__check_build configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.__check_build')
Appending sklearn._build_utils configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn._build_utils')
Appending sklearn.covariance configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.covariance')
Appending sklearn.covariance/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.covariance/tests')
Appending sklearn.cross_decomposition configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.cross_decomposition')
Appending sklearn.cross_decomposition/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.cross_decomposition/tests')
Appending sklearn.feature_selection configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.feature_selection')
Appending sklearn.feature_selection/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.feature_selection/tests')
Appending sklearn.gaussian_process configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.gaussian_process')
Appending sklearn.gaussian_process/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.gaussian_process/tests')
Appending sklearn.mixture configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.mixture')
Appending sklearn.mixture/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.mixture/tests')
Appending sklearn.model_selection configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.model_selection')
Appending sklearn.model_selection/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.model_selection/tests')
Appending sklearn.neural_network configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.neural_network')
Appending sklearn.neural_network/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.neural_network/tests')
Appending sklearn.preprocessing configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.preprocessing')
Appending sklearn.preprocessing/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.preprocessing/tests')
Appending sklearn.semi_supervised configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.semi_supervised')
Appending sklearn.semi_supervised/tests configuration to sklearn
Ignoring attempt to set 'name' (from 'sklearn' to 'sklearn.semi_supervised/tests')
Warning: Assuming default configuration (./_build_utils/{setup__build_utils,setup}.py was not found)
Warning: Assuming default configuration (./covariance/{setup_covariance,setup}.py was not found)
Warning: Assuming default configuration (./covariance/tests/setup_covariance/{setup_covariance/tests,setup}.py was not found)
Warning: Assuming default configuration (./cross_decomposition/{setup_cross_decomposition,setup}.py was not found)
Warning: Assuming default configuration (./cross_decomposition/tests/setup_cross_decomposition/{setup_cross_decomposition/tests,setup}.py was not found)
Warning: Assuming default configuration (./feature_selection/{setup_feature_selection,setup}.py was not found)
Warning: Assuming default configuration (./feature_selection/tests/setup_feature_selection/{setup_feature_selection/tests,setup}.py was not found)
Warning: Assuming default configuration (./gaussian_process/{setup_gaussian_process,setup}.py was not found)
Warning: Assuming default configuration (./gaussian_process/tests/setup_gaussian_process/{setup_gaussian_process/tests,setup}.py was not found)
Warning: Assuming default configuration (./mixture/{setup_mixture,setup}.py was not found)
Warning: Assuming default configuration (./mixture/tests/setup_mixture/{setup_mixture/tests,setup}.py was not found)
Warning: Assuming default configuration (./model_selection/{setup_model_selection,setup}.py was not found)
Warning: Assuming default configuration (./model_selection/tests/setup_model_selection/{setup_model_selection/tests,setup}.py was not found)
Warning: Assuming default configuration (./neural_network/{setup_neural_network,setup}.py was not found)
Warning: Assuming default configuration (./neural_network/tests/setup_neural_network/{setup_neural_network/tests,setup}.py was not found)
Warning: Assuming default configuration (./preprocessing/{setup_preprocessing,setup}.py was not found)
Warning: Assuming default configuration (./preprocessing/tests/setup_preprocessing/{setup_preprocessing/tests,setup}.py was not found)
Warning: Assuming default configuration (./semi_supervised/{setup_semi_supervised,setup}.py was not found)
Warning: Assuming default configuration (./semi_supervised/tests/setup_semi_supervised/{setup_semi_supervised/tests,setup}.py was not found)
Traceback (most recent call last):
File "setup.py", line 85, in <module>
setup(**configuration(top_path='').todict())
File "setup.py", line 44, in configuration
config.add_subpackage('cluster')
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/numpy/distutils/misc_util.py", line 1003, in add_subpackage
caller_level = 2)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/numpy/distutils/misc_util.py", line 972, in get_subpackage
caller_level = caller_level + 1)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/numpy/distutils/misc_util.py", line 884, in _get_configuration_from_setup_py
('.py', 'U', 1))
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/imp.py", line 234, in load_module
return load_source(name, filename, file)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/imp.py", line 172, in load_source
module = _load(spec)
File "<frozen importlib._bootstrap>", line 693, in _load
File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 662, in exec_module
File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
File "./cluster/setup.py", line 8, in <module>
from sklearn._build_utils import get_blas_info
ImportError: No module named 'sklearn'
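One possible reading of the traceback above (an assumption, not something confirmed in this thread): the setup.py being executed appears to be the one inside the sklearn package, since it loads ./cluster/setup.py, which in turn runs "from sklearn._build_utils import get_blas_info". When Python runs a script, it searches the script's own directory for imports, and the sklearn package is not visible from inside itself. The sketch below reproduces that situation with a throwaway package; the name demo_pkg and the helper importable_from are hypothetical and stand in for sklearn and the import machinery.

```python
import importlib.machinery
import os
import tempfile


def importable_from(directory, package_name):
    """Return True if package_name is found when only `directory` is searched."""
    spec = importlib.machinery.PathFinder.find_spec(package_name, [directory])
    return spec is not None


def demo():
    # Build a throwaway tree: <root>/demo_pkg/__init__.py
    # ("demo_pkg" is a hypothetical stand-in for "sklearn").
    with tempfile.TemporaryDirectory() as root:
        pkg_dir = os.path.join(root, "demo_pkg")
        os.makedirs(pkg_dir)
        with open(os.path.join(pkg_dir, "__init__.py"), "w"):
            pass
        # Searching the directory that CONTAINS the package finds it ...
        from_root = importable_from(root, "demo_pkg")
        # ... searching from INSIDE the package does not.
        from_inside = importable_from(pkg_dir, "demo_pkg")
    return from_root, from_inside


from_root, from_inside = demo()
print("found when searching the containing directory:", from_root)
print("found when searching inside the package:", from_inside)
```

If this reading is right, running the top-level setup.py from the repository root (the directory that contains the sklearn folder) would avoid the ImportError, because that directory is then the one searched for "sklearn".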
On Tue, Jun 14, 2016 at 11:41 AM, <scikit-learn-request@python.org> wrote:
Send scikit-learn mailing list submissions to scikit-learn@python.org
To subscribe or unsubscribe via the World Wide Web, visit https://mail.python.org/mailman/listinfo/scikit-learn or, via email, send a message with subject or body 'help' to scikit-learn-request@python.org
You can reach the person managing the list at scikit-learn-owner@python.org
When replying, please edit your Subject line so it is more specific than "Re: Contents of scikit-learn digest..."
Today's Topics:
1. Re: Adding BM25 relevance function (Pavel Soriano)
2. Re: The culture of commit squashing (Andreas Mueller)
3. Re: The culture of commit squashing (Tom DLT)
----------------------------------------------------------------------
Message: 1
Date: Tue, 14 Jun 2016 16:11:10 +0000
From: Pavel Soriano <sorianopavel@gmail.com>
To: Scikit-learn user and developer mailing list <scikit-learn@python.org>
Subject: Re: [scikit-learn] Adding BM25 relevance function
Message-ID: <CAN0wWk93r2aw9No65CGiCW5hQG7-oFYVZaMJQpXpegTXMSqPLg@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hey,
Good thing that you are trying to finish this.
Well, I looked into my old notes, and the Delta tf-idf comes from the paper "Delta TFIDF: An Improved Feature Space for Sentiment Analysis" <http://ebiquity.umbc.edu/_file_directory_/papers/446.pdf>. I guess it is not very popular, and apparently it has a drawback: it does not take into account the number of times a word occurs in each document when calculating the distribution among classes. At least that is what I wrote in my notes...
As for the delta idf... If it helps, I can look into my old code, because I do not know what I was talking about. I guess it has something to do with the paper cited above.
Cheers,
Pavel Soriano
On Tue, Jun 14, 2016 at 5:49 PM Basil Beirouti <basilbeirouti@gmail.com> wrote:
Hi Joel,
Thanks for your response and for digging up that archived thread, it gives me a lot of clarity.
I see your point about BM25, but I think that in most cases where TFIDF makes sense, BM25 makes sense as well, although it could be "overkill".
Consider that TFIDF does not produce normalized results either <http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#e...>. If BM25 requires dimensionality reduction (e.g. using LSA), so too would TFIDF. The term-document matrix is the same size no matter which weighting scheme is used. The only difference is that BM25 produces better results when the corpus is large enough that the term frequency in a document, and the document frequency in the corpus, can vary across a broad range of values. Maybe you could even say TFIDF and BM25 are the same equation, except that BM25 has a few additional hyperparameters (b and k).
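The remark that TFIDF and BM25 differ mainly in the extra hyperparameters can be sketched in a few lines of NumPy. This is only an illustration, not scikit-learn code and not the implementation discussed in this thread: the function names, the toy count matrix, and the plain log(N/df) IDF shared by both weightings are assumptions made for the example (published BM25 variants usually use a different, smoothed IDF).

```python
import numpy as np


def idf(counts):
    """Plain IDF, log(N / df), per term (column). Assumed form, for illustration."""
    n_docs = counts.shape[0]
    df = np.count_nonzero(counts, axis=0)
    return np.log(n_docs / np.maximum(df, 1))


def tfidf_weights(counts):
    """Raw term count times IDF; no length normalisation, no saturation."""
    return counts * idf(counts)


def bm25_weights(counts, k=1.2, b=0.75):
    """BM25-style weights: the count saturates (k) and is length-normalised (b).

    Assumes k > 0 and, when b == 1, that no document is empty.
    """
    doc_len = counts.sum(axis=1, keepdims=True)   # shape (n_docs, 1)
    avg_len = doc_len.mean()
    norm = k * (1.0 - b + b * doc_len / avg_len)  # per-document normaliser
    tf_part = counts * (k + 1.0) / (counts + norm)
    return tf_part * idf(counts)


# Toy document-term count matrix: 3 documents (rows), 3 terms (columns).
counts = np.array([[3.0, 0.0, 1.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 2.0, 2.0]])

tfidf = tfidf_weights(counts)
bm25 = bm25_weights(counts)
print(tfidf.shape == bm25.shape)  # both have the shape of the count matrix
```

With b = 0 and a very large k, the BM25 term-frequency factor approaches the raw count, so under these assumptions the two weightings coincide. A smaller k makes repeated occurrences of a term saturate, and b controls how strongly long documents are penalised.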
So is the advantage that BM25 provides for large, diverse corpora worth it, or is it marginal? Perhaps you can point me to some more examples where TFIDF is used (preferably in a supervised setting), and I can plug in BM25 in place of TFIDF and see how it compares. Here are some I found:
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_dat... *(supervised)*
http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#e... *(unsupervised)*
Thank you! Basil
PS: By the way, I'm not familiar with the delta-idf transform that Pavel mentions in the archive you linked; I'll have to delve deeper into that. I agree with the response to Pavel that he should be putting it in a separate class, not adding on to the TFIDF. I think it would take me about 6-8 weeks to adapt my code to the fit/transform model and submit a pull request.
_______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn
-- Pavel SORIANO
PhD Student ERIC Laboratory Université de Lyon