[scikit-learn] sklearn - knn sklearn.neighbors kneighbors function producing unexpected result for text analysis?

Tue Apr 18 06:15:27 EDT 2017

To help with debugging, perhaps set return_distance=True in the kneighbors call so you can see the distances as well as the indices.
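
A minimal sketch of that check, assuming a toy three-document corpus in place of the wiki file (the texts and variable names are made up for illustration). When distances are requested, kneighbors returns a (distances, indices) pair, and a query that is itself a row of the fitted matrix should come back first with distance 0.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy corpus standing in for wiki['text'] (hypothetical data).
texts = [
    "the cat sat on the mat",
    "the cat sat on the sofa",
    "stock markets fell sharply today",
]

X = CountVectorizer().fit_transform(texts)  # sparse bag-of-words matrix

# Brute-force search with Minkowski p=2, i.e. Euclidean distance.
model = NearestNeighbors(n_neighbors=2, algorithm="brute", p=2).fit(X)

# Query with row 0 and ask for distances as well as indices.
distances, indices = model.kneighbors(X[0], return_distance=True)

print(indices)    # row positions in X, nearest first
print(distances)  # matching Euclidean distances
```

If the first index is not the query's own row, or its distance is not 0, the query vector is probably not the row you intended.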

On 16 Apr 2017 9:19 pm, "Evaristo Caraballo via scikit-learn" <
scikit-learn at python.org> wrote:

> I have been asked to implement a simple knn for text similarity analysis.
> I tried using the sklearn.neighbors module.
> The file to be analysed consists of 2 relevant columns: "text" and "name".
> The knn model should be fitted with the bag-of-words of a corpus of around
> 60,000 pre-treated text fragments of about 200 words each. I used
> CountVectorizer.
> As a test I was asked to use the model to get the names in the "name" column
> for the 10 text strings that are closest to a pre-selected one
> that also exists in the corpus used to initialise the knn model. Similarity
> distance should be measured using a Euclidean metric.
> I used the kneighbors function to obtain the closest neighbors.
> Below you can find the code I was trying to implement using kneighbors:
>
> import os, sys
> import sklearn
> import sklearn.neighbors as sk_neighbors
> from sklearn.feature_extraction.text import CountVectorizer
> import pandas
> import scipy
> import matplotlib.pyplot as plt
> import numpy as np
> %matplotlib inline
>
> wiki = pandas.read_csv('wiki_filefragment.csv')
>
> mod_count_vect = CountVectorizer()
> count_vect = mod_count_vect.fit_transform(wiki['text'])
> print(count_vect.shape)
> mod_count_vect.get_feature_names()
>
> mod_enc = sklearn.preprocessing.LabelEncoder().fit(wiki['name'])
> enc = mod_enc.transform(wiki['name'])
> enc
>
> model = sk_neighbors.NearestNeighbors( n_neighbors=10, algorithm='brute',  p = 2 ) #no matter what I use, it is always the same
> modelfit = model.fit(count_vect, enc)
> #also likely the kneighbors is not working?
> print( mod_enc.inverse_transform( modelfit.kneighbors( count_vect[mod_enc.transform( ['Franz Rottensteiner'] )], n_neighbors=11, return_distance=False ) ) )
>
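
One thing worth checking in the code above, offered as a guess rather than a confirmed diagnosis: LabelEncoder assigns codes by sorted label order, so mod_enc.transform(['Franz Rottensteiner']) gives the name's rank among the sorted names, which need not be its row in count_vect. Likewise, kneighbors returns row positions, which inverse_transform would then decode as if they were label codes. The sketch below uses a made-up three-row frame (names and texts are hypothetical) to show the mismatch and a position-based lookup instead.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame whose rows are NOT sorted by name.
wiki = pd.DataFrame({
    "name": ["Zed", "Alice", "Mallory"],
    "text": ["z text", "a text", "m text"],
})

enc = LabelEncoder().fit(wiki["name"])

# Label code: position of the name among the sorted unique names.
label_code = int(enc.transform(["Zed"])[0])

# Row position: where the name actually sits in the frame, and so in
# any matrix built row by row from wiki['text'].
row_pos = int((wiki["name"] == "Zed").to_numpy().nonzero()[0][0])

print(label_code, row_pos)

# Map row positions (as kneighbors returns them) back to names by position.
neighbor_rows = [0, 2]  # stand-in for one row of kneighbors indices
print(wiki["name"].iloc[neighbor_rows].tolist())
```

If the two numbers differ on the real file, indexing count_vect with the label code selects a different document than the one intended.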
> This implementation gave me the following results for the first 10 nearest
> neighbors to 'Franz Rottensteiner':
>
> Franz Rottensteiner, Ren%C3%A9 Froger, Ichikawa Ennosuke III, Tofusquirrel,
> M. G. Sheftall, Peter Maurer, Allan Weisbecker, Ferdinand Knobloch,
> Andrea Foulkes, Alan W. Meerow, John Warner (writer)
>
> These results are far from the test solution (which uses GraphLab Create
> and SFrame), which is:
>
> Franz Rottensteiner, Ian Mitchell (author), Rajiva Wijesinha, Andr%C3%A9
> Hurst, Leslie R. Landrum, Andrew Pinsent, Alan W. Meerow, John Angus
> Campbell, Antonello Bonci, Henkjan Honing, Joseph Born Kadane
>
> In fact, I tried a simple brute-force implementation, iterating over the
> list of texts and calculating distances with scipy, and that gave me the
> expected results. The result was the same under Python 2.7.
> A link to the implementations (the one that doesn't work and the one that
> does), together with a fragment of the file used for this test, can be found
> in this
> Gist <https://gist.github.com/evaristoc/eb2f2d91524b874c4db6638359e32b0f>.
> Can anyone suggest what is wrong with my sklearn implementation?
> Relevant resources are: - Anaconda Python3.5 (with a virtualenv using 2.7) -
> Jupyter - sklearn 0.18 - pandas
>
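
The brute-force check described above might look roughly like the following. This is a sketch under assumptions, not the code from the Gist: the small dense matrix is invented, and scipy.spatial.distance.euclidean is just one scipy routine that fits the description.

```python
import numpy as np
from scipy.spatial.distance import euclidean

# Made-up bag-of-words counts, one row per document (hypothetical data).
counts = np.array([
    [2, 1, 0, 0],
    [2, 1, 1, 0],
    [0, 0, 3, 4],
    [1, 0, 0, 0],
])

query_row = 0  # row position of the pre-selected text

# Distance from the query row to every row, one pair at a time.
dists = [euclidean(counts[query_row], row) for row in counts]

# Row positions ordered from nearest to farthest.
order = np.argsort(dists)

top_k = 3
print(order[:top_k].tolist())
```

Comparing this ordering with the indices returned by kneighbors for the same query row is a direct way to see whether the two approaches disagree on distances or only on how rows are mapped back to names.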