trying to improve my knn algorithm
Peter Otten
__peter__ at web.de
Thu Jul 2 05:06:23 EDT 2020
kyrohammy at gmail.com wrote:
> This is another account but I am the op. What do you mean by normalize?
> Sorry, I’m new at this.
Take three texts containing the words
covid, vaccine, program, python
Some preparatory imports because I'm using numpy:
>>> from numpy import array
>>> from numpy.linalg import norm
The texts as vectors, the first entry representing "covid" etc.:
>>> text1 = array([1, 1, 0, 0]) # a short text about health
>>> text2 = array([5, 5, 0, 0]) # a longer text about health
>>> text3 = array([0, 0, 1, 1]) # a short text about programming in Python
Using your distance algorithm you get
>>> norm(text1-text2)
5.6568542494923806
>>> norm(text1-text3)
2.0
By this measure the two short texts (text1 and text3) come out more similar
to each other than the two texts about the same topic (text1 and text2)!
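To see why this misleads k-NN, here is a minimal 1-nearest-neighbour check
using the vectors above (the labels and setup are mine, just for
illustration):

```python
import numpy as np
from numpy.linalg import norm

text1 = np.array([1, 1, 0, 0])  # short health text (the query)
text2 = np.array([5, 5, 0, 0])  # long health text
text3 = np.array([0, 0, 1, 1])  # short programming text

# Distance from the query to each labelled candidate
candidates = {"health": text2, "programming": text3}
nearest = min(candidates, key=lambda label: norm(text1 - candidates[label]))
print(nearest)  # "programming" -- the wrong topic wins on raw counts
```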
You get a better result if you divide by the total number of words, i.e.
replace the absolute word count with the relative word frequency:
>>> text1/text1.sum()
array([ 0.5, 0.5, 0. , 0. ])
>>> norm(text1/text1.sum() - text2/text2.sum())
0.0
>>> norm(text1/text1.sum() - text3/text3.sum())
1.0
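The same step as a small reusable sketch (the helper name `to_frequencies`
is mine, not from the thread):

```python
import numpy as np
from numpy.linalg import norm

def to_frequencies(counts):
    """Convert absolute word counts to relative word frequencies."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

text1 = np.array([1, 1, 0, 0])
text2 = np.array([5, 5, 0, 0])
text3 = np.array([0, 0, 1, 1])

# After dividing by total word count the two health texts coincide
print(norm(to_frequencies(text1) - to_frequencies(text2)))  # 0.0
print(norm(to_frequencies(text1) - to_frequencies(text3)))  # 1.0
```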
or normalize the vector length:
>>> norm(text1/norm(text1) - text2/norm(text2))
0.0
>>> norm(text1/norm(text1) - text3/norm(text3))
1.4142135623730949
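For unit vectors u and v we have |u - v|^2 = 2 - 2*(u . v), so the
length-normalized distance is monotonically related to cosine similarity,
and ranking neighbours by one is equivalent to ranking by the other. A
sketch of that connection (the helper names are mine):

```python
import numpy as np
from numpy.linalg import norm

def unit_distance(a, b):
    """Euclidean distance after scaling both vectors to unit length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return norm(a / norm(a) - b / norm(b))

def cosine_similarity(a, b):
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return a.dot(b) / (norm(a) * norm(b))

text1 = np.array([1, 1, 0, 0])
text3 = np.array([0, 0, 1, 1])

d = unit_distance(text1, text3)
c = cosine_similarity(text1, text3)
# d should equal sqrt(2 - 2*c); here the vectors are orthogonal, so c == 0
print(d, np.sqrt(2 - 2 * c))
```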