trying to improve my knn algorithm
Raine Pretorius
raine.pretorius at pretoriusse.net
Thu Jul 2 05:25:27 EDT 2020
Hi,
I think you sent this to the wrong person.
Kind Regards,
Raine Pretorius
-------- Original message --------
From: Peter Otten <__peter__ at web.de>
Date: 2020/07/02 11:09 (GMT+02:00)
To: python-list at python.org
Subject: Re: trying to improve my knn algorithm
kyrohammy at gmail.com wrote:
> This is another account but I am the op. What do you mean by normalize?
> Sorry, I’m new at this.
Take three texts containing the words
covid, vaccine, program, python
Some preparatory imports because I'm using numpy:
>>> from numpy import array
>>> from numpy.linalg import norm
The texts as vectors, the first entry representing "covid" etc.:
>>> text1 = array([1, 1, 0, 0]) # a short text about health
>>> text2 = array([5, 5, 0, 0]) # a longer text about health
>>> text3 = array([0, 0, 1, 1]) # a short text about programming in Python
Using your distance algorithm you get
>>> norm(text1-text2)
5.6568542494923806
>>> norm(text1-text3)
2.0
The two short texts come out closer together, i.e. more similar, than the
two texts about the same topic!
You get a better result if you divide by the total number of words, i.e.
replace the absolute word counts with relative word frequencies:
>>> text1/text1.sum()
array([ 0.5, 0.5, 0. , 0. ])
>>> norm(text1/text1.sum() - text2/text2.sum())
0.0
>>> norm(text1/text1.sum() - text3/text3.sum())
1.0
or normalize the vector length:
>>> norm(text1/norm(text1) - text2/norm(text2))
0.0
>>> norm(text1/norm(text1) - text3/norm(text3))
1.4142135623730949
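
If you want to plug this into your knn code, the two normalizations above could be wrapped up roughly like this. This is only a sketch: the function names (relative_frequency, unit_length, distance) are made up for illustration, and returning an all-zero vector unchanged for an empty text is my assumption, not something your code necessarily needs.

```python
import numpy as np

def relative_frequency(counts):
    """Scale word counts so that the entries sum to 1."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    # Assumption: an empty text (all zeros) is returned unchanged.
    return counts / total if total else counts

def unit_length(counts):
    """Scale word counts so that the vector has Euclidean length 1."""
    counts = np.asarray(counts, dtype=float)
    length = np.linalg.norm(counts)
    # Assumption: an empty text (all zeros) is returned unchanged.
    return counts / length if length else counts

def distance(a, b, normalize=unit_length):
    """Euclidean distance between two texts after normalizing both."""
    return np.linalg.norm(normalize(a) - normalize(b))

text1 = [1, 1, 0, 0]  # a short text about health
text2 = [5, 5, 0, 0]  # a longer text about health
text3 = [0, 0, 1, 1]  # a short text about programming in Python

same_topic = distance(text1, text2)
other_topic = distance(text1, text3)
print(same_topic < other_topic)
```

With either normalization the two health texts end up (almost) on top of each other, while the programming text stays far away, so text length no longer dominates the neighbor search.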
--
https://mail.python.org/mailman/listinfo/python-list