[python-uk] Bayesian filter

Carles Pina i Estany carles at pina.cat
Sat May 8 02:10:43 CEST 2010


On May/07/2010, Thomas Dunham wrote:
> Thanks Carles, will try to force some of this into my head this
> weekend....


For me one of the keys is in:

Just above "Using the Bayesian result".

I read the formulas as:
-The probability of the document being spam is the product of the
spam probabilities of each individual word in the document
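
To make that reading concrete, here is a minimal sketch in Python. It
assumes the formula in question is the usual naive Bayes combination,
where the words are treated as independent and the product of the
per-word spam probabilities is normalised against the product of the
per-word non-spam probabilities. The name combine is made up for this
example and is not taken from Reverend.

```python
from math import prod  # Python 3.8+


def combine(probabilities):
    """Combine per-word spam probabilities into one document probability.

    Assumes the words are independent of each other (the "naive" part).
    Inputs must be strictly between 0 and 1, otherwise the denominator
    can become zero.
    """
    spam = prod(probabilities)                 # p1 * p2 * ... * pN
    ham = prod(1.0 - p for p in probabilities)  # (1-p1) * ... * (1-pN)
    return spam / (spam + ham)


if __name__ == "__main__":
    # Two words that each look fairly spammy on their own.
    print(combine([0.9, 0.8]))
```

With a single word the result is just that word's probability, and
words with probability 0.5 do not move the result either way.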

Also interesting here:
Where it talks about "Combining individual probabilities"
(it talks about the assumptions and links to the previous Wikipedia

Another key is in the file reverend/thomas.py, in buildCache, where it
computes the probability of each token belonging to each group. The
thing is that it does some "magic" with the metrics that, at the
moment, I'm not following very well (what it does and why it is needed).

So, at a very high level, it does:
-Tokenize the input
-Save how many times each word appears in the corpus

buildCache (so, part of guessing if no more training is done):
-Calculate, per token, how likely it is to be in each category (and
something else that I'm not following with the good and badMetric

-Tokenize the new input
-Combine the probabilities of each token of the input, using the cache
to know how likely this token is to belong to each category.
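
The steps above can be sketched roughly as follows. This is a toy
written for illustration and is not Reverend's code: the class
TinyBayes, its methods, the tokenizer regex and the 0.01/0.99 clamp are
all assumptions of mine, and it leaves out whatever buildCache does with
the good and bad metrics. The per-token value is simply the token's
relative frequency in one category divided by the sum of its relative
frequencies over all categories.

```python
import re
from collections import Counter, defaultdict
from math import prod  # Python 3.8+


def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters, digits, apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


class TinyBayes:
    def __init__(self):
        self.counts = defaultdict(Counter)  # category -> token -> count
        self.cache = {}                     # category -> token -> probability

    def train(self, category, text):
        # Training: tokenize and record how often each token appears.
        self.counts[category].update(tokenize(text))
        self.cache = {}  # any new training invalidates the cache

    def build_cache(self):
        # Per token, estimate how likely it is to belong to each category.
        totals = {c: sum(cnt.values()) for c, cnt in self.counts.items()}
        vocab = set().union(*self.counts.values())
        self.cache = {c: {} for c in self.counts}
        for token in vocab:
            freq = {c: self.counts[c][token] / totals[c]
                    for c in self.counts if totals[c]}
            overall = sum(freq.values())
            for c, f in freq.items():
                # Clamp so that no single token gives a hard 0 or 1.
                self.cache[c][token] = min(0.99, max(0.01, f / overall))

    def guess(self, text):
        # Guessing: tokenize the new input and combine cached probabilities.
        if not self.cache:
            self.build_cache()
        tokens = set(tokenize(text))
        scores = {}
        for category, probs in self.cache.items():
            known = [probs[t] for t in tokens if t in probs]
            if known:  # tokens never seen in training are ignored
                yes = prod(known)
                no = prod(1.0 - p for p in known)
                scores[category] = yes / (yes + no)
        return scores


if __name__ == "__main__":
    b = TinyBayes()
    b.train("spam", "buy cheap pills now")
    b.train("ham", "meeting notes for the python talk")
    print(b.guess("cheap pills"))
```

Something this small can be checked by hand against its own output,
which might be a useful first step before comparing with Reverend.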

I think that this is a very high level design with some mistake for

If someone can calculate one example by hand and get the same result
as Reverend, they would get some extra points :-D I'm only quite confused
by some things in buildCache...

Carles Pina i Estany
