[python-uk] Bayesian filter

Carles Pina i Estany carles at pina.cat
Fri May 7 22:29:52 CEST 2010


Hello,

(scroll to PYTHON TEST for a test)

I've taken another look at the Bayesian filter (it was not my "task" :-)
but it was a pleasure).

OK, to start: Reverend tokenizes the training texts and works only at
token level, not sub-token level. So we should not expect it to detect
c0mputer as computer (quite a common mistake yesterday, I think).
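
For example, with Reverend's train() and guess() calls (a toy sketch,
pool names and texts made up):

from reverend.thomas import Bayes

guesser = Bayes()
guesser.train('english', 'computer keyboard screen')
guesser.train('leet',    'c0mput3r k3yb0ard scr33n')

# Tokens are compared whole: 'c0mputer' is not "almost computer",
# it is simply a token the filter has never seen during training.
print(guesser.guess('my computer needs a new keyboard'))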

(I was writing a high-level mathematical description, but I will
postpone it -until I have checked a few more things- or just leave it
for the Pythoner who presents it at the next Meetup ;-) )

PYTHON TEST
carles@pinux:~/bayes$ ls training/
bash  c++  python

Each of these directories contains between 18 and 29 files that I've
copied at random from different places on my hard disk.

Then I have:
carles@pinux:~/bayes$ ls guessing/
demanar.py  keymap.sh  medium.py  qdacco.cpp
carles@pinux:~/bayes$ 

some other files that I've copied there...

The Bayesian filter never knows the name of the file.

Using just this set for training, look at the results:

----- Start test
./guessing/qdacco.cpp [('c++', 0.6590693537529797), ('python',
0.59287521198182513), ('bash', 0.28091954259046653)]

./guessing/demanar.py [('python', 0.58882188718297557), ('c++',
0.57869106382644175), ('bash', 0.36380374534210203)]

./guessing/keymap.sh [('bash', 0.54270073170250122), ('c++',
0.47142124856042872), ('python', 0.36321294599284148)]

./guessing/main.py [('python', 0.65909707358336711), ('c++',
0.52731742496139433), ('bash', 0.3261511618248264)]

I consider it quite good. bayes.py is 30 lines long -it could be less-
and it works pretty well, even with only parts of a program (and don't
tell me to check for #include, #!/bin/bash or #!/usr/bin/python: that is
not needed at all, it works with snippets of code, etc.).
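
For anyone who wants to reproduce it, the script is more or less along
these lines (a sketch, not the exact bayes.py; only the directory names
from above are real):

#!/usr/bin/python
# Sketch of a language guesser built on top of Reverend.
import os
from reverend.thomas import Bayes

TRAINING_DIR = 'training'
GUESSING_DIR = 'guessing'

guesser = Bayes()

# Training: one pool per subdirectory (bash, c++, python), fed with the
# raw contents of every file. The file names are never given to Reverend.
for language in os.listdir(TRAINING_DIR):
    language_dir = os.path.join(TRAINING_DIR, language)
    for file_name in os.listdir(language_dir):
        guesser.train(language,
                      open(os.path.join(language_dir, file_name)).read())

# Guessing: print the (pool, probability) list for each unknown file.
print('----- Start test')
for file_name in sorted(os.listdir(GUESSING_DIR)):
    path = os.path.join(GUESSING_DIR, file_name)
    print('./%s %s' % (path, guesser.guess(open(path).read())))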

Yes, there is one case where the guess is Python with c++ not far
behind. I probably need a bigger data set, but even then, if it guesses
"quite well" then it is "quite good" :)

(I'm thinking, for example, of a service like pastebin that would guess
the language of the code you are pasting there, and if you corrected the
guess it could train itself with the new code.)

My training sets are very noisy, and I should subclass Reverend and
improve the tokenizer to use "=", "(", ")" and other characters as
separators, because right now a line like:
        linia=random.randint(1,float(total_paraules))
is a single token...

The literals should probably be removed as well.
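
Something like this tokenizer would already help (a sketch; I am
assuming Bayes() accepts a tokenizer object with a tokenize() method,
which is how I remember it; if not, subclassing achieves the same):

import re

from reverend.thomas import Bayes


class CodeTokenizer:
    """Split source code on whitespace and punctuation, after crudely
    dropping string literals, so that a line like
        linia=random.randint(1,float(total_paraules))
    becomes several tokens instead of one."""

    STRING_RE = re.compile(r"""("[^"]*"|'[^']*')""")  # very naive literal matcher
    WORD_RE = re.compile(r'\w+')

    def tokenize(self, text):
        text = self.STRING_RE.sub(' ', text)  # remove the literals
        return self.WORD_RE.findall(text)     # '=', '(', ')' etc. act as separators


# I think the constructor takes a tokenizer; otherwise, subclass Bayes
# and make it use CodeTokenizer internally.
guesser = Bayes(tokenizer=CodeTokenizer())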

I'm taking a look at the statistics part. Here is a good starting point:
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
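
To give a flavour of what the classifier computes: the naive Bayes score
of a pool is the prior times the product of the per-token likelihoods.
A toy version with made-up counts (this is the textbook formula from
that page, not how Reverend combines probabilities internally; note that
its numbers above don't even sum to 1):

# Toy token counts per pool; in reality these come from the training files.
counts = {
    'python': {'def': 5, 'import': 4, 'self': 6},
    'bash':   {'echo': 7, 'fi': 3, 'do': 4},
}

def score(tokens, pool):
    """P(pool) * product over tokens of P(token|pool), with +1 smoothing."""
    total = sum(counts[pool].values())
    vocabulary = len(set(t for c in counts.values() for t in c))
    p = 1.0 / len(counts)  # uniform prior P(pool)
    for token in tokens:
        p *= (counts[pool].get(token, 0) + 1.0) / (total + vocabulary)
    return p

for pool in counts:
    print('%s: %.6f' % (pool, score(['import', 'self'], pool)))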

Cheers,

-- 
Carles Pina i Estany
	http://pinux.info

