[python-uk] Bayesian filter
Carles Pina i Estany
carles at pina.cat
Fri May 7 22:29:52 CEST 2010
(scroll to PYTHON TEST for a test)
I've taken another look at the Bayesian filter (it was not my "task" :-)
but it's my pleasure).
OK, to start: Reverend tokenizes the training texts and works only at the
token level, not the sub-token level. So we should not expect it to
detect c0mputer as computer (quite a common mistake yesterday, I think).
(I was writing a high-level mathematical description, but I will postpone
it until I have checked some more things, or just leave it for the Pythoner
who will do it for the next Meetup ;-) )
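A minimal sketch of the token-level point, using a plain whitespace tokenizer as a stand-in (this is an assumption for illustration, not Reverend's own tokenizer). Since whole tokens are compared, "c0mputer" is simply a token that was never seen in training:

```python
import re

def tokenize(text):
    # Hypothetical word-level tokenizer: split on whitespace, lowercase.
    return [tok.lower() for tok in re.split(r"\s+", text) if tok]

# Tokens seen during "training".
trained = set(tokenize("my computer is slow"))

# Tokens of a new text; only exact token matches count.
unseen = tokenize("my c0mputer is slow")
unknown = [tok for tok in unseen if tok not in trained]

print(unknown)
```

Here only "c0mputer" ends up in `unknown`; nothing links it to "computer".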
carles@pinux:~/bayes$ ls training/
bash c++ python
Each of these directories contains between 18 and 29 files that I've
copied at random from different places on my hard disk.
Then I have:
carles@pinux:~/bayes$ ls guessing/
demanar.py keymap.sh medium.py qdacco.cpp
carles@pinux:~/bayes$
some other files that I've copied there...
The Bayesian filter never knows the name of the file.
Using just this set for training, look at the results:
----- Start test
./guessing/qdacco.cpp [('c++', 0.6590693537529797), ('python',
0.59287521198182513), ('bash', 0.28091954259046653)]
./guessing/demanar.py [('python', 0.58882188718297557), ('c++',
0.57869106382644175), ('bash', 0.36380374534210203)]
./guessing/keymap.sh [('bash', 0.54270073170250122), ('c++',
0.47142124856042872), ('python', 0.36321294599284148)]
./guessing/main.py [('python', 0.65909707358336711), ('c++',
0.52731742496139433), ('bash', 0.3261511618248264)]
I consider it quite good. bayes.py is 30 lines long (it could be less) and
it works pretty well, even when given only parts of a program (don't tell
me to check for #include, #!/bin/bash or #!/usr/bin/python: that's not needed
at all, it works with snippets of code, etc.)
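For readers who want to try the same train-then-guess workflow without installing anything, here is a rough stdlib-only sketch. It is not bayes.py and not Reverend: `TinyBayes` and `train_from_dirs` are hypothetical names, and the scoring is a plain naive Bayes with Laplace smoothing and equal priors, so the numbers will not match the ones above. It assumes a layout like training/<language>/<files>.

```python
import math
import os
from collections import Counter, defaultdict

class TinyBayes:
    """Hypothetical stand-in for a trainable guesser (not Reverend's API)."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # label -> token -> count

    @staticmethod
    def tokenize(text):
        return text.split()  # whitespace tokens only

    def train(self, label, text):
        self.counts[label].update(self.tokenize(text))

    def guess(self, text):
        if not self.counts:
            return []
        vocab = set()
        for counter in self.counts.values():
            vocab.update(counter)
        tokens = self.tokenize(text)
        log_scores = {}
        for label, counter in self.counts.items():
            total = sum(counter.values())
            # Laplace smoothing so unseen tokens do not zero the score.
            log_scores[label] = sum(
                math.log((counter[tok] + 1) / (total + len(vocab)))
                for tok in tokens)
        # Turn log scores into values that sum to 1, best guess first.
        top = max(log_scores.values())
        weights = {l: math.exp(s - top) for l, s in log_scores.items()}
        norm = sum(weights.values())
        return sorted(((l, w / norm) for l, w in weights.items()),
                      key=lambda pair: pair[1], reverse=True)

def train_from_dirs(guesser, root):
    # Each sub-directory name is used as the label for the files inside it.
    for label in sorted(os.listdir(root)):
        folder = os.path.join(root, label)
        if not os.path.isdir(folder):
            continue
        for name in sorted(os.listdir(folder)):
            path = os.path.join(folder, name)
            if os.path.isfile(path):
                with open(path, errors="replace") as handle:
                    guesser.train(label, handle.read())

# Tiny in-memory example with made-up snippets.
guesser = TinyBayes()
guesser.train("python", "def main(): import sys print(sys.argv)")
guesser.train("bash", "#!/bin/bash echo $1 if [ -f $1 ]; then fi")
guesser.train("c++", "#include <iostream> int main() { std::cout << 1; return 0; }")
result = guesser.guess("import sys def f(): print(sys.path)")
print(result)
```

With real data, something like `train_from_dirs(guesser, "training")` followed by `guesser.guess(open(path).read())` would mirror the test above; the file name is never passed to the guesser.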
Yes, there is one case where it guesses Python but c++ is not far
behind. I probably need a bigger data set, but even so, if it guesses
"quite well" then it is "quite good" :)
(I'm thinking, for example, of some service like pastebin that would
guess the language of the code you are pasting there and, if you change
the guess, could train itself with the new code.)
My training sets are very noisy, and I should subclass Reverend and
improve the tokenizer to use "=", "(", ")" and other things as
separators, since right now a line like:
It's one token...
The literals should probably be removed as well.
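A possible shape for such a tokenizer, as a sketch only: the separator set, the decision to drop only quoted string literals (numbers are kept), and the sample line are all my assumptions, and hooking it into Reverend is left out.

```python
import re

# Quoted string literals, allowing backslash escapes inside them.
DOUBLE_QUOTED = r'"(?:\\.|[^"\\])*"'
SINGLE_QUOTED = r"'(?:\\.|[^'\\])*'"
STRING_LITERAL = re.compile(DOUBLE_QUOTED + "|" + SINGLE_QUOTED)

# Whitespace plus "=", "(", ")" and a few other punctuation characters.
SEPARATORS = re.compile(r"[\s=()\[\]{},;:]+")

def tokenize(text):
    # Remove string literals first, then split on the separators.
    without_literals = STRING_LITERAL.sub(" ", text)
    return [tok for tok in SEPARATORS.split(without_literals) if tok]

# Made-up sample line, only for illustration.
tokens = tokenize('name=get_user("carles", retries=3)')
print(tokens)
```

A whitespace-only split would give two tokens for that sample line; this version gives name, get_user, retries and 3, with the string literal dropped.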
I'm taking a look at the statistics part. Here is a good start:
Carles Pina i Estany