[Spambayes] SpamBayes classifier to the arabic language _ Help
skip at pobox.com
Sat Feb 23 00:57:00 CET 2013
> I have a graduation project and have decided to use the SpamBayes technique
> to classify Arabic spam email in a Python environment.
> My questions: how can I run it with the Python programming language, and
> what packages must I use with Python?
> How can I link this technique with another preprocessing technique?
> How can it work with the Arabic language?
> How can I pass messages to the training and testing process?
You will need to download the SpamBayes source distribution so you get
the test environment and are able to easily make changes to the code.
I recently created a Git repository at GitHub:
You can just clone that repository. If you make changes to the code
you would like incorporated into SpamBayes, you can create a pull
request when you are ready.
Once you've downloaded the code you should familiarize yourself with
the tokenizer code in spambayes/spambayes/tokenizer.py. (You can
ignore everything in the website directory.) The tokenizer file
contains many detailed comments about what did and didn't work when
SpamBayes was originally developed. Arabic text will be full of
non-ASCII characters. Search for "highbit" and "8bit" to decide how
you want to handle that. I'm pretty sure you will have to modify that
code. Also, if Arabic text uses something other than an ASCII space
character to separate words you will have to fix that. It's unlikely
you will need to modify the classifier, at least initially, but it will
pay to read through that heavily commented code as well. The output
of the tokenizer step is the input to the classifier. Knowing how to
set the classifier's parameters will help when testing.
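
To make the word-separation point concrete, here is a minimal sketch of
a Unicode-aware word splitter. It is not SpamBayes code and does not use
the SpamBayes tokenizer API: the function name tokenize_arabic, the
decision to strip Arabic diacritics and the tatweel (elongation)
character, and the punctuation set are all assumptions made for
illustration. It also assumes Python 3, where str.split() with no
arguments splits on any Unicode whitespace rather than only on the
ASCII space.

```python
import re
import unicodedata

# Hypothetical helper for illustration only; not part of SpamBayes.
# Arabic short-vowel marks (U+064B..U+0652) and superscript alef (U+0670).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
_TATWEEL = "\u0640"  # elongation character used for justification
# ASCII punctuation plus the Arabic comma, semicolon and question mark.
_PUNCT = ".,;:!?()[]{}\"'\u060C\u061B\u061F"


def tokenize_arabic(text):
    """Yield word tokens from a str that may mix Arabic and ASCII text."""
    # NFKC folds presentation forms and no-break spaces into plain forms.
    text = unicodedata.normalize("NFKC", text)
    text = _DIACRITICS.sub("", text)
    text = text.replace(_TATWEEL, "")
    # split() with no arguments splits on any Unicode whitespace.
    for word in text.split():
        word = word.strip(_PUNCT)
        if word:
            # lower() only affects cased scripts such as Latin.
            yield word.lower()
```

Whether stripping diacritics helps or hurts classification is something
only testing against your own corpus can tell you; the comments in
tokenizer.py describe many similar experiments for English text.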
Familiarize yourself with spambayes/TESTING.txt to learn how to test.
Finally, you will need fairly large collections of spam and ham
emails. The TESTING file should describe the requirements there.
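
Testing of this kind usually means dividing each collection into
several sets so you can train on some and score the rest. The following
is only a sketch of one way to compute a reproducible split; the
function name split_into_sets and its parameters are hypothetical, and
the directory layout the test scripts actually expect is whatever
TESTING.txt says, not something this code knows about.

```python
import random


def split_into_sets(names, n_sets, seed=0):
    """Distribute message file names across n_sets lists of near-equal size.

    Hypothetical helper, not part of SpamBayes. Names are sorted and then
    shuffled with a fixed seed so the same input always gives the same split.
    """
    names = sorted(names)
    random.Random(seed).shuffle(names)
    sets = [[] for _ in range(n_sets)]
    for i, name in enumerate(names):
        # Round-robin assignment keeps set sizes within one of each other.
        sets[i % n_sets].append(name)
    return sets
```

You would run this once over the ham file names and once over the spam
file names, then copy or move each group into the location the test
scripts read from.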