Hey all, I'm currently working on a UC Berkeley research project. We would like to understand what interactions the tokenizer has with the different modules. Is there any documentation available that describes the different modules? We are interested in what the representation of an email is after it has been tokenized and is passed to the learner and classifier. In addition, we would like to isolate the tokenizer. Any help would be appreciated. Thanks in advance for your response. Kai Xia
We would like to understand what interactions the tokenizer has with the different modules.
The tokenizer reads the options to know what tokenizing to do. Any of the other modules that need to tokenize a message use the tokenizer. That's about it*.
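As a rough illustration of "the tokenizer reads the options to know what tokenizing to do", here is a minimal sketch. This is not the actual SpamBayes tokenizer; the option name and the token format are invented for illustration only.

```python
# Sketch of option-driven tokenization (NOT SpamBayes code; the
# "skip_long_words" option and "skip:" token format are invented).
def tokenize_subject(subject, options):
    """Yield word tokens, consulting the options to decide what to emit."""
    for word in subject.split():
        if options.get("skip_long_words") and len(word) > 12:
            # Summarize very long words instead of emitting them whole.
            yield "skip:%d" % len(word)
        else:
            yield word.lower()

tokens = list(tokenize_subject("FREE Viagra pneumonoultramicroscopic",
                               {"skip_long_words": True}))
# tokens is ["free", "viagra", "skip:24"]
```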
Is there any documentation available that describes the different modules?
There's README-DEVEL.txt in the source, and the (extensive) comments in the code. Feel free to ask questions here.
We are interested in what the email representation is after email is tokenized and going into the learner and classifier.
The email is an iterable (generator in this case, but any iterable would do) of strings.
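To make the tokenizer/learner boundary concrete, here is a hedged sketch of that interface: the tokenizer is a generator yielding token strings, and the consumer just iterates over it. These function bodies are illustrative stand-ins, not the real SpamBayes modules.

```python
# Illustrative sketch only: a generator of token strings feeding a
# consumer that iterates over it, as a learner or classifier would.
def tokenize(msg_text):
    """Yield lowercase word tokens from a raw message body."""
    for word in msg_text.split():
        yield word.lower()

def learn(token_iter, counts):
    """Update per-token counts, the way a learner might update its DB."""
    for tok in token_iter:
        counts[tok] = counts.get(tok, 0) + 1
    return counts

counts = learn(tokenize("Buy NOW buy cheap"), {})
# counts is {"buy": 2, "now": 1, "cheap": 1}
```

Note that `learn` never materializes the token stream; any iterable of strings (a list would work too) satisfies the interface.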
In addition, we would like to isolate the tokenizer.
Already done - tokenizer.py is already isolated from the rest of SpamBayes, other than the options (which control what tokenization is done).

=Tony.Meyer

* Ok, not quite all. The experimental URL slurping option imports the classifier, because it only generates tokens if the score is already known to be unsure, and the tokenizer doesn't otherwise know anything about the score. If this became non-experimental, a tidier way would be found to do this. The experimental image tokenization also uses the ImageStripper module. And the tokenizer uses mboxutils.get_message so that you can pass a string, a file, or something like that, or an email.Message object, to tokenize (this is just a convenience, really).
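The mboxutils.get_message convenience mentioned above can be sketched with the standard-library email package alone. This is a simplified stand-in, not the real SpamBayes implementation:

```python
import email
import email.message
import io

# Simplified sketch of a get_message-style helper: accept a string,
# a file-like object, or an email.message.Message, and always hand
# back a Message.  Not the actual mboxutils code.
def get_message(obj):
    if isinstance(obj, email.message.Message):
        return obj                           # already parsed
    if hasattr(obj, "read"):                 # file-like object
        return email.message_from_file(obj)
    return email.message_from_string(obj)    # raw text

raw = "Subject: hello\n\nbody text\n"
m1 = get_message(raw)                 # from a string
m2 = get_message(io.StringIO(raw))    # from a file-like object
m3 = get_message(m1)                  # passed through unchanged
```

With a helper like this in front of it, the tokenizer itself only ever has to deal with one input type.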