[spambayes-dev] Design Doc for Tokenizer
Tony Meyer
tameyer at ihug.co.nz
Mon Oct 30 07:28:52 CET 2006
> We would like to
> understand what interactions the tokenizer has with the different
> modules.
The tokenizer reads the options to know what tokenizing to do. Any
of the other modules that need to tokenize a message use the
tokenizer. That's about it*.
> Is there any documentation available that describes the different
> modules?
There's README-DEVEL.txt in the source, and the (extensive) comments
in the code. Feel free to ask questions here.
> We are interested in what the email representation is after email is
> tokenized and going into the learner and classifier.
The email is an iterable (generator in this case, but any iterable
would do) of strings.
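A minimal sketch of that shape (names here are illustrative, not the actual tokenizer code): the classifier just consumes an iterable of token strings, so any generator that yields strings fits the contract.

```python
import re

def toy_tokenize(text):
    """Illustrative stand-in for tokenizer.tokenize():
    yield lowercase word tokens from raw message text."""
    for word in re.findall(r"[A-Za-z0-9'$-]+", text):
        yield word.lower()

# The classifier can iterate this lazily, one token at a time.
tokens = toy_tokenize("Buy CHEAP meds now!")
print(list(tokens))  # ['buy', 'cheap', 'meds', 'now']
```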
> In addition, we would like to isolate the tokenizer.
Already done - tokenizer.py is isolated from the rest of
SpamBayes, other than the options (which control what tokenization is
done).
=Tony.Meyer
* Ok, not quite all. The experimental URL slurping option imports
the classifier, because it only generates tokens if the score is
already known to be unsure, and the tokenizer doesn't otherwise know
anything about scores. If this became non-experimental, a tidier way
would be found to do this. The experimental image tokenization also
uses the ImageStripper module. And the tokenizer uses
mboxutils.get_message so that you can pass a string, a file, or
something like that, or an email.Message object, to tokenize (this is
just a convenience, really).