[spambayes-dev] Design Doc for Tokenizer

Tony Meyer tameyer at ihug.co.nz
Mon Oct 30 07:28:52 CET 2006


> We would like to
> understand what interactions the tokenizer has with the different  
> modules.

The tokenizer reads the options to know what tokenizing to do.  Any  
of the other modules that need to tokenize a message use the  
tokenizer.  That's about it*.
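
Something like this, off the top of my head (the module-level
tokenize helper and the option name are from memory, so treat this as
a sketch rather than gospel):

    from spambayes.Options import options
    from spambayes.tokenizer import tokenize

    msg = "Subject: free stuff\n\nClick here now"

    # The [Tokenizer] section of options is the tokenizer's only real
    # tie to the rest of SpamBayes; it decides which tokens get made.
    # (The option name below just illustrates how options are read.)
    mine = options["Tokenizer", "mine_received_headers"]

    # tokenize() hands back a generator of plain token strings.
    for token in tokenize(msg):
        print token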

> Is there any documentation available that describes the different  
> modules?

There's README-DEVEL.txt in the source, and the (extensive) comments  
in the code.  Feel free to ask questions here.

> We are interested in what the email representation is after email is
> tokenized and going into the learner and classifier.

The email is an iterable (generator in this case, but any iterable  
would do) of strings.
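
In other words, training and scoring just pass that stream straight
to the classifier.  Roughly (class and method names from memory, so
double-check against classifier.py):

    from spambayes.classifier import Classifier
    from spambayes.tokenizer import tokenize

    spam_text = "Subject: Cheap pills\n\nBuy now!"
    ham_text = "Subject: Meeting\n\nSee you at 10."

    bayes = Classifier()

    # The learner consumes the iterable of token strings directly.
    bayes.learn(tokenize(spam_text), True)    # True means "is spam"
    bayes.learn(tokenize(ham_text), False)

    # The classifier consumes another such iterable and returns a
    # score between 0.0 and 1.0.
    prob = bayes.spamprob(tokenize("Subject: Cheap meeting\n\nBuy?"))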

> In addition, we would like to isolate the tokenizer.

Already done - tokenizer.py is isolated from the rest of SpamBayes,
other than the options (which control what tokenization is done).

=Tony.Meyer

* Ok, not quite all.  The experimental URL slurping option imports
the classifier, because it only generates tokens if the score is
already known to be unsure, and the tokenizer doesn't otherwise know
anything about scores.  If this became non-experimental, a tidier way
would be found to do this.  The experimental image tokenization also
uses the ImageStripper module.  And the tokenizer uses
mboxutils.get_message so that you can pass a string, a file, or an
email.Message object to tokenize (this is just a convenience, really).
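
To make that concrete, all of these end up as the same sort of
email.Message internally (again just a sketch; the file name is made
up):

    import email
    from spambayes import mboxutils
    from spambayes.tokenizer import tokenize

    raw = "Subject: hello\n\nJust a test."

    # get_message() normalises whatever you hand it into an
    # email.Message before the tokenizer looks at headers and body.
    tokens_from_string = list(tokenize(raw))
    tokens_from_file = list(tokenize(open("test.msg")))  # hypothetical file
    tokens_from_msg = list(tokenize(email.message_from_string(raw)))

    # You can also call it directly:
    msg = mboxutils.get_message(raw)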

