[spambayes-dev] effective tokenizer for wiki text

Tony Meyer tameyer at ihug.co.nz
Tue Oct 31 01:37:17 CET 2006

> Why not just create an "email message" out of the input?  If the
> headers are identical in every message they won't generate any useful
> tokens and the message body will be all that yields useful clues.
> OTOH, if you have login or IP address information for the spammers,
> you might suitably populate the From: field.
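
For illustration, a rough sketch of that "bend it into an email" approach,
using only Python's standard email package.  The function name, the
placeholder addresses and the idea of turning the author into a fake
address are all assumptions made up for this example; nothing here is
SpamBayes API.

```python
from email.message import EmailMessage


def wiki_page_to_email(content, author=None):
    """Wrap wiki page text in an email message (illustrative helper)."""
    msg = EmailMessage()
    # Constant headers: identical for every page, so they should not
    # produce any tokens that distinguish spam from ham.
    msg["To"] = "wiki@wiki.invalid"
    msg["Subject"] = "wiki edit"
    # Assumption: a login name or IP address, if known, is turned into
    # a fake address so that it ends up in the From: field.
    msg["From"] = "%s@wiki.invalid" % (author or "unknown")
    # The page text becomes the plain-text message body.
    msg.set_content(content)
    return msg
```

The resulting message (or msg.as_string()) can then be handed to
whatever already expects email input.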

ISTM that creating a Tokenizer subclass that tokenizes wiki pages would
be no more work than writing a "wiki-page to email" module.  You can
then skip all of the header tokenization (and any email-specific
tokenization in the body, although I can't think of any) and generate
additional tokens from whatever metadata is available (comment,
author, etc.).

>> Are there examples from other people that have written custom
>> tokenizers that may be helpful, or do you have any hints on what to
>> take into account for writing an effective tokenizer for Wiki text?

What exactly gets passed to the tokenizer?  Anything more than just
the content (complete? diff?) of the wiki page?  If it's just the
content/diff then other than the words themselves, URLs are probably
the most useful content.  You could try enabling (or improving) the
URL slurping code, perhaps.
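
As a rough idea of how URLs in the page text could be turned into clues,
here is a minimal sketch.  It is not the slurping code: it fetches
nothing over the network and only pulls host names out of the text.  The
regular expression, the "url:" prefix and the function name are
assumptions chosen for this example.

```python
import re
from urllib.parse import urlsplit

# Deliberately simple pattern: stops at whitespace, quotes, angle
# brackets, square brackets and pipes (common wiki link delimiters).
URL_RE = re.compile(r"https?://[^\s<>\"'\[\]|]+", re.IGNORECASE)


def url_tokens(text):
    """Yield one token per host name, plus its parent domains."""
    for match in URL_RE.finditer(text):
        try:
            host = urlsplit(match.group(0)).hostname or ""
        except ValueError:
            # Malformed URL; ignore it rather than fail.
            continue
        if not host:
            continue
        yield "url:" + host
        # www.example.com also yields example.com, so that spam on a
        # new subdomain still matches an already-trained parent domain.
        labels = host.split(".")
        for i in range(1, len(labels) - 1):
            yield "url:" + ".".join(labels[i:])
```

The prefix keeps these tokens from colliding with ordinary words in the
page body.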

> So far, I think most of us have bent our input to look like email.
> I think that would be a lot easier than writing and debugging a new
> tokenizer.

A tokenizer's pretty simple, really - all it has to do is take the
object you want to tokenize and yield a series of strings.  It's been
a couple of years, but I wrote some non-email tokenizers at one point.
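
To make "take the object and yield a series of strings" concrete, here
is a minimal sketch of a stand-alone wiki tokenizer written as a plain
generator.  It does not subclass anything from SpamBayes; the function
name, the "author:" and "comment:" prefixes and the word-length limits
are assumptions picked for the example.

```python
def tokenize_wiki_page(content, author=None, comment=None):
    """Yield string tokens for one wiki page or diff (illustrative)."""
    # Metadata tokens get a prefix so they cannot collide with words
    # from the page body.
    if author:
        yield "author:" + author.lower()
    if comment:
        for word in comment.lower().split():
            yield "comment:" + word
    # Body tokens: lowercase, split on whitespace, trim punctuation.
    for word in content.lower().split():
        word = word.strip(".,;:!?()[]{}'\"")
        # Arbitrary limits: drop very short and very long words.
        if 3 <= len(word) <= 12:
            yield word
```

For example, list(tokenize_wiki_page("Buy CHEAP pills, now!",
author="Alice", comment="Fixed typo")) gives the three metadata tokens
followed by the four body words.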


