[Spambayes] better Received header tokens

T. Alexander Popiel popiel at wolfskeep.com
Sun Mar 9 18:37:09 EST 2003


In message:  <20030309200808.GA19398 at glacier.arctrix.com>
             Neil Schemenauer <nas at python.ca> writes:

>I wasted some time today trying to improve the mine_received_headers
>option.  The goal was to generate fewer, more useful tokens.  Also,
>I wanted to be resistant to received header forgery. [...]

>I expected this to do better than the current code.  Testing shows
>otherwise.  Perhaps using a more specific or more general network
>(instead of /16) would help.

Something that has occurred to me recently: how many tokens does it
take to significantly change the scores?  Most of the recent tokenizing
experiments have been adding between one and a handful of tokens, or
even reducing token count.  Perhaps our problem is not that the
identification methods we're coming up with are bad (heck, Tim did
indicate that the bytes/word token _was_ a strong indicator... I
didn't look at the values for the token itself), but rather that
these new methods of identification are getting drowned out in the
noise.
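
To put a rough number on the "drowned out" worry, here is a minimal
sketch of chi-squared combining, written from memory of how the
classifier combines clues rather than copied from it.  The background
clue values (0.9 and 0.1) and the 0.99 value for the new token are
made up for illustration, and score_with_extra is a hypothetical
helper.  It adds one strong spam clue to a small and to a large
balanced background and compares the two scores.

```python
import math

def chi2Q(x2, v):
    """Probability that a chi-squared variable with v (even) degrees
    of freedom exceeds x2."""
    assert v > 0 and v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs):
    """Chi-squared combining of per-token spam probabilities (0 < p < 1)."""
    n = len(probs)
    if n == 0:
        return 0.5
    ln_not_spam = sum(math.log(1.0 - p) for p in probs)
    ln_spam = sum(math.log(p) for p in probs)
    S = 1.0 - chi2Q(-2.0 * ln_not_spam, 2 * n)  # evidence for spam
    H = 1.0 - chi2Q(-2.0 * ln_spam, 2 * n)      # evidence for ham
    return (S - H + 1.0) / 2.0

def score_with_extra(extra, pairs):
    """Hypothetical helper: one extra clue on top of `pairs` balanced
    (0.9, 0.1) background pairs, which alone score exactly 0.5."""
    background = [0.9, 0.1] * pairs
    return combine(background + [extra])

few = score_with_extra(0.99, 2)    # 4 background clues
many = score_with_extra(0.99, 30)  # 60 background clues
print(few, many)
```

Under these made-up numbers the single extra token moves the score
well away from 0.5 when there are only four other clues, and barely
moves it when there are sixty.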

Perhaps we should figure out some way to give metatokens extra
weight in the combining calculations?  I'm afraid that I don't
have a strong enough math background to know how to do this.
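
One possible way to do this, offered only as a sketch: count a
metatoken's evidence several times in the chi-squared sums, as if the
clue had been seen w times.  This is an assumption for illustration,
not something the classifier supports; combine_weighted and the
(prob, weight) clue format are hypothetical, and the weights are kept
to positive integers so the degrees of freedom stay even.  Note that
repeating a clue leans even harder on the independence assumption
behind the combining.

```python
import math

def chi2Q(x2, v):
    """Probability that a chi-squared variable with v (even) degrees
    of freedom exceeds x2."""
    assert v > 0 and v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine_weighted(clues):
    """clues: list of (prob, weight) pairs, weight a positive integer.
    A clue with weight w is counted as w identical clues."""
    n = 0
    ln_not_spam = 0.0
    ln_spam = 0.0
    for p, w in clues:
        assert isinstance(w, int) and w >= 1
        n += w
        ln_not_spam += w * math.log(1.0 - p)
        ln_spam += w * math.log(p)
    if n == 0:
        return 0.5
    S = 1.0 - chi2Q(-2.0 * ln_not_spam, 2 * n)
    H = 1.0 - chi2Q(-2.0 * ln_spam, 2 * n)
    return (S - H + 1.0) / 2.0

# Made-up data: 60 balanced ordinary clues plus one metatoken at 0.99.
background = [(0.9, 1), (0.1, 1)] * 30
plain = combine_weighted(background + [(0.99, 1)])    # metatoken counted once
boosted = combine_weighted(background + [(0.99, 3)])  # metatoken counted 3 times
print(plain, boosted)
```

With these values the boosted score ends up further from 0.5 than the
plain one, though whether that helps on real mail would need testing.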

Alternatively, we could drop the limit on the number of tokens looked
at from 150 back down to around 20...
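
A small sketch of what that limit does, assuming the clues are chosen
by distance from 0.5 and that clues too close to 0.5 are skipped.  The
function name, the min_strength default of 0.1, the token names and
all probabilities below are illustrative assumptions, not taken from
the real code or from real data.

```python
def strongest_clues(token_probs, max_discriminators=150, min_strength=0.1):
    """Return the (token, prob) pairs that would be handed to the
    combiner: the strongest clues first, capped at max_discriminators."""
    usable = [(tok, p) for tok, p in token_probs.items()
              if abs(p - 0.5) >= min_strength]
    usable.sort(key=lambda tp: abs(tp[1] - 0.5), reverse=True)
    return usable[:max_discriminators]

# Hypothetical message: 100 ordinary tokens of moderate strength,
# one strong metatoken and one weak metatoken.
token_probs = {"word%d" % i: (0.8 if i % 2 else 0.2) for i in range(100)}
token_probs["meta:strong"] = 0.97
token_probs["meta:weak"] = 0.7

wide = strongest_clues(token_probs, max_discriminators=150)
narrow = strongest_clues(token_probs, max_discriminators=20)

wide_names = [tok for tok, _ in wide]
narrow_names = [tok for tok, _ in narrow]
print(len(wide), len(narrow))
print("meta:strong" in narrow_names, "meta:weak" in narrow_names)
```

The smaller limit cuts both ways: a strong metatoken becomes one clue
in 20 instead of one in about a hundred, but a metatoken weaker than
the ordinary tokens around it is dropped from the calculation
altogether.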

- Alex


