[Spambayes] Another software in the field

Justin Mason jm@jmason.org
Fri Nov 15 12:45:15 2002


Matt Sergeant said:
> > One thing that surprises me is that there's a seemingly endless list of 
> > projects all implementing Graham's approach exactly as he originally
> > described it - almost no-one else is doing the basic testing and
> > research that this sort of approach would seem to cry out for.
> Why would anyone else want to, when you guys are doing such an amazing 
> job of it? ;-)

Hi all,

Well, I've just started on the same thing for SpamAssassin -- I'm
gradually reinventing the wheel, I think.  For example, I've just found
that including hapaxes improves results in the middle ground
considerably, which I think is something you guys found a long time ago ;)

But here's one thing I've noticed which might be useful for you guys.
In SpamAssassin recently, we've been meditating on Message-Ids;
particularly Outlook-format ones, like:

	<002901c28c22$3e8cb260$0201a8c0@gorm>

Now, I've figured out that this is composed of:

	<???? TIMESTAMP $ ???????? $ SENDERID @ hostname>

TIMESTAMP is the top 4 bytes of the FILETIME struct on Windows, which
we can validate in SpamAssassin using Perl code.  That's not a runner for
spambayes, unfortunately.
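
To make the TIMESTAMP idea concrete, here is a minimal Python sketch of
what such a check could look like.  It is not the SpamAssassin Perl code;
the regex, the function name outlook_msgid_time and the field widths
(read off the single example above) are assumptions for illustration.  A
FILETIME counts 100-nanosecond intervals since 1601-01-01 UTC, so the top
4 bytes alone only give a lower bound, good to roughly 7 minutes.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed layout, taken from the one example in the message:
#   <???? TIMESTAMP $ ???????? $ SENDERID @ hostname>
# with 4, 8, 8 and 8 hex digits respectively.
OUTLOOK_MSGID = re.compile(
    r"^<(?P<unknown>[0-9a-f]{4})(?P<timestamp>[0-9a-f]{8})"
    r"\$(?P<middle>[0-9a-f]{8})\$(?P<senderid>[0-9a-f]{8})"
    r"@(?P<host>[^>]+)>$",
    re.IGNORECASE,
)

# FILETIME epoch: 1601-01-01 00:00:00 UTC, in 100 ns ticks.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def outlook_msgid_time(msgid):
    """Return the earliest time consistent with TIMESTAMP, or None.

    Hypothetical helper: returns None when the Message-Id does not
    match the assumed Outlook layout.
    """
    m = OUTLOOK_MSGID.match(msgid.strip())
    if m is None:
        return None
    high = int(m.group("timestamp"), 16)
    ticks = high << 32            # low 32 bits are unknown, treat as zero
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)


if __name__ == "__main__":
    print(outlook_msgid_time("<002901c28c22$3e8cb260$0201a8c0@gorm>"))
```

A validity test could then compare this value against the Date: header
and flag large mismatches; how much slack to allow is left open here.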

However, SENDERID is a constant value which never changes for an Outlook
or Exchange installation, as far as I can see -- so you want to make sure
your tokenizer parses Message-Ids and returns SENDERID as a single
token.  That token should pick up useful probabilities against those
tricky spammers who are getting good at sending legit-looking text and
headers ;)  No matter what hostnames they use, it should not change
unless they reinstall Outlook (as far as I know).
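
As a rough illustration of the tokenizer suggestion, the sketch below
pulls SENDERID out of an Outlook-style Message-Id and emits it as one
token.  This is not spambayes' actual tokenizer: the function name
message_id_tokens, the "msgid-senderid:" token prefix and the assumed
field widths are all hypothetical.

```python
import re

# Assumed layout from the message above:
#   <???? TIMESTAMP $ ???????? $ SENDERID @ hostname>
_OUTLOOK_MSGID = re.compile(
    r"^<[0-9a-f]{4}[0-9a-f]{8}\$[0-9a-f]{8}\$(?P<senderid>[0-9a-f]{8})@[^>]+>$",
    re.IGNORECASE,
)


def message_id_tokens(msgid):
    """Return a list with one SENDERID token, or an empty list.

    Hypothetical helper: the hostname is deliberately left out of the
    token, since the point is that SENDERID stays the same whatever
    hostname the sender uses.
    """
    m = _OUTLOOK_MSGID.match(msgid.strip())
    if m is None:
        return []
    return ["msgid-senderid:" + m.group("senderid").lower()]


if __name__ == "__main__":
    print(message_id_tokens("<002901c28c22$3e8cb260$0201a8c0@gorm>"))
```

Two messages from the same installation sent under different hostnames
would then yield the same token, which is what lets the classifier learn
a probability for it.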

Quick question BTW -- I've been trying to keep our bayes-testing stats
close to yours, so we can compare results directly.  But there's one
thing I've run into.  As far as I can see, in your 10-fold
cross-validation suite, you train on 1 fold and test against the other
9 -- whereas the published lit (or at least Ion's papers) seems to
suggest that 10FCV works better when trained on 9 folds and tested
against 1.  Is there a reason you chose this?
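
To spell out the two schemes being compared, here is a small Python
sketch that generates the train/test fold assignments for each.  It is
not the spambayes test driver; the function name
cross_validation_splits and the train_on_one flag are made up for this
example.

```python
def cross_validation_splits(n_folds, train_on_one=False):
    """Yield (train_folds, test_folds) pairs of fold indices.

    train_on_one=False: conventional k-fold, train on k-1 folds and
    test on the one held out.
    train_on_one=True: the inverted scheme described above, train on
    one fold and test on the remaining k-1.
    """
    folds = list(range(n_folds))
    for held in folds:
        rest = [f for f in folds if f != held]
        if train_on_one:
            yield [held], rest
        else:
            yield rest, [held]


if __name__ == "__main__":
    for train, test in cross_validation_splits(10):
        print("train", train, "test", test)
```

Either way there are 10 runs; the difference is that the conventional
scheme trains each classifier on 90% of the data and the inverted one on
10%.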

PS: about time I posted here, I've been lurking and reading for weeks ;)

--j.


