[Spambayes] Another software in the field

Fri Nov 15 17:21:19 2002

[Justin Mason]
> Well, I've just started for SpamAssassin -- I'm gradually reinventing
> the wheel I think.  For example, I've just found that including hapaxes
> improves the middle ground very well, which I think is something you
> guys did a long time ago ;)

Ya, ignoring hapaxes is a form of bias, and we eventually found that all
forms of bias hurt.

> But here's one thing I've noticed which might be useful for you guys.
> In SpamAssassin recently, we've been meditating on Message-Ids;
> particularly Outlook-format ones, like:
>
> 	<002901c28c22$3e8cb260$0201a8c0@gorm>

Hmm.  I use Outlook 2000, and my last post had:

 Message-id: <BIEJKCLHCIOIHAGOKOLHEEEFDPAA.tim.one@comcast.net>

OTOH, a recent one from Paul Moore had:

 Message-id:
 <16E1010E4581B049ABC51D4975CEDB885E2DCA@UKDCX001.uk.int.atosorigin.com>

and from Mark Hammond:

 Message-id: <LCEPIIGDJPKCOIHOBJEPEEJGHLAA.mhammond@skippinet.com.au>

and from Sean True:

 Message-id: <MJEHLHJKGINLONDMMKNEKELIHFAA.seant@iname.com>

These are all (I believe) Outlook users.  No $ in sight!  I believe Paul is
alone in this group in using an Exchange server instead of straight SMTP.

> now, I've figured out this is composed of
>
> 	<???? TIMESTAMP $ ???????? $ SENDERID @ hostname>
>
> TIMESTAMP is the top 4 bytes of the FILETIME struct on windows, which
> we can validate in SpamAssassin using perl code.

What does "validate" mean in this context?

> not a runner for spambayes, unfortunately.

Post the Perl code and I bet it will be easy to do in Python too.  I'm not
sure what you mean otherwise; for example, a FILETIME is conceptually a
64-bit integer, and by "top 4 bytes" it's unclear to me whether you mean the
most-significant 4 bytes of that int, or the first 4 bytes in storage order
(which happen to be the least-significant 4 bytes of the big int).
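To make the two readings concrete, here is a minimal Python sketch. The helper names and the sample 64-bit value are made up for illustration. The last step assumes that the 8 hex digits after the first 4 characters of Justin's sample id (01c28c22) are the TIMESTAMP field, and that "top" means most-significant; neither is confirmed here.

```python
import struct
from datetime import datetime, timedelta

# FILETIME counts 100-nanosecond ticks since 1601-01-01 (UTC).
FILETIME_EPOCH = datetime(1601, 1, 1)

def filetime_halves(ft):
    """Return (most-significant 4 bytes, first 4 bytes in storage order)."""
    raw = struct.pack("<Q", ft)                    # little-endian, as stored on x86
    first_in_storage = struct.unpack("<I", raw[:4])[0]
    most_significant = ft >> 32
    return most_significant, first_in_storage

def high_half_to_datetime(high):
    """Approximate UTC time, ASSUMING `high` is the most-significant half."""
    ticks = high << 32                             # low half unknown, treated as 0
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)

# Arbitrary illustrative value, not taken from any real message.
high, low = filetime_halves(0x01C28C2212345678)

# The 8 hex digits assumed to be TIMESTAMP in the sample Message-Id.
when = high_half_to_datetime(0x01C28C22)
```

With only the high half available, the resolution is 2**32 ticks, or roughly seven minutes, so a check of this kind could only be a coarse sanity test on the date.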

> However, SENDERID is a constant value which never changes for an
> Outlook or Exchange installation, as far as I can see -- so you want
> to make sure your tokenizer will parse message-ids, and will return
> that as one token.
>
> It will gain valuable probabilities for those tricky spammers
> who are getting good at sending legit-looking text and headers ;)
> No matter what hostnames they use, unless they reinstall Outlook
> (as far as I know) that should not change.

That would indeed be a great clue!
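A rough sketch of what such a tokenizer hook might look like in Python. The field widths (4 characters, then 8 hex digits per field) are an assumption read off the single sample id above, and the function and token names are hypothetical, not anything in the spambayes tokenizer.

```python
import re

# Assumed layout, from the description above:
#   <4 chars><8 hex TIMESTAMP>$<8 hex>$<8 hex SENDERID>@hostname
OUTLOOK_MSGID = re.compile(
    r"<[0-9a-f]{4}(?P<timestamp>[0-9a-f]{8})"
    r"\$[0-9a-f]{8}"
    r"\$(?P<senderid>[0-9a-f]{8})"
    r"@(?P<host>[^>]+)>",
    re.IGNORECASE,
)

def tokenize_message_id(value):
    """Yield one token carrying SENDERID if the id fits the assumed layout."""
    m = OUTLOOK_MSGID.search(value)
    if m is not None:
        # Hypothetical token spelling; the hostname is deliberately left out.
        yield "message-id:senderid:" + m.group("senderid").lower()

tokens = list(tokenize_message_id("<002901c28c22$3e8cb260$0201a8c0@gorm>"))
no_dollar = list(tokenize_message_id(
    "<BIEJKCLHCIOIHAGOKOLHEEEFDPAA.tim.one@comcast.net>"))
```

Ids without the $ separators, like the ones quoted earlier, produce no token under this sketch.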

> Quick question BTW -- I've been trying to keep our bayes-testing stats
> close to yours, so we can compare portably.  But there's one thing I've
> run into.  As far as I can see, in your 10-fold cross-validation suite,
> you train using 1 fold and test against 9

That's backwards, although it's tricky:  for speed, timcv.py:

+ Trains on sets 2-10.

+ Predicts against set 1.
+ Incrementally trains set 1 (leaving the classifier trained on 1-10).

+ Incrementally *untrains* set 2 (leaving 1 + 3-10 trained).
+ Predicts against set 2.
+ Incrementally trains set 2 (leaving 1-10 trained again).

+ Incrementally untrains set 3 (leaving 1-2 + 4-10 trained).
+ Predicts against set 3.
+ Incrementally trains set 3 (leaving 1-10 trained again).

and so on.  This has huge performance benefits, in both instruction count
and cache locality, versus running timcv.py with option
build_each_classifier_from_scratch enabled.
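The train/untrain dance above can be sketched in a few lines of Python. This is not timcv.py's code: ToyClassifier, incremental_cv and the predict callback are hypothetical stand-ins, and the only property assumed of the classifier is that unlearning a set exactly undoes learning it.

```python
from collections import Counter

class ToyClassifier:
    """Stand-in classifier: just counts the items it has been trained on."""
    def __init__(self):
        self.counts = Counter()

    def learn(self, messages):
        self.counts.update(messages)

    def unlearn(self, messages):
        self.counts.subtract(messages)

    def known(self):
        # Unary + drops items whose count has fallen to zero.
        return sorted(+self.counts)

def incremental_cv(folds, classifier, predict):
    """Run N-fold cross-validation with incremental train/untrain steps."""
    for fold in folds[1:]:              # train on sets 2..N
        classifier.learn(fold)
    results = []
    for i, fold in enumerate(folds):
        if i > 0:
            classifier.unlearn(fold)    # untrain the set about to be tested
        results.append(predict(classifier, fold))
        classifier.learn(fold)          # back to fully trained on 1..N
    return results

folds = [["a"], ["b"], ["c"]]
clf = ToyClassifier()
# The "prediction" here just records what the classifier knew at test time.
seen = incremental_cv(folds, clf, lambda c, fold: c.known())
```

Each fold is predicted by a classifier that has not seen it, yet only one untrain and one train pass per fold is needed instead of retraining on N-1 sets every time.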

> -- whereas the published lit (or at least Ion's papers) seems to
> suggest that 10FCV works better trained against 9 and tested against 1.

Right.

> Is there a reason you chose this?

I was looking for a new hobby after I stopped beating my wife <wink>.
timtest.py is an NxN grid driver, running N**2-N tests each training on 1
and predicting against N-1.  That's a good way to get lots of hard test runs
if you have lots of data.  timcv.py is vanilla cross-validation, running N
tests each training on N-1 and predicting against 1.  README.txt and
TESTING.txt say more about all this.
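To see where the two run counts come from, here is a small Python sketch that just enumerates the (train, predict) combinations. It reads the grid as one run per ordered pair of distinct sets, which is an interpretation of the description above; grid_plan and cv_plan are hypothetical names, not functions from either driver.

```python
def grid_plan(n):
    """Grid style: train on one set, predict against a different one."""
    return [(train, test)
            for train in range(1, n + 1)
            for test in range(1, n + 1)
            if train != test]

def cv_plan(n):
    """Cross-validation style: train on all sets but one, predict that one."""
    everything = range(1, n + 1)
    return [([s for s in everything if s != test], test)
            for test in everything]

grid = grid_plan(10)   # N**2 - N runs
cv = cv_plan(10)       # N runs
```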

> PS: about time I posted here, I've been lurking and reading for weeks ;)

Poor man -- I'm glad you uncloaked!  Did the Outlook Message-Ids fit a
pattern you've seen?  I'm keen to pursue that.