[Spambayes] Another software in the field
Fri Nov 15 17:21:19 2002
> Well, I've just started for SpamAssassin -- I'm gradually reinventing
> the wheel I think. For example, I've just found that including hapaxes
> improves the middle ground very well, which I think is something you
> guys did a long time ago ;)
Ya, ignoring hapaxes is a form of bias, and we eventually found that all
forms of bias hurt.
> But here's one thing I've noticed which might be useful for you guys.
> In SpamAssassin recently, we've been meditating on Message-Ids;
> particularly Outlook-format ones, like:
Hmm. I use Outlook 2000, and my last post had:
OTOH, a recent one from Paul Moore had:
and from Mark Hammond:
and from Sean True:
These are all (I believe) Outlook users. No $ in sight! I believe Paul is
alone in this group in using an Exchange server instead of straight SMTP.
> now, I've figured out this is composed of
> <???? TIMESTAMP $ ???????? $ SENDERID @ hostname>
> TIMESTAMP is the top 4 bytes of the FILETIME struct on windows, which
> we can validate in SpamAssassin using perl code.
What does "validate" mean in this context?
> not a runner for spambayes, unfortunately.
Post the Perl code and I bet it will be easy to do in Python too. I'm not
sure what you mean otherwise; for example, a FILETIME is conceptually a
64-bit integer, and by "top 4 bytes" it's unclear to me whether you mean the
most-significant 4 bytes of that int, or the first 4 bytes in storage order
(which happen to be the least-significant 4 bytes of the big int).
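To make the two readings concrete, here is a minimal sketch. It assumes only the documented FILETIME definition (a 64-bit count of 100-nanosecond ticks since 1601-01-01 UTC, stored low DWORD first); the helper names are made up for illustration and are not SpamAssassin or spambayes code.

```python
import struct
from datetime import datetime, timezone

# FILETIME counts 100 ns ticks from this epoch.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

def to_filetime(dt):
    """Convert an aware datetime to a 64-bit FILETIME tick count."""
    delta = dt - FILETIME_EPOCH
    seconds = delta.days * 86400 + delta.seconds
    return seconds * 10**7 + delta.microseconds * 10

def most_significant_four(ft):
    """Reading 1: the high 32 bits of the 64-bit integer."""
    return ft >> 32

def first_four_in_storage(ft):
    """Reading 2: the first 4 bytes as laid out in memory (little-endian),
    which are the low 32 bits of the integer."""
    raw = struct.pack("<Q", ft)
    return struct.unpack("<I", raw[:4])[0]

ft = to_filetime(datetime(2002, 11, 15, 17, 21, 19, tzinfo=timezone.utc))
print("high 32 bits:  %08X" % most_significant_four(ft))
print("first 4 bytes: %08X" % first_four_in_storage(ft))
```

The high half changes only about once every 429.5 seconds (2**32 ticks of 100 ns), while the low half cycles through its whole range in that time, so which half is meant matters a great deal for any timestamp check.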
> However, SENDERID is a constant value which never changes for an
> Outlook or Exchange installation, as far as I can see -- so you want
> to make sure your tokenizer will parse message-ids, and will return
> that as one token.
> It will gain valuable probabilities for those tricky spammers
> who are getting good at sending legit-looking text and headers ;)
> No matter what hostnames they use, unless they reinstall Outlook
> (as far as I know) that should not change.
That would indeed be a great clue!
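A rough sketch of what returning SENDERID as one token could look like. The regular expression, the token prefix, and the sample Message-Id are all invented for illustration; the field layout is taken only from the quoted description above, and this is not the actual spambayes tokenizer.

```python
import re

# Hypothetical pattern for the quoted layout:
#   <???? TIMESTAMP $ ???????? $ SENDERID @ hostname>
# i.e. three '$'-separated fields before the '@'.
DOLLAR_MSGID = re.compile(
    r"<([^$@<>\s]+)\$([^$@<>\s]+)\$([^$@<>\s]+)@([^<>\s]+)>"
)

def tokenize_message_id(value):
    """Return a one-element list holding the SENDERID token,
    or an empty list if the Message-Id lacks the '$' layout."""
    match = DOLLAR_MSGID.search(value)
    if match is None:
        return []
    sender_id = match.group(3)
    # The hostname is deliberately left out: the point is that the
    # sender id stays constant whatever hostname is used.
    return ["message-id-senderid:" + sender_id.lower()]

# Made-up example value, not taken from any real message.
print(tokenize_message_id("<000a01c28cb0$1a2b3c40$0100A8C0@example.invalid>"))
print(tokenize_message_id("<plain-id@example.invalid>"))
```

Message-Ids without the `$` fields, like the Outlook ones mentioned earlier in this reply, simply produce no token here.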
> Quick question BTW -- I've been trying to keep our bayes-testing stats
> close to yours, so we can compare portably. But there's one thing I've
> run into. As far as I can see, in your 10-fold cross-validation suite,
> you train using 1 fold and test against 9
That's backwards, although it's tricky: for speed, timcv.py:
+ Trains on sets 2-10.
+ Predicts against set 1.
+ Incrementally trains set 1 (leaving the classifier trained on 1-10).
+ Incrementally *untrains* set 2 (leaving 1 + 3-10 trained).
+ Predicts against set 2.
+ Incrementally trains set 2 (leaving 1-10 trained again).
+ Incrementally untrains set 3 (leaving 1-2 + 4-10 trained).
+ Predicts against set 3.
+ Incrementally trains set 3 (leaving 1-10 trained again).
and so on. This has huge performance benefits, in both instruction count
and cache locality, versus running timcv.py with option
> -- whereas the published lit (or at least Ion's papers) seems to
> suggest that 10FCV works better trained against 9 and tested against 1.
> Is there a reason you chose this?
I was looking for a new hobby after I stopped beating my wife <wink>.
timtest.py is an NxN grid driver, running N**2-N tests each training on 1
and predicting against N-1. That's a good way to get lots of hard test runs
if you have lots of data. timcv.py is vanilla cross-validation, running N
tests each training on N-1 and predicting against 1. README.txt and
TESTING.txt say more about all this.
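The two schedules can be sketched as follows. This is an illustration only: CountingClassifier is a toy stand-in that just counts tokens, the function names are invented, and none of it is the real timcv.py or timtest.py code. Fold indices are 0-based here, whereas the step list for timcv.py above counts from 1.

```python
from collections import Counter

class CountingClassifier:
    """Toy classifier that supports incremental training and untraining."""
    def __init__(self):
        self.counts = Counter()
        self.trained = set()   # indices of the folds currently trained in

    def learn(self, fold_id, messages):
        for tokens in messages:
            self.counts.update(tokens)
        self.trained.add(fold_id)

    def unlearn(self, fold_id, messages):
        for tokens in messages:
            self.counts.subtract(tokens)
        self.trained.discard(fold_id)

def incremental_cv(folds):
    """Cross-validation schedule: train on N-1 folds, predict against 1,
    moving between runs by untraining one fold and retraining another."""
    clf = CountingClassifier()
    n = len(folds)
    for i in range(1, n):              # start with every fold except the first
        clf.learn(i, folds[i])
    schedule = []
    for i in range(n):
        if i > 0:
            clf.unlearn(i, folds[i])   # drop the fold about to be tested
        schedule.append((sorted(clf.trained), i))  # a real run predicts here
        clf.learn(i, folds[i])         # put it back: all folds trained again
    return schedule

def grid_pairs(n):
    """Grid schedule: every (train on one set, predict on another) pair."""
    return [(a, b) for a in range(n) for b in range(n) if a != b]

folds = [[["cheap", "pills"]], [["meeting", "notes"]], [["free", "offer"]]]
for trained, tested in incremental_cv(folds):
    print("trained on", trained, "-> predict against", tested)
print(len(grid_pairs(10)), "grid runs for N=10")
```

Each cross-validation step touches only two folds' worth of messages instead of rebuilding a classifier from N-1 folds, which is where the saving comes from.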
> PS: about time I posted here, I've been lurking and reading for weeks ;)
Poor man -- I'm glad you uncloaked! Did the Outlook Message-Ids fit a
pattern you've seen? I'm keen to pursue that.