[spambayes-bugs] [ spambayes-Patches-917637 ] snagging some types of word salad
SourceForge.net
noreply at sourceforge.net
Tue Mar 16 16:55:18 EST 2004
Patches item #917637, was opened at 2004-03-16 15:55
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=917637&group_id=61702
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Skip Montanaro (montanaro)
Assigned to: Nobody/Anonymous (nobody)
Summary: snagging some types of word salad
Initial Comment:
Based on a comment on the procmail mailing list, I
implemented the attached patch to try to detect one
type of word salad: the kind that contains random gibberish
(not random words). Judging by my current training
database, both tokens it generates are fairly spammy:
% spamcounts -d ~/tmp/tte.db -r 'long cons word'
db: /Users/skip/tmp/tte.db
token,nspam,nham,spam prob
long cons word,31,7,0.801780167082
subject:long cons word,10,0,0.978468899522
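
The attached patch is not reproduced in this message, so the following is only a rough sketch of the kind of check that could emit such tokens. It assumes that "long cons word" means a word containing a long run of consecutive consonants; the function name gibberish_tokens, the threshold MIN_RUN, and the choice to treat "y" as a vowel are all hypothetical and not taken from the patch.

```python
import re

# Assumption: a word counts as gibberish if it contains at least MIN_RUN
# consecutive consonants. The value 6 is a guess, picked so that ordinary
# English words such as "strengths" (run of 5) do not trigger it.
MIN_RUN = 6
_LONG_CONS = re.compile(r"[bcdfghjklmnpqrstvwxz]{%d,}" % MIN_RUN, re.IGNORECASE)


def gibberish_tokens(text, prefix=""):
    """Yield one synthetic token per word that has a long consonant run.

    `prefix` lets a caller mark where the text came from, for example
    "subject:" for a Subject header.
    """
    for word in text.split():
        if _LONG_CONS.search(word):
            yield prefix + "long cons word"


body_tokens = list(gibberish_tokens("hello xkcdfgq world"))
subject_tokens = list(gibberish_tokens("qwrtpsdf", prefix="subject:"))
```

Emitting one fixed token per hit, rather than the gibberish word itself, lets the classifier accumulate counts for the pattern even though each random string is seen only once.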
I don't have much problem with word salad but some folks
seem to. I think it's more of a training problem than a
tokenizing problem, but I thought I'd save this patch for
posterity (and delete it from my source) in case others want
to investigate it.
The other kind of word salad (random words) might best be
detected in the classifier, by keeping track of runs of
"natural" tokens (those which don't contain whitespace or
prefixes like "subject:") that the tokenizer generates but
that aren't in the training database. Spam with
such word salad will probably have fairly long runs of such
words, while in ham such runs will probably be broken up
frequently by common words. Just a thought.
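
A minimal sketch of that run-tracking idea, purely as an illustration: it is not part of the patch or of SpamBayes. The function name longest_unknown_run is hypothetical, the training database is stood in for by a plain set of known tokens, and a token is assumed to be "natural" when it contains neither whitespace nor a colon.

```python
def longest_unknown_run(tokens, known):
    """Return the length of the longest run of natural tokens not in `known`.

    Assumptions: `known` is any container supporting `in` (a set here,
    standing in for the training database), and a "natural" token is one
    with no whitespace and no ":" prefix marker.
    """
    longest = current = 0
    for tok in tokens:
        natural = ":" not in tok and not any(c.isspace() for c in tok)
        if not natural:
            # Design choice: synthetic tokens neither extend nor break a run.
            continue
        if tok in known:
            current = 0  # a known word breaks the run
        else:
            current += 1
            longest = max(longest, current)
    return longest


known = {"the", "a", "meeting"}
ham_like = ["the", "zorb", "a", "quux", "meeting", "flim"]
salad = ["the", "zorb", "subject:hi", "quux", "long cons word", "flim", "a"]
ham_run = longest_unknown_run(ham_like, known)
salad_run = longest_unknown_run(salad, known)
```

A classifier could then turn the run length into a token of its own (for example, bucketed by size) and let training decide how spammy long runs actually are.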