[Spambayes] Another software in the field
Justin Mason
jm@jmason.org
Fri Nov 15 18:06:24 2002
Tim Peters said:
> Hmm. I use Outlook 2000, and my last post had:
> Message-id: <BIEJKCLHCIOIHAGOKOLHEEEFDPAA.tim.one@comcast.net>
...
> These are all (I believe) Outlook users. No $ in sight! I believe Paul is
> alone in this group in using an Exchange server instead of straight SMTP.
Hmm, we thought they were Exchange-format ids; looks like O2K now uses
that format. (thinks) Maybe it's only Outlook Express that uses the $ id
format -- but the important point is that the $ format is frequently
spoofed in spam (about 29% of my spam load, for example). So it becomes a
great spam indicator. In fact, as Outlook users migrate *away* from that
format, it gets better ;)
BTW the O2K-format IDs have not been spoofed yet, as far as I can see,
so they would be a good ham sign, if the tokenizer could recognise them.
As far as I know they always match /^<[A-Z]{28}\.\S+\@\S+>$/.
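A minimal Python sketch of that check, assuming the pattern quoted above really does cover the O2K-style ids. The names O2K_MSGID_RE and looks_like_o2k_msgid are made up for illustration and are not part of the Spambayes tokenizer:

```python
import re

# Same pattern as the Perl-style regex in the message: "<", 28 uppercase
# letters, a dot, then something@something, then ">".
O2K_MSGID_RE = re.compile(r"^<[A-Z]{28}\.\S+@\S+>$")


def looks_like_o2k_msgid(msgid: str) -> bool:
    """Return True if msgid has the letters-only Outlook 2000 shape."""
    return O2K_MSGID_RE.match(msgid.strip()) is not None


# The Message-id quoted earlier in this thread matches the pattern.
print(looks_like_o2k_msgid("<BIEJKCLHCIOIHAGOKOLHEEEFDPAA.tim.one@comcast.net>"))
```

A tokenizer could emit one token when this returns True and treat it as a candidate ham clue, as suggested above.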
> What does "validate" mean in this context?
Compute what the value *should* be from the Date header, and compare it
with the time token in the Message-Id.
> Post the Perl code and I bet it will be easy to do in Python too. I'm not
> sure what you mean otherwise; for example, a FILETIME is conceptually a
> 64-bit integer, and by "top 4 bytes" it's unclear to me whether you mean the
> most-significant 4 bytes of that int, or the first 4 bytes in storage order
> (which happen to be the least-significant 4 bytes of the big int).
Most significant, i.e. the high 32 bits (dwHighDateTime). Perl code is at
the end of the mail...
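To make the two readings of "top 4 bytes" concrete, here is a small Python sketch. The FILETIME value is made up for illustration, and little-endian storage is assumed (as on x86 Windows):

```python
import struct

filetime = 0x01C28CD112345678  # made-up 64-bit FILETIME value

high_dword = filetime >> 32        # most-significant 4 bytes
low_dword = filetime & 0xFFFFFFFF  # least-significant 4 bytes

# In little-endian storage order the low dword comes first.
stored = struct.pack("<Q", filetime)
first_four_stored = struct.unpack("<I", stored[:4])[0]

print(hex(high_dword), hex(low_dword), hex(first_four_stored))
```

The Message-Id time token corresponds to high_dword here, not to the first four bytes in storage order.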
> [ten-pass]
> That's backwards, although it's tricky: for speed, timcv.py:
> + Trains on sets 2-10.
> + Predicts against set 1.
> + Incrementally trains set 1 (leaving the classifier trained on 1-10).
> + Incrementally *untrains* set 2 (leaving 1 + 3-10 trained).
> + Predicts against set 2.
> + Incrementally trains set 2 (leaving 1-10 trained again).
> + Incrementally untrains set 3 (leaving 1-2 + 4-10 trained).
> + Predicts against set 3.
> + Incrementally trains set 3 (leaving 1-10 trained again).
> and so on. This has huge performance benefits, in both instruction count
> and cache locality, versus running timcv.py with option
> build_each_classifier_from_scratch enabled.
OK -- I must have misread it. So timcv.py *is* training on 9 sets each
time. Good.
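The train/untrain sequence described in the quoted list can be sketched as below. This is not timcv.py's code: CountingClassifier and incremental_cv are hypothetical stand-ins that only track token counts and which sets are currently trained, to show that every prediction step sees a classifier trained on the other N-1 sets.

```python
from collections import Counter


class CountingClassifier:
    """Toy stand-in for a classifier that supports learn and unlearn."""

    def __init__(self):
        self.counts = Counter()
        self.trained_sets = set()

    def learn(self, set_id, tokens):
        self.counts.update(tokens)
        self.trained_sets.add(set_id)

    def unlearn(self, set_id, tokens):
        self.counts.subtract(tokens)
        self.trained_sets.discard(set_id)


def incremental_cv(sets):
    """Return [(held_out_id, trained_ids), ...], one entry per fold."""
    clf = CountingClassifier()
    ids = sorted(sets)
    # Start by training on every set except the first.
    for set_id in ids[1:]:
        clf.learn(set_id, sets[set_id])
    folds = []
    for n, held_out in enumerate(ids):
        if n > 0:
            previous = ids[n - 1]
            clf.learn(previous, sets[previous])    # back to all sets trained
            clf.unlearn(held_out, sets[held_out])  # drop the set under test
        # A real driver would predict against `held_out` at this point.
        folds.append((held_out, frozenset(clf.trained_sets)))
    return folds


sets = {i: ["tok%d" % i] for i in range(1, 11)}
for held_out, trained in incremental_cv(sets):
    print(held_out, sorted(trained))
```

Each fold costs one learn and one unlearn of a single set, rather than retraining on nine sets from scratch.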
> I was looking for a new hobby after I stopped beating my wife <wink>.
> timtest.py is an NxN grid driver, running N**2-N tests each training on 1
> and predicting against N-1. That's a good way to get lots of hard test runs
> if you have lots of data. timcv.py is vanilla cross-validation, running N
> tests each training on N-1 and predicting against 1. README.txt and
> TESTING.txt say more about all this.
bloody hell, timtest.py must take years to run ;) sounds interesting.
BTW I hadn't read TESTING.txt (for some reason) -- I like the bigrams
story.
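For a sense of scale, a small Python sketch that counts the runs each driver would make. It reads the description above as one run per ordered (train set, test set) pair for the grid, and one run per held-out set for cross-validation; grid_runs and cv_runs are illustrative names, not functions from timtest.py or timcv.py.

```python
from itertools import permutations


def grid_runs(n):
    """Ordered (train_set, test_set) pairs with train != test: n**2 - n runs."""
    return list(permutations(range(1, n + 1), 2))


def cv_runs(n):
    """One (train_sets, held_out_set) run per set, training on the other n - 1."""
    all_sets = range(1, n + 1)
    return [(tuple(s for s in all_sets if s != held), held) for held in all_sets]


n = 10
print(len(grid_runs(n)), len(cv_runs(n)))  # prints: 90 10
```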
> Poor man -- I'm glad you uncloaked! Did the Outlook Message-Ids fit a
> pattern you've seen? I'm keen to pursue that.
yep, see above ;)
BTW here's the Perl code. It's cut and pasted from the
current Mail::SpamAssassin::EvalTests, so it won't run as-is, but
it should be pretty easy to grok...
# valid Outlookish Message-Ids contain the top word of the system time
# when the message was sent!
# We can verify this, by decoding the Date header, extracting
# the time token from the Message-Id, and comparing them.
#
sub check_outlook_timestamp_token {
  my ($self) = @_;
  local ($_);
  my $id = $self->get ('Message-Id');
  return 0 unless ($id =~ /^<[0-9a-f]{4}([0-9a-f]{8})\$[0-9a-f]{8}\$[0-9a-f]{8}\@/);
  my $timetoken = hex($1);
  # convert UNIX time_t to Windows FILETIME. From MSDN:
  #
  #   LONGLONG ll = Int32x32To64(t, 10000000) + 116444736000000000;
  #   pft->dwLowDateTime = (DWORD) ll;
  #   pft->dwHighDateTime = ll >>32;
  #
  # IOW, ((tt * a) + b) / c = id, where tt is the time_t, a = 10000000,
  # b = 116444736000000000 and c = 2**32 (the ">>32" above).
  # Now to avoid using any kind of LONGLONG data type, we do this:
  #   => tt * (a/c) + (b/c) = id
  #   let x = (a/c) = 0.0023283064365387
  #   let y = (b/c) = 27111902.8329849
  #
  my $x = 0.0023283064365387;
  my $y = 27111902.8329849;
  # quite generous, but we just want to be in the right ballpark, so we
  # can handle mostly-correct values OK, but catch random strings.
  my $fudge = 200;
  $_ = $self->get ('Date');
  $_ = $self->_parse_rfc822_date($_); $_ ||= 0;
  my $expected = int (($_ * $x) + $y);
  my $diff = $timetoken - $expected;
  dbg("time token found: $timetoken expected (from Date): $expected: $diff");
  if (abs ($diff) < $fudge) { return 0; }
  # also try last date in Received header, Date could have been rewritten
  $_ = $self->get ('Received');
  /(\s.?\d+ \S\S\S \d+ \d+:\d+:\d+ \S+).*?$/;
  dbg("last date in Received: $1");
  $_ = $self->_parse_rfc822_date($_); $_ ||= 0;
  $expected = int (($_ * $x) + $y);
  $diff = $timetoken - $expected;
  dbg("time token found: $timetoken expected (from Received): $expected: $diff");
  if (abs ($diff) < $fudge) { return 0; }
  return 1;
}
# parse an RFC822 date into a time_t
sub _parse_rfc822_date {
  my ($self, $date) = @_;
  local ($_);
  my ($yyyy, $mmm, $dd, $hh, $mm, $ss, $mon, $tzoff);
  # make it a bit easier to match
  $_ = " $date "; s/, */ /gs; s/\s+/ /gs;
  # now match it in parts. Date part first:
  if (s/ (\d+) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (\d{4}) / /i) {
    $dd = $1; $mon = $2; $yyyy = $3;
  } elsif (s/ (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +(\d+) \d+:\d+:\d+ (\d{4}) / /i) {
    $dd = $2; $mon = $1; $yyyy = $3;
  } elsif (s/ (\d+) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (\d{2,3}) / /i) {
    $dd = $1; $mon = $2; $yyyy = $3;
  } else {
    dbg ("time cannot be parsed: $date");
    return undef;
  }
  # handle two and three digit dates as specified by RFC 2822
  if (defined $yyyy) {
    if (length($yyyy) == 2 && $yyyy < 50) {
      $yyyy += 2000;
    }
    elsif (length($yyyy) != 4) {
      # three digit years and two digit years with values between 50 and 99
      $yyyy += 1900;
    }
  }
  # hh:mm:ss
  if (s/ ([\d\s]\d):(\d\d)(:(\d\d))? / /) {
    $hh = $1; $mm = $2; $ss = $4 || 0;
  }
  # numeric timezones
  if (s/ ([-+]\d{4}) / /) {
    $tzoff = $1;
  }
  # all other timezones are considered equivalent to "-0000"
  $tzoff ||= '-0000';
  if (!defined $mmm && defined $mon) {
    my @months = qw(jan feb mar apr may jun jul aug sep oct nov dec);
    $mon = lc($mon);
    my $i; for ($i = 0; $i < 12; $i++) {
      if ($mon eq $months[$i]) { $mmm = $i+1; last; }
    }
  }
  $hh ||= 0; $mm ||= 0; $ss ||= 0; $dd ||= 0; $mmm ||= 0; $yyyy ||= 0;
  my $time;
  eval { # could croak
    $time = timegm ($ss, $mm, $hh, $dd, $mmm-1, $yyyy);
  };
  if ($@) {
    dbg ("time cannot be parsed: $date, $yyyy-$mmm-$dd $hh:$mm:$ss");
    return undef;
  }
  if ($tzoff =~ /([-+])(\d\d)(\d\d)$/) # convert to seconds difference
  {
    $tzoff = (($2 * 60) + $3) * 60;
    if ($1 eq '-') {
      $time += $tzoff;
    } else {
      $time -= $tzoff;
    }
  }
  return $time;
}
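For comparison, a rough Python sketch of the same check, under a few stated assumptions: it only compares against the Date header (the Received fallback is left out), it uses the standard library's email.utils.parsedate_to_datetime in place of _parse_rfc822_date, and it uses plain integer arithmetic rather than the x and y floats, since Python ints do not overflow. The function and constant names are made up for illustration.

```python
import re
from datetime import timezone
from email.utils import parsedate_to_datetime

# Same shape as the Perl regex: 4 hex digits, the 8-digit time token,
# then two more $-separated groups of 8 hex digits, then "@".
DOLLAR_ID_RE = re.compile(r"^<[0-9a-f]{4}([0-9a-f]{8})\$[0-9a-f]{8}\$[0-9a-f]{8}@")

TICKS_PER_SECOND = 10_000_000                # FILETIME counts 100 ns units
UNIX_EPOCH_TICKS = 116_444_736_000_000_000   # offset from the MSDN snippet
FUDGE = 200                                  # same tolerance as the Perl


def filetime_high_word(unix_seconds: int) -> int:
    """Top 32 bits of the FILETIME corresponding to a UNIX time_t."""
    return (unix_seconds * TICKS_PER_SECOND + UNIX_EPOCH_TICKS) >> 32


def outlook_token_mismatch(message_id: str, date_header: str) -> bool:
    """True if the id has the $ format but its token is far from the Date."""
    match = DOLLAR_ID_RE.match(message_id)
    if match is None:
        return False  # not a $-format id, so nothing to validate
    token = int(match.group(1), 16)
    try:
        when = parsedate_to_datetime(date_header)
        if when.tzinfo is None:  # "-0000" parses as naive; treat it as UTC
            when = when.replace(tzinfo=timezone.utc)
        unix_seconds = int(when.timestamp())
    except (TypeError, ValueError):
        unix_seconds = 0  # like the Perl, fall back to 0 for unparseable dates
    return abs(token - filetime_high_word(unix_seconds)) >= FUDGE
```

One unit of the token is 2**32 ticks, roughly 429 seconds, so a fudge of 200 tolerates about a day of drift between the Date header and the id.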