[Spambayes] Another software in the field

Fri Nov 15 18:06:24 2002

Tim Peters said:

> Hmm.  I use Outlook 2000, and my last post had:
>  Message-id: <BIEJKCLHCIOIHAGOKOLHEEEFDPAA.tim.one@comcast.net>
...
> These are all (I believe) Outlook users.  No $ in sight!  I believe Paul is
> alone in this group in using an Exchange server instead of straight SMTP.

Hmm, we thought they were Exchange-format ids; looks like O2K now uses
that format.  (thinks) maybe it's just Outlook Express that does the $ id
format -- but the important point is that it's frequently spoofed in spam
(about 29% of my spam load, for example).  So it becomes a great spam
indicator.  In fact, as Outlook users migrate *away* from that format,
it gets better ;)

BTW the O2K format IDs have not been spoofed yet, as far as I can see,
so they would be a good ham sign, if the tokenizer could recognise them.
As far as I know they always match /^<[A-Z]{28}\.\S+\@\S+>$/ .
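
For the Python side, a minimal sketch of that check might look like the
following.  The function name is made up for illustration and this is not
an existing tokenizer hook; it just applies the pattern quoted above.

```python
import re

# The pattern quoted above: 28 upper-case letters, a dot, then
# something@something, all inside angle brackets.
O2K_MSGID = re.compile(r"^<[A-Z]{28}\.\S+@\S+>$")


def looks_like_o2k_msgid(msgid):
    """Return True if msgid has the Outlook 2000 shape described above.

    Hypothetical helper, not part of any existing tokenizer.
    """
    return O2K_MSGID.match(msgid.strip()) is not None


print(looks_like_o2k_msgid(
    "<BIEJKCLHCIOIHAGOKOLHEEEFDPAA.tim.one@comcast.net>"))
```

A tokenizer could emit a single synthetic token when this returns True,
which would let the classifier learn how strong a ham clue it really is.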

> What does "validate" mean in this context?

compute what the value *should* be and compare.

> Post the Perl code and I bet it will be easy to do in Python too.  I'm not
> sure what you mean otherwise; for example, a FILETIME is conceptually a
> 64-bit integer, and by "top 4 bytes" it's unclear to me whether you mean the
> most-significant 4 bytes of that int, or the first 4 bytes in storage order
> (which happen to be the least-significant 4 bytes of the big int).

most significant.   Perl code is at the end of the mail...
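
Since Python has arbitrary-size ints, the "most significant 4 bytes" can
be computed exactly rather than with the floating-point trick the Perl
uses.  A rough sketch, with constant and function names invented here:

```python
# Constants from the MSDN time_t -> FILETIME recipe.
FILETIME_TICKS_PER_SECOND = 10_000_000           # 100ns ticks per second
FILETIME_UNIX_EPOCH = 116_444_736_000_000_000    # 1970-01-01 as a FILETIME


def filetime_high_dword(unix_time):
    """Most-significant 32 bits of the FILETIME for a UNIX time_t."""
    ll = int(unix_time) * FILETIME_TICKS_PER_SECOND + FILETIME_UNIX_EPOCH
    return ll >> 32


# One step of the high dword covers 2**32 ticks of 100ns each.
SECONDS_PER_UNIT = 2 ** 32 / FILETIME_TICKS_PER_SECOND

if __name__ == "__main__":
    print(filetime_high_dword(0))     # the UNIX epoch
    print(SECONDS_PER_UNIT * 200)     # seconds covered by a tolerance of 200
```

Each unit of the token is about 429.5 seconds, so a tolerance of 200
units on either side accepts anything within roughly a day of the
expected time.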

> [ten-pass]
> That's backwards, although it's tricky:  for speed, timcv.py:
> + Train on sets 2-10.
> + Predicts against set 1.
> + Incrementally trains set 1 (leaving the classifier trained on 1-10).
> + Incrementally *untrains* set 2 (leaving 1 + 3-10 trained).
> + Predicts against set 2.
> + Incrementally trains set 2 (leaving 1-10 trained again).
> + Incrementally untrains set 3 (leaving 1-2 + 4-10 trained).
> + Predicts against set 3.
> + Incrementally trains set 3 (leaving 1-10 trained again).
> and so on.  This has huge performance benefits, in both instruction count
> and cache locality, versus running timcv.py with option
> build_each_classifier_from_scratch enabled.

OK -- I must have misread it.  so timcv.py *is* training on 9 sets each
time.  good.
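
To convince myself, here is a toy sketch of that train/untrain loop.  The
classifier is a hypothetical stand-in that only remembers which sets it
has been trained on; none of this is timcv.py's actual code.

```python
class ToyClassifier:
    """Hypothetical stand-in that just records which sets it has learned."""

    def __init__(self):
        self.trained = set()

    def learn(self, set_id):
        self.trained.add(set_id)

    def unlearn(self, set_id):
        self.trained.remove(set_id)


def cross_validate(set_ids):
    """Return (held_out, trained_on) for each run of the scheme above."""
    clf = ToyClassifier()
    for set_id in set_ids[1:]:       # train on sets 2..N
        clf.learn(set_id)
    runs = []
    for i, held_out in enumerate(set_ids):
        if i > 0:
            clf.unlearn(held_out)    # incrementally untrain the next set
        runs.append((held_out, sorted(clf.trained)))  # "predict" here
        clf.learn(held_out)          # back to all N sets trained
    return runs


if __name__ == "__main__":
    for held_out, trained_on in cross_validate([1, 2, 3, 4]):
        print(held_out, trained_on)
```

Every run predicts against one set with the other N-1 trained, but only
one set is learned and one unlearned per step, instead of rebuilding the
classifier from scratch each time.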

> I was looking for a new hobby after I stopped beating my wife <wink>.
> timtest.py is an NxN grid driver, running N**2-N tests each training on 1
> and predicting against N-1.  That's a good way to get lots of hard test runs
> if you have lots of data.  timcv.py is vanilla cross-validation, running N
> tests each training on N-1 and predicting against 1.  README.txt  and
> TESTING.txt say more about all this.

bloody hell, timtest.py must take years to run ;)  sounds interesting.
BTW I hadn't read TESTING.txt (for some reason) -- I like the bigrams
story.
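
The difference in run counts between the two drivers is easy to see by
enumerating them.  This is only a sketch of the description above, with
invented function names, not code from either script:

```python
from itertools import permutations


def grid_runs(n):
    """timtest.py style: every ordered (train_set, test_set) pair."""
    return list(permutations(range(1, n + 1), 2))


def cv_runs(n):
    """timcv.py style: hold out one set, train on the other n-1."""
    all_sets = list(range(1, n + 1))
    return [([s for s in all_sets if s != held], held) for held in all_sets]


if __name__ == "__main__":
    print(len(grid_runs(10)), len(cv_runs(10)))
```

With 10 sets that is 90 train-on-1 runs for the grid against 10
train-on-9 runs for cross-validation.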

> Poor man -- I'm glad you uncloaked!  Did the Outlook Message-Ids fit a
> pattern you've seen?  I'm keen to pursue that.

yep, see above ;)

BTW here's the perl code.  it's cut and pasted from
current Mail::SpamAssassin::EvalTests, so it won't run as-is, but
it should be pretty easy to grok...

  # valid Outlookish Message-Ids contain the top word (the most-significant
  # 32 bits of the Windows FILETIME) of the system time when the message
  # was sent!
  # We can verify this by decoding the Date header, extracting
  # the time token from the Message-Id, and comparing them.
  #
  sub check_outlook_timestamp_token {
    my ($self) = @_;
    local ($_);

    my $id = $self->get ('Message-Id');
    return 0 unless ($id =~ /^<[0-9a-f]{4}([0-9a-f]{8})\$[0-9a-f]{8}\$[0-9a-f]{8}\@/);

    my $timetoken = hex($1);

    # convert UNIX time_t to Windows FILETIME.  From MSDN:
    #
    #     LONGLONG ll = Int32x32To64(t, 10000000) + 116444736000000000;
    #     pft->dwLowDateTime = (DWORD) ll;
    #     pft->dwHighDateTime = ll >>32;
    #
    # IOW, ((tt * a) + b) / c = id, where tt is the time_t, a = 10000000,
    # b = 116444736000000000 and c = 2**32 (the ">> 32" above).
    # Now to avoid using any kind of LONGLONG data type, we do this:
    #     => tt * (a/c) + (b/c) = id
    #     let x = (a/c) = 0.0023283064365387
    #     let y = (b/c) = 27111902.8329849
    #
    my $x = 0.0023283064365387;
    my $y = 27111902.8329849;

    # quite generous (one unit of the token is 2**32 * 100ns, about 429.5
    # seconds, so 200 units is roughly a day), but we just want to be in the
    # right ballpark, so we can handle mostly-correct values OK, but catch
    # random strings.
    my $fudge = 200;

    $_ = $self->get ('Date');
    $_ = $self->_parse_rfc822_date($_); $_ ||= 0;
    my $expected = int (($_ * $x) + $y);
    my $diff = $timetoken - $expected;
    dbg("time token found: $timetoken expected (from Date): $expected: $diff");
    if (abs ($diff) < $fudge) { return 0; }

    # also try last date in Received header, Date could have been rewritten
    $_ = $self->get ('Received');
    my $received_date = '';
    if (/(\s.?\d+ \S\S\S \d+ \d+:\d+:\d+ \S+).*?$/) { $received_date = $1; }
    dbg("last date in Received: $received_date");
    $_ = $self->_parse_rfc822_date($received_date); $_ ||= 0;
    $expected = int (($_ * $x) + $y);
    $diff = $timetoken - $expected;
    dbg("time token found: $timetoken expected (from Received): $expected: $diff");
    if (abs ($diff) < $fudge) { return 0; }

    return 1;
  }

  # parse an RFC822 date into a time_t
  sub _parse_rfc822_date {
    my ($self, $date) = @_;
    local ($_);
    my ($yyyy, $mmm, $dd, $hh, $mm, $ss, $mon, $tzoff);

    # make it a bit easier to match
    $_ = " $date "; s/, */ /gs; s/\s+/ /gs;

    # now match it in parts.  Date part first:
    if (s/ (\d+) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (\d{4}) / /i) {
      $dd = $1; $mon = $2; $yyyy = $3;
    } elsif (s/ (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +(\d+) \d+:\d+:\d+ (\d{4}) / /i) {
      $dd = $2; $mon = $1; $yyyy = $3;
    } elsif (s/ (\d+) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (\d{2,3}) / /i) {
      $dd = $1; $mon = $2; $yyyy = $3;
    } else {
      dbg ("time cannot be parsed: $date");
      return undef;
    }

    # handle two and three digit dates as specified by RFC 2822
    if (defined $yyyy) {
      if (length($yyyy) == 2 && $yyyy < 50) {
        $yyyy += 2000;
      }
      elsif (length($yyyy) != 4) {
        # three digit years and two digit years with values between 50 and 99
        $yyyy += 1900;
      }
    }

    # hh:mm:ss
    if (s/ ([\d\s]\d):(\d\d)(:(\d\d))? / /) {
      $hh = $1; $mm = $2; $ss = $4 || 0;
    }

    # numeric timezones
    if (s/ ([-+]\d{4}) / /) {
      $tzoff = $1;
    }
    # all other timezones are considered equivalent to "-0000"
    $tzoff ||= '-0000';

    if (!defined $mmm && defined $mon) {
      my @months = qw(jan feb mar apr may jun jul aug sep oct nov dec);
      $mon = lc($mon);
      my $i; for ($i = 0; $i < 12; $i++) {
        if ($mon eq $months[$i]) { $mmm = $i+1; last; }
      }
    }

    $hh ||= 0; $mm ||= 0; $ss ||= 0; $dd ||= 0; $mmm ||= 0; $yyyy ||= 0;

    my $time;
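    # timegm() comes from Time::Local (not shown in this excerpt) and
    # croaks on out-of-range values, hence the eval
    eval {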
      $time = timegm ($ss, $mm, $hh, $dd, $mmm-1, $yyyy);
    };

    if ($@) {
      dbg ("time cannot be parsed: $date, $yyyy-$mmm-$dd $hh:$mm:$ss");
      return undef;
    }

    # convert to seconds difference
    if ($tzoff =~ /([-+])(\d\d)(\d\d)$/) {
      $tzoff = (($2 * 60) + $3) * 60;
      if ($1 eq '-') {
        $time += $tzoff;
      } else {
        $time -= $tzoff;
      }
    }

    return $time;
  }
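
For anyone who wants to try this from Python, here is a rough,
untuned sketch of the same idea.  It is not a line-for-line port: it
leans on the standard library's email.utils.parsedate_tz instead of the
hand-rolled parser (so named zones like EST are honoured rather than
treated as -0000), it uses exact integer arithmetic, and the caller has
to supply the candidate date strings (Date, last Received date) itself.
The function names are made up for illustration.

```python
import calendar
import re
from email.utils import parsedate_tz

# Same shape as the Perl regex: 4 hex digits, then the 8-digit time token.
DOLLAR_MSGID = re.compile(
    r"^<[0-9a-f]{4}([0-9a-f]{8})\$[0-9a-f]{8}\$[0-9a-f]{8}@")
FUDGE = 200  # same tolerance as the Perl code


def parse_date(header):
    """Return a time_t for an RFC 822 date string, or None if unparseable."""
    parsed = parsedate_tz(header)
    if parsed is None:
        return None
    # The offset is None when no usable zone was found; treat that as UTC.
    offset = parsed[9] or 0
    return calendar.timegm(parsed[:6]) - offset


def token_looks_forged(msgid, date_headers):
    """True if msgid is $-format but its time token matches no given date."""
    match = DOLLAR_MSGID.match(msgid)
    if match is None:
        return False  # not a $-format id, nothing to check
    token = int(match.group(1), 16)
    for header in date_headers:
        when = parse_date(header) or 0
        # time_t -> FILETIME, then keep the most-significant 32 bits
        expected = (when * 10_000_000 + 116_444_736_000_000_000) >> 32
        if abs(token - expected) < FUDGE:
            return False
    return True
```

As in the Perl, an unparseable date is treated as time 0, so a $-format
id next to a garbage Date header still counts as a mismatch.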