[Spambayes] Cunning use of quoted-printable

Greg Ward gward@python.net
Tue, 1 Oct 2002 11:41:24 -0400


On 01 October 2002, Richie Hindle said:
[... message with lots of quoted-printable in it ...]
> Looks like an attempt to fox systems like spambayes.  It doesn't make much
> difference, because the tokenizer decodes the quoted-printable, but it
> could trigger a clue token.

SpamAssassin has a test for this -- MIME_EXCESSIVE_QP:

rawbody  MIME_EXCESSIVE_QP      eval:check_for_mime_excessive_qp()
describe MIME_EXCESSIVE_QP      Excessive quoted-printable encoding in body
score    MIME_EXCESSIVE_QP      2.070

The implementation is pretty simple:

  sub check_for_mime_excessive_qp {
    my ($self) = @_;

    # Note: We don't use rawbody because it removes MIME parts.  Instead,
    # we get the raw unfiltered body.  We must not change any lines.
    my $body = join('', @{$self->{msg}->get_body()});

    my $length = length($body);
    my $qp = $body =~ s/\=([0-9A-Fa-f]{2,2})/$1/g;

    # this seems like a decent cutoff
    return ($length != 0 && ($qp > ($length / 20)));
  }
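
For comparison, here is a rough Python rendering of the same check.  It is
only a sketch: it assumes the raw, undecoded body is already available as a
single string, and the names excessive_qp and QP_ESCAPE are made up for
illustration (they are not part of SpamAssassin or spambayes).

```python
import re

# One quoted-printable escape: "=" followed by exactly two hex digits.
QP_ESCAPE = re.compile(r"=[0-9A-Fa-f]{2}")


def excessive_qp(raw_body, ratio=1 / 20):
    """Return True if QP escapes outnumber ratio * len(raw_body).

    Mirrors the Perl above: count the escapes in the raw body and
    compare that count against one twentieth of the body length.
    """
    length = len(raw_body)
    if length == 0:
        return False
    count = len(QP_ESCAPE.findall(raw_body))
    return count > length * ratio
```

So excessive_qp("=20" * 10) is true (10 escapes in 30 characters), while a
body of ordinary text with no escapes is not flagged.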

(Hey, now that Matt Sergeant is on the list, I can stop being the local
SpamAssassin expert!  *phew*!)

I guess there are a couple of ways to translate this to a
stream-of-tokens approach:
  * do a tokenizing pass over the raw message body, and spit out
    a whole lot of "=20" tokens
  * examine the raw body in a non-tokenizing way, and just emit
    a "lots of quoted-printable" token
  * ...?
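
To make the first two options concrete, here is a minimal sketch of each as
a Python generator over the raw body.  The function names, the "qp:" token
prefix, and the 1/20 cutoff (borrowed from the SpamAssassin test) are all
assumptions for illustration, not anything the spambayes tokenizer does.

```python
import re

# One quoted-printable escape: "=" followed by exactly two hex digits.
QP_ESCAPE = re.compile(r"=[0-9A-Fa-f]{2}")


def qp_escape_tokens(raw_body):
    """Option 1: emit one token per escape found in the raw body."""
    for match in QP_ESCAPE.finditer(raw_body):
        # Upper-case the hex digits so "=3d" and "=3D" become one token.
        yield "qp:" + match.group(0).upper()


def qp_density_tokens(raw_body, cutoff=1 / 20):
    """Option 2: emit a single summary token when escapes are dense."""
    if not raw_body:
        return
    count = len(QP_ESCAPE.findall(raw_body))
    if count > len(raw_body) * cutoff:
        yield "qp:excessive"
```

The first lets the classifier work out for itself which escapes matter, at
the cost of a flood of near-identical tokens; the second bakes in a cutoff
but adds at most one token per message.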

        Greg
-- 
Greg Ward <gward@python.net>                         http://www.gerg.ca/
Did YOU find a DIGITAL WATCH in YOUR box of VELVEETA?