[Spambayes] It gets funnier all the time....

Bill Yerazunis wsy at merl.com
Wed Feb 12 17:31:12 EST 2003


   From: Tim Stone - Four Stones Expressions <tim at fourstonesExpressions.com>

   >> But that means that if we want to be able to use the clues in
   >> spambayes, we either have to make a token base64-encoding-missing or
   >> we have to decode it to get the clues from the body.
   >
   >Generating a clue sounds best, assuming SB doesn't nail it already.

   I doubt that the tokenizer would generate any meaningful tokens from this
   message.  Generating a token would be the right way to do it; any ideas how?
Tim:

The problem in detecting unmarked base64 is that the base64 itself is
pretty much indistinguishable from one-word-per-line text.

The regex that CRM114 uses for base64 is

 \n\n(([a-zA-Z0-9+=\/]\{55,80\}\n)\{4,200\}.\{0,80\}\n)

(where \n has the usual C-ish meaning of an embedded newline)

The problem is that this regex may misfire on one-word-per-line text; that's
why it requires at least four such lines, uninterrupted, with each line of
at least 55 characters, and at least two leading newlines (that is, a blank
line before the block).
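
For anyone trying this on the Python side, here is a rough sketch of the
same pattern in Python's re syntax.  It is only my translation of the regex
above, not anything taken from the Spambayes tokenizer; the names
UNMARKED_BASE64, find_unmarked_base64, and unmarked_base64_tokens are made
up for illustration, and the token string is just the one suggested in the
quoted message.

```python
import re

# Translation of the regex above into Python syntax: a blank line, then
# 4 to 200 lines of 55 to 80 base64-alphabet characters, then one trailing
# line of up to 80 arbitrary characters (the usually shorter last line).
UNMARKED_BASE64 = re.compile(
    r"\n\n((?:[A-Za-z0-9+=/]{55,80}\n){4,200}.{0,80}\n)"
)


def find_unmarked_base64(body):
    """Return the first block that looks like unmarked base64, or None."""
    match = UNMARKED_BASE64.search(body)
    return match.group(1) if match else None


def unmarked_base64_tokens(body):
    """Hypothetical helper: emit one synthetic clue token if a block is found."""
    if find_unmarked_base64(body) is not None:
        return ["base64-encoding-missing"]
    return []
```

Ordinary one-word-per-line text should fall through because its lines are
far shorter than 55 characters, but I would still test it against a real
corpus before trusting it.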

You can also use the matched text as input to the 'mimencode -u' shell
command to actually un-encode the base64 and work against the text inside,
which is what CRM114 itself does (the presence of base64 without headers
marking it as such is a no-op).
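
If shelling out to mimencode is inconvenient, a minimal sketch of the same
decoding step using Python's standard base64 module might look like the
following.  This is a stand-in for 'mimencode -u', not what CRM114 does
internally; decode_base64_block is a hypothetical name, and decoding the
bytes as Latin-1 is an assumption made only so that every byte maps to some
character.

```python
import base64
import binascii


def decode_base64_block(block):
    """Decode a suspected base64 block to text, or return None on failure."""
    # Join the lines into one string by removing all whitespace.
    compact = "".join(block.split())
    # Base64 comes in groups of four characters; drop any ragged tail.
    compact = compact[: len(compact) - len(compact) % 4]
    try:
        raw = base64.b64decode(compact, validate=True)
    except binascii.Error:
        # Characters outside the base64 alphabet, or bad padding.
        return None
    # Assumption: Latin-1 never fails, so arbitrary bytes become text.
    return raw.decode("latin-1")
```

The decoded text can then be fed to the tokenizer like any other body text.
Note that this strict version gives up on a block whose trailing line holds
characters outside the base64 alphabet.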

Anyway, give it a try and see if it works for you.

	-Bill Yerazunis (mostly CRM114, but it's good to cross-pollinate)


