[Spambayes] Re: [SAtalk] spampot -- spam honeypot server (fwd)

Justin Mason jm at jmason.org
Tue Jan 21 11:20:53 EST 2003


Matt Sergeant said:
> My guess is you'd need to put some sort of Razor-like signature 
> checking in place (perhaps using Pyzor) to remove dupes.

Actually, I have some rough-but-working-well-enough Perl code in
SpamAssassin CVS, in the "masses/corpora" dir, which does this.
"fuzzy-hash-maildir" is the script in question.  Here's how it works:

  - for each mail:

    - strip all HTML tags

    - strip text in "quotes" -- vars in javascript, etc.

    - remove words with ? marks inside them, possible encoded mail addrs

    - remove words with @ marks inside them, possible encoded mail addrs

    - remove lines that contain just a single string of non-white chars,
      possible hash busters or encoded mail addrs

    - split into an array of lines (NOT bytes, since spammers are using
      variable-length hash-busting strings)

    - divide into 4 blocks and hash them: hash1, hash2, hash3, hash4

    - output into associative arrays as
	hash1.hash2 -> filename
	hash1.hash2.hash3 -> filename
	hash1.hash2.hash3.hash4 -> filename
      (should probably use e.g. hash2.hash3.hash4 as well.  Note that
      hashbusters and encoded addrs generally appear in the first and/or
      last blocks.)

  - finally check those arrays for collisions and output these as "likely
    dups".
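
For illustration, the steps above can be sketched roughly as follows in
Python. This is not the actual fuzzy-hash-maildir Perl script. The function
names, the regexes, the choice of SHA-256, the even line split, and the rule
of skipping mails with fewer than 4 surviving lines are all assumptions made
for the sketch.

```python
import hashlib
import re
from collections import defaultdict


def normalize(body):
    """Apply the stripping steps and return the surviving lines."""
    text = re.sub(r"<[^>]*>", " ", body)        # strip all HTML tags
    text = re.sub(r'"[^"\n]*"', " ", text)      # strip text in "quotes"
    lines = []
    for line in text.splitlines():
        # remove words containing '?' or '@' (possible encoded mail addrs)
        words = [w for w in line.split() if "?" not in w and "@" not in w]
        # drop empty lines and lines that are one run of non-white chars
        # (assumption: checked after the word removal above)
        if len(words) > 1:
            lines.append(" ".join(words))
    return lines


def block_hashes(lines, nblocks=4):
    """Split the lines (not bytes) into nblocks blocks and hash each one."""
    n = len(lines)
    hashes = []
    for i in range(nblocks):
        block = lines[i * n // nblocks:(i + 1) * n // nblocks]
        digest = hashlib.sha256("\n".join(block).encode("utf-8")).hexdigest()
        hashes.append(digest[:12])              # truncated for readability
    return hashes


def find_likely_dups(mails):
    """mails maps filename -> body; returns groups of colliding filenames."""
    seen = defaultdict(list)
    for name, body in mails.items():
        lines = normalize(body)
        if len(lines) < 4:
            continue                            # assumption: too short to compare
        h = block_hashes(lines)
        # hash1.hash2, hash1.hash2.hash3, hash1.hash2.hash3.hash4 -> filename
        for k in (2, 3, 4):
            seen[".".join(h[:k])].append(name)
    groups = {tuple(names) for names in seen.values() if len(names) > 1}
    return sorted(groups)
```

Calling find_likely_dups() on a dict of filename -> message body returns
tuples of filenames that share at least the hash1.hash2 key, so two mails
that differ only in their last block are still reported together.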

It works sufficiently well. ;)

--j.


