[spambayes-dev] spammy subject lines

Sun Oct 12 13:47:40 EDT 2003

[Paul Sorenson]
> I added the following function:
>     def leaveOnlyLetters(self, s):
>         # Return s with any characters not in string.letters removed.
>         import string
>         return filter(lambda c: c in string.letters, s)
>
> And appended
>             # Add words with non-letters removed.
>             for w in x.split():
>                 yield 'subject:' + self.leaveOnlyLetters(w)
>
> At the end of the section to handle subject lines.  It seems that a
> subject like "stati.stics" generates a clue like "subject:.",

Among others, yes.

> presumably via punctuation_run_re.findall(x)

That is the source of "subject:.".
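
To see that concretely, here is a minimal sketch that reuses only the punctuation_run_re pattern from tokenizer.py. The helper name punctuation_tokens is made up for illustration and is not part of spambayes; it just mimics the loop that yields one 'subject:' token per run of non-word characters.

```python
import re

# Same pattern as punctuation_run_re in tokenizer.py.
punctuation_run_re = re.compile(r'\W+')

def punctuation_tokens(subject):
    # Hypothetical helper: one 'subject:' token per run of
    # non-word characters found in the subject line.
    return ['subject:' + w for w in punctuation_run_re.findall(subject)]

print(punctuation_tokens("stati.stics"))
```

The only run of non-word characters in "stati.stics" is the single period, so that loop contributes exactly one token for it.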

> and my "cleaned up" subject token doesn't appear to get through, i.e. I
> don't think it is working as I expect.

Why do you think that?  We can't see what you did, and you didn't spell out
your evidence.

> The leaveOnlyLetters method seems to work ok.

It should.  Here's a patch for a more efficient way.  I know that it works,
but I don't know yet whether it helps or hurts:  I'm testing it now, and it's
too soon to say.
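
For anyone reading this on Python 3, where string.letters and string.translate no longer exist, the two approaches might be sketched roughly as below. This is an approximation, not the code in the patch: string.ascii_letters stands in for the locale-dependent string.letters, str.maketrans replaces the identity-map trick, and the name leave_only_letters is a respelling for illustration.

```python
import string

# Deletion table built once: every character in string.punctuation
# is mapped to None, i.e. removed by str.translate.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def leave_only_letters(word):
    # Keep only ASCII letters (assumption: ascii_letters approximates
    # the Python 2 string.letters used in the original function).
    return ''.join(c for c in word if c in string.ascii_letters)

def strip_punctuation(word):
    # Remove ASCII punctuation only; letters, digits and anything
    # else pass through unchanged.
    return word.translate(_PUNCT_TABLE)

print(leave_only_letters("stati.stics"), strip_punctuation("stati.stics"))
print(leave_only_letters("v1agra!"), strip_punctuation("v1agra!"))
```

Note that the two are not equivalent: stripping punctuation keeps digits and 8-bit characters, while keeping only letters drops them too.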


Index: spambayes/tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.15
diff -c -u -r1.15 tokenizer.py

--- spambayes/tokenizer.py      18 Sep 2003 21:00:10 -0000      1.15
+++ spambayes/tokenizer.py      12 Oct 2003 17:42:41 -0000
@@ -642,6 +642,13 @@
 subject_word_re = re.compile(r"[\w\x80-\xff$.%]+")
 punctuation_run_re = re.compile(r'\W+')

+# A function to strip punctuation characters, for use in Subject line
+# de-obfuscation.
+id_map = ''.join(map(chr, range(256)))  # maps each char to itself
+def strip_punctuation(word):
+    import string
+    return string.translate(word, id_map, string.punctuation)
+
 fname_sep_re = re.compile(r'[/\\:]')

 def crack_filename(fname):
@@ -1154,6 +1161,8 @@
                     yield 'subject:' + t
             for w in punctuation_run_re.findall(x):
                 yield 'subject:' + w
+            for w in x.split():
+                yield 'subject:' + strip_punctuation(w)

         # Dang -- I can't use Sender:.  If I do,
         #     'sender:email name:python-list-admin'