[spambayes-dev] spammy subject lines
tim.one at comcast.net
Sun Oct 12 13:47:40 EDT 2003
> I added the following function:
> def leaveOnlyLetters(self, s):
>     # Return s with any characters not in string.letters removed.
>     import string
>     return filter(lambda c: c in string.letters, s)
> And appended
>     # Add words with non-letters removed.
>     for w in x.split():
>         yield 'subject:' + self.leaveOnlyLetters(w)
> At the end of the section to handle subject lines. It seems that a
> subject like "stati.stics" generates a clue like "subject:.",
Among others, yes.
> presumably via punctuation_run_re.findall(x)
That is the source of "subject:.".
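
To see where "subject:." comes from, the two regexps from tokenizer.py can be run on the example subject directly. This is only a minimal sketch, written for Python 3 rather than the Python 2 used in this thread; it skips the tokenize_word step that the real tokenizer applies to each word, and the names subject, words and runs are made up for illustration.

```python
import re

# The two patterns as they appear in the tokenizer.py diff context.
subject_word_re = re.compile(r"[\w\x80-\xff$.%]+")
punctuation_run_re = re.compile(r'\W+')

subject = "stati.stics"  # example subject from the thread

# '.' is part of the subject_word_re character class, so the word stays whole.
words = subject_word_re.findall(subject)

# \W+ matches runs of non-word characters, so only the '.' is returned.
runs = punctuation_run_re.findall(subject)

# Simplified token generation (the real code passes each word through
# tokenize_word first; that step is omitted here).
word_tokens = ['subject:' + w for w in words]
run_tokens = ['subject:' + r for r in runs]

print(word_tokens)
print(run_tokens)
```

Neither pattern ever yields the word with the '.' taken out, which is why a separate stripping step is needed to get a token for "statistics".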
> and my "cleaned up" subject token doesn't appear to get through, i.e. I
> don't think it is working as I expect.
Why do you think that? We can't see what you did, and you didn't spell out
> The leaveOnlyLetters method seems to work ok.
It should. Here's a patch for a more-efficient way; I know that it works; I
don't know whether it helps or hurts, though; I'm trying it now and too soon
RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.15
diff -c -u -r1.15 tokenizer.py
--- spambayes/tokenizer.py 18 Sep 2003 21:00:10 -0000 1.15
+++ spambayes/tokenizer.py 12 Oct 2003 17:42:41 -0000
@@ -642,6 +642,13 @@
 subject_word_re = re.compile(r"[\w\x80-\xff$.%]+")
 punctuation_run_re = re.compile(r'\W+')
+# A function to strip punctuation characters, for use in Subject line
+id_map = ''.join(map(chr, range(256))) # maps each char to itself
+def strip_punctuation(word):
+    import string
+    return string.translate(word, id_map, string.punctuation)
 fname_sep_re = re.compile(r'[/\\:]')
@@ -1154,6 +1161,8 @@
                 yield 'subject:' + t
         for w in punctuation_run_re.findall(x):
             yield 'subject:' + w
+        for w in x.split():
+            yield 'subject:' + strip_punctuation(w)
         # Dang -- I can't use Sender:. If I do,
         # 'sender:email name:python-list-admin'
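
For anyone trying this on a newer Python: string.translate with a delete-characters argument is a Python 2 idiom and is not available in Python 3. A rough equivalent, offered as a sketch and not as part of the patch, builds a deletion table with str.maketrans. The leave_only_letters helper below is a hypothetical Python 3 rendering of the quoted leaveOnlyLetters, using string.ascii_letters as a stand-in for Python 2's locale-dependent string.letters.

```python
import string

# Translation table that deletes every character in string.punctuation.
# (Python 3 stand-in for the id_map / string.translate pair in the patch.)
_delete_punct = str.maketrans('', '', string.punctuation)

def strip_punctuation(word):
    """Return word with all string.punctuation characters removed."""
    return word.translate(_delete_punct)

def leave_only_letters(word):
    """Hypothetical port of leaveOnlyLetters: keep ASCII letters only."""
    return ''.join(c for c in word if c in string.ascii_letters)

for w in "stati.stics v1agra!".split():
    print('subject:' + strip_punctuation(w), 'subject:' + leave_only_letters(w))
```

The two approaches agree on "stati.stics" but not in general: stripping punctuation keeps digits, while keeping only letters drops them too.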