[spambayes-dev] A URL experiment

Tue Dec 30 14:17:56 EST 2003

Over on the spambayes list yesterday, we were discussing a particularly good
identity-theft scam spam, purporting to be from PayPal.  It linked
extensively to PayPal's real site, and about the only fishy lexical thing
was a highly obfuscated href (full of % escapes).

We don't do anything special with % escapes in URLs now.  Maybe we should.
The attached patch does.
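
To make the idea concrete, here is a minimal standalone sketch of what the patch does to the part of a URL after the protocol. It is not the patch itself: the function name escape_clues and the sample URL are made up for illustration, and it uses Python 3's urllib.parse.unquote where the patch calls urllib.unquote.

```python
import re
from urllib.parse import unquote  # Python 3 home of the old urllib.unquote

def escape_clues(guts):
    """Return (tokens, unquoted_guts) for the text after the protocol.

    Hypothetical helper, mirroring the two steps in the patch.
    """
    # One token per %xx escape, so a heavily obfuscated URL yields many
    # correlated tokens and the classifier can learn which ones are spammy.
    tokens = ["url:" + esc for esc in re.findall(r"%..", guts)]
    # Decode the escapes so later tokenizing sees the unobfuscated URL.
    return tokens, unquote(guts)

# Made-up obfuscated host: "%65%78ample.com" decodes to "example.com".
tokens, plain = escape_clues("%65%78ample.com/login")
print(tokens)  # ['url:%65', 'url:%78']
print(plain)   # example.com/login
```

The escape tokens and the decoded URL are both kept, so a spammer loses either way: the escapes themselves become clues, and the hidden hostname gets tokenized as usual.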

I don't have enough personal email saved to make for a good test, but who
cares <wink>.  I just took what I had, slammed it randomly into 10 even
sets, and did "the usual" cross-validation business on it.  All of this
email is less than a week old, is all the email I've gotten in that time, is
atypical for me (Christmas time -> a lot less email than usual, but a spike
in personal email), and runs 3:1 in favor of ham.  None of that matters,
though -- *whatever* you have, and however you train, the interesting
question is just how it does with the patch, compared to without it.
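
For anyone who hasn't done "the usual" before, the split amounts to something like the sketch below. This is only an illustration of 10-fold cross-validation in general, not the project's test driver; ten_fold and cv_rounds are hypothetical names, and the integers stand in for messages.

```python
import random

def ten_fold(msgs, nsets=10, seed=0):
    """Shuffle msgs and deal them into nsets nearly equal sets."""
    msgs = list(msgs)
    random.Random(seed).shuffle(msgs)
    return [msgs[i::nsets] for i in range(nsets)]

def cv_rounds(sets):
    """Yield (train, test) pairs: each set is tested once against the rest."""
    for i, test in enumerate(sets):
        train = [m for j, s in enumerate(sets) if j != i for m in s]
        yield train, test

# Stand-ins for the 1510 hams used in this test.
ham_sets = ten_fold(range(1510))
train, test = next(cv_rounds(ham_sets))
print(len(test), len(train))  # 151 1359
```

Run once without the patch and once with it on the same sets, and the only thing that differs between the two runs is the tokenizer.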

I ran my 10-fold CV with "the default" settings for Outlook.  These match
the current (CVS) project defaults, with the addition of

[Tokenizer]
replace_nonascii_chars: True
record_header_absence: True

I'm *not* using mine_received_headers or x-use_bigrams in these tests.

befores -> afters
-> <stat> tested 151 hams & 52 spams against 1359 hams & 468 spams
[19 repetitions of that]

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.662  0.662  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied 10 times
lost  0 times

total unique fp went from 1 to 1 tied
mean fp % went from 0.0662251655629 to 0.0662251655629 tied

false negative percentages
    1.923  1.923  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    1.923  1.923  tied
    1.923  1.923  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied 10 times
lost  0 times

total unique fn went from 3 to 3 tied
mean fn % went from 0.576923076924 to 0.576923076924 tied

ham mean                     ham sdev
   0.44    0.44   +0.00%        4.52    4.52   +0.00%
   0.34    0.34   +0.00%        4.11    4.11   +0.00%
   0.27    0.27   +0.00%        3.16    3.16   +0.00%
   0.17    0.17   +0.00%        1.51    1.51   +0.00%
   1.06    1.06   +0.00%        9.11    9.12   +0.11%
   0.00    0.00 +(was 0)        0.01    0.01   +0.00%
   0.78    0.78   +0.00%        8.16    8.16   +0.00%
   0.42    0.43   +2.38%        5.19    5.21   +0.39%
   0.01    0.01   +0.00%        0.11    0.11   +0.00%
   0.07    0.07   +0.00%        0.90    0.90   +0.00%

ham mean and sdev for all runs
   0.36    0.36   +0.00%        4.77    4.78   +0.21%

spam mean                    spam sdev
  96.41   96.43   +0.02%       13.52   13.51   -0.07%
  98.51   98.56   +0.05%        6.99    6.99   +0.00%
  97.80   97.80   +0.00%        6.42    6.41   -0.16%
  98.21   98.22   +0.01%        7.31    7.30   -0.14%
  93.00   93.03   +0.03%       16.68   16.66   -0.12%
  97.40   97.41   +0.01%        8.29    8.27   -0.24%
  97.58   97.70   +0.12%       12.30   12.18   -0.98%
  97.01   97.02   +0.01%       14.38   14.37   -0.07%
  95.90   96.03   +0.14%       11.61   11.46   -1.29%
  98.86   98.86   +0.00%        6.12    6.11   -0.16%

spam mean and sdev for all runs
  97.07   97.11   +0.04%       11.09   11.05   -0.36%

ham/spam mean difference: 96.71 96.75 +0.04

Not much to talk about there!  Pretty much indistinguishable, although the
spam mean went up a tad consistently, and the spam sdev down a tad
consistently.
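
In case the percentages look odd: with only 151 hams and 52 spams per test set, a single mistake moves a run's rate by a lot. The arithmetic behind the fp and fn figures above can be checked like this (the per-run error counts are read off the tables; the variable names are just for illustration):

```python
hams_per_fold, spams_per_fold, folds = 151, 52, 10

# Errors per run, as implied by the percentage columns above.
fp_per_fold = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
fn_per_fold = [1, 0, 0, 0, 0, 0, 1, 1, 0, 0]

fp_pcts = [100.0 * n / hams_per_fold for n in fp_per_fold]
fn_pcts = [100.0 * n / spams_per_fold for n in fn_per_fold]

mean_fp = sum(fp_pcts) / folds
mean_fn = sum(fn_pcts) / folds
print(round(fp_pcts[6], 3), round(mean_fp, 4))  # 0.662 0.0662
print(round(fn_pcts[0], 3), round(mean_fn, 4))  # 1.923 0.5769
```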

table.py's "best cost" output shows that the patch would have reduced the
optimal cost by 1 unsure (from $17.60 to $17.40), had I also changed my
cutoffs:

filename:   before   after
ham:spam:  1510:520
                   1510:520
fp total:        1       1
fp %:         0.07    0.07
fn total:        3       3
fn %:         0.58    0.58
unsure t:       39      39
unsure %:     1.92    1.92
real cost:  $20.80  $20.80
best cost:  $17.60  $17.40
h mean:       0.36    0.36
h sdev:       4.77    4.78
s mean:      97.07   97.11
s sdev:      11.09   11.05
mean diff:   96.71   96.75
k:            6.10    6.11

So the change would have been the tiniest of wins for me.  For you?
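
For reference, the "real cost" line is consistent with charging $10 per fp, $1 per fn and $0.20 per unsure. Those weights are my assumption here (they reproduce the reported $20.80, but check your own settings), and cost is a hypothetical helper, not table.py's code:

```python
def cost(fp, fn, unsure, fp_weight=10.0, fn_weight=1.0, unsure_weight=0.20):
    """Total cost of a run under assumed per-mistake weights."""
    return fp * fp_weight + fn * fn_weight + unsure * unsure_weight

real = cost(fp=1, fn=3, unsure=39)
print("$%.2f" % real)  # $20.80

# The before/after gap in "best cost" equals one unsure at that weight.
gap = 17.60 - 17.40
print("$%.2f" % gap)   # $0.20
```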

BTW, the fp here was an "end of year sale" blaring HTML ad from Gateway.
That's ham to me, but there are no other msgs from Gateway in this email.
It contains enough Gateway-specific lexicalisms that training on one is
enough to score future ones as solid ham.  The PayPal scam that started this
remained a solid FN.
-------------- next part --------------
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.27
diff -c -u -r1.27 tokenizer.py

--- tokenizer.py	30 Dec 2003 16:26:33 -0000	1.27
+++ tokenizer.py	30 Dec 2003 18:45:59 -0000
@@ -1011,9 +1011,25 @@
         Stripper.__init__(self, url_re.search, re.compile("").search)
 
     def tokenize(self, m):
+        import urllib
+
         proto, guts = m.groups()
         tokens = ["proto:" + proto]
         pushclue = tokens.append
+
+        # %nn escapes are usually intentional obfuscation.  Generate a lot
+        # of correlated tokens if the URL contains a lot of them.  The
+        # classifier will learn which specific ones are and aren't spammy.
+        escapes = re.findall(r'%..', guts)
+        tokens.extend(["url:" + escape for escape in escapes])
+
+        try:
+            # Tokenize the unobfuscated URL.
+            guts = urllib.unquote(guts)
+        except:
+            pushclue("url:invalid escapes")
+            # And guts is unchanged; however, I don't think urllib.unquote()
+            # ever raises an exception now.
 
         # Lose the trailing punctuation for casual embedding, like:
         #     The code is at http://mystuff.org/here?  Didn't resolve.