[spambayes-dev] A URL experiment

Tue Dec 30 16:45:55 EST 2003

    Tim> Over on the spambayes list yesterday, we were discussing a
    Tim> particularly good identity-theft scam spam, purporting to be from
    Tim> PayPal.  It linked extensively to PayPal's real site, and about the
    Tim> only fishy lexical thing was a highly obfuscated href (full of %
    Tim> escapes).

    Tim> We don't do anything special with % escapes in URLs now.  Maybe we
    Tim> should.  The attached patch does.

I tried a somewhat different approach (patch attached) and got similar
results.  The fp and fn rates all tied, spam mean rose slightly, spam sdev
fell slightly, and ham did not change at all (*):

stds.txt -> pickurlss.txt
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams

false positive percentages
    0.000  0.000  tied          
    0.400  0.400  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          

won   0 times
tied  5 times
lost  0 times

total unique fp went from 1 to 1 tied          
mean fp % went from 0.08 to 0.08 tied          

false negative percentages
    3.333  3.333  tied          
    5.000  5.000  tied          
    7.333  7.333  tied          
    5.667  5.667  tied          
    4.000  4.000  tied          

won   0 times
tied  5 times
lost  0 times

total unique fn went from 76 to 76 tied          
mean fn % went from 5.06666666667 to 5.06666666667 tied          

ham mean                     ham sdev
   1.64    1.64   +0.00%        8.44    8.44   +0.00%
   0.99    0.99   +0.00%        8.29    8.29   +0.00%
   2.82    2.82   +0.00%       12.52   12.52   +0.00%
   1.58    1.58   +0.00%        8.29    8.29   +0.00%
   1.30    1.30   +0.00%        8.04    8.04   +0.00%

ham mean and sdev for all runs
   1.66    1.66   +0.00%        9.30    9.30   +0.00%

spam mean                    spam sdev
  93.80   93.82   +0.02%       19.39   19.35   -0.21%
  90.56   90.58   +0.02%       24.31   24.26   -0.21%
  89.24   89.27   +0.03%       27.03   27.04   +0.04%
  89.27   89.27   +0.00%       25.51   25.50   -0.04%
  92.72   92.74   +0.02%       21.67   21.67   +0.00%

spam mean and sdev for all runs
  91.12   91.14   +0.02%       23.81   23.80   -0.04%

ham/spam mean difference: 89.46 89.48 +0.02

(*) Operational question: Given that my training data is somewhat small at
the moment (roughly 1000-1500 each of ham and spam), would I be better off
testing with fewer, larger sets (e.g., 5 sets w/ 250 msgs each) or with more,
smaller sets (e.g., 10 sets w/ 125 msgs each)?
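
To make the trade-off concrete, here is a rough sketch of the arithmetic.  It
assumes the 1250 hams and 1500 spams implied by the test output above are
split evenly into sets, and that each run trains on every set but the one
being tested.  The name fold_sizes is made up for illustration and is not
part of spambayes.

```python
def fold_sizes(n_ham, n_spam, n_sets):
    """Per-run test/train sizes when messages are split evenly into n_sets."""
    ham_test = n_ham // n_sets
    spam_test = n_spam // n_sets
    return {
        # Messages scored in each run (one set).
        "test": (ham_test, spam_test),
        # Messages trained on in each run (all the other sets).
        "train": (ham_test * (n_sets - 1), spam_test * (n_sets - 1)),
        # Smallest possible change in the fn rate: one spam message.
        "fn_step_pct": 100.0 / spam_test,
    }


if __name__ == "__main__":
    for n_sets in (5, 10):
        print(n_sets, fold_sizes(1250, 1500, n_sets))
```

With more sets each run trains on slightly more data, but each test set is
smaller, so a single misclassified message moves that run's percentage
further.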

Skip

-------------- next part --------------
Index: spambayes/Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v
retrieving revision 1.97
diff -c -r1.97 Options.py
*** spambayes/Options.py        30 Dec 2003 16:26:33 -0000      1.97
--- spambayes/Options.py        30 Dec 2003 21:42:48 -0000
***************
*** 145,150 ****
--- 145,155 ----
       """(DEPRECATED) Extract day of the week tokens from the Date: header.""",
       BOOLEAN, RESTORE),

+     ("x-pick_apart_urls", "Extract clues about url structure", False,
+      """(EXPERIMENTAL) Note how many % escapes the url contains and whether
+      it has a non-standard port or user/password elements.""",
+      BOOLEAN, RESTORE),
+ 
      ("replace_nonascii_chars", "Replace non-ascii characters", False,
       """If true, replace high-bit characters (ord(c) >= 128) and control
       characters with question marks.  This allows non-ASCII character
Index: spambayes/tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.27
diff -c -r1.27 tokenizer.py
*** spambayes/tokenizer.py      30 Dec 2003 16:26:33 -0000      1.27
--- spambayes/tokenizer.py      30 Dec 2003 21:42:48 -0000
***************
*** 13,18 ****
--- 13,20 ----
  import time
  import os
  import binascii
+ import urlparse
+ import urllib
  try:
      from sets import Set
  except ImportError:
***************
*** 1014,1019 ****
--- 1016,1038 ----
          proto, guts = m.groups()
          tokens = ["proto:" + proto]
          pushclue = tokens.append
+ 
+         if options["Tokenizer", "x-pick_apart_urls"]:
+             url = proto + "://" + guts
+             num_pcs = url.count("%")
+             if num_pcs:
+                 pushclue("url:%d %%s" % num_pcs)
+             url = urllib.unquote(url)
+             scheme, netloc, path, params, query, frag = urlparse.urlparse(url)
+             user_pwd, host_port = urllib.splituser(netloc)
+             if user_pwd is not None:
+                 pushclue("url:has user")
+             host, port = urllib.splitport(host_port)
+             if port is not None:
+                 if scheme == "http" and port != '80':
+                     pushclue("url:non-standard http port")
+                 elif scheme == "https" and port != '443':
+                     pushclue("url:non-standard https port")

          # Lose the trailing punctuation for casual embedding, like:
          #     The code is at http://mystuff.org/here?  Didn't resolve.
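
The tokenizer hunk above uses the Python 2 urlparse and urllib modules.  For
anyone wanting to try the same checks outside the tokenizer, the following
is a minimal standalone sketch in Python 3.  It is an approximation, not the
patch itself: the helper name url_clues and the DEFAULT_PORTS table are
hypothetical, and splituser/splitport are imitated with str.rpartition rather
than called directly.

```python
from urllib.parse import unquote, urlsplit

# Default ports for the two schemes the patch checks.
DEFAULT_PORTS = {"http": "80", "https": "443"}


def url_clues(proto, guts):
    """Return structural clue tokens for one URL (hypothetical helper)."""
    clues = []
    url = proto + "://" + guts

    # Count % escapes before decoding, as the patch does.
    num_pcs = url.count("%")
    if num_pcs:
        clues.append("url:%d %%s" % num_pcs)

    parts = urlsplit(unquote(url))

    # Rough stand-in for splituser: split at the last "@" in the netloc.
    user_pwd, at_sign, host_port = parts.netloc.rpartition("@")
    if at_sign:
        clues.append("url:has user")

    # Rough stand-in for splitport: a trailing ":digits" is the port.
    host, colon, port = host_port.rpartition(":")
    if colon and port.isdigit():
        default = DEFAULT_PORTS.get(parts.scheme)
        if default is not None and port != default:
            clues.append("url:non-standard %s port" % parts.scheme)

    return clues


if __name__ == "__main__":
    # Made-up URL: an escaped "@" hides the real host behind a familiar name.
    print(url_clues("http", "www.paypal.com%40example.com:8080/login"))
```

Decoding before splitting is deliberate here, as in the patch: an escaped
"@" only shows up as a user/password element once the URL is unquoted.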