[spambayes-dev] A URL experiment

Skip Montanaro skip at pobox.com
Fri Jan 2 08:58:22 EST 2004


Happy New Year, everyone...

As Tim predicted, mixing his URL-cracking ideas with mine performs better
than either set of ideas in isolation, though the gain is small: one fewer
false negative (99 to 98) and no change in false positives.  Using the
attached patch, I get this summary output for a 10x10 timcv run:

    stds.txt -> pickurlss.txt
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 126 spams against 1080 hams & 1142 spams
    -> <stat> tested 120 hams & 126 spams against 1080 hams & 1142 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 126 spams against 1080 hams & 1142 spams
    -> <stat> tested 120 hams & 126 spams against 1080 hams & 1142 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams

    false positive percentages
        0.000  0.000  tied          
        0.000  0.000  tied          
        0.000  0.000  tied          
        0.000  0.000  tied          
        1.667  1.667  tied          
        0.833  0.833  tied          
        0.833  0.833  tied          
        0.000  0.000  tied          
        0.000  0.000  tied          
        0.833  0.833  tied          

    won   0 times
    tied 10 times
    lost  0 times

    total unique fp went from 5 to 5 tied          
    mean fp % went from 0.416666666667 to 0.416666666667 tied          

    false negative percentages
        7.874  7.874  tied          
        6.299  6.299  tied          
        9.449  9.449  tied          
        9.449  9.449  tied          
        10.236  10.236  tied          
        5.512  5.512  tied          
        7.087  6.299  won    -11.12%
        5.556  5.556  tied          
        7.937  7.937  tied          
        8.661  8.661  tied          

    won   1 times
    tied  9 times
    lost  0 times

    total unique fn went from 99 to 98 won     -1.01%
    mean fn % went from 7.80589926259 to 7.72715910511 won     -1.01%

    ham mean                     ham sdev
       2.11    2.12   +0.47%       12.36   12.36   +0.00%
       3.28    3.33   +1.52%       14.07   14.13   +0.43%
       1.11    1.13   +1.80%        6.75    6.86   +1.63%
       1.13    1.12   -0.88%        5.90    5.86   -0.68%
       3.44    3.43   -0.29%       14.07   14.06   -0.07%
       3.66    3.65   -0.27%       15.31   15.30   -0.07%
       3.68    3.67   -0.27%       13.65   13.62   -0.22%
       1.10    1.10   +0.00%        6.93    6.93   +0.00%
       1.70    1.78   +4.71%        8.80    9.02   +2.50%
       3.49    3.49   +0.00%       14.57   14.58   +0.07%

    ham mean and sdev for all runs
       2.47    2.48   +0.40%       11.83   11.85   +0.17%

    spam mean                    spam sdev
      84.79   84.96   +0.20%       29.71   29.56   -0.50%
      88.72   88.85   +0.15%       26.91   26.91   +0.00%
      83.53   83.99   +0.55%       30.40   30.26   -0.46%
      85.69   85.97   +0.33%       29.57   29.60   +0.10%
      84.47   84.59   +0.14%       30.42   30.45   +0.10%
      89.08   89.25   +0.19%       24.73   24.56   -0.69%
      87.08   87.73   +0.75%       27.80   27.05   -2.70%
      88.44   88.48   +0.05%       25.70   25.67   -0.12%
      87.20   87.23   +0.03%       28.53   28.54   +0.04%
      86.46   86.47   +0.01%       27.85   27.88   +0.11%

    spam mean and sdev for all runs
      86.54   86.75   +0.24%       28.28   28.17   -0.39%

    ham/spam mean difference: 84.07 84.27 +0.20
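
To check how the summary lines follow from the per-run columns, here is a
minimal Python sketch that recomputes the false negative summary from the
rounded percentages printed above.  The helper names (mean, pct_change,
tally) are made up for illustration and are not part of the spambayes test
tools, which work from the unrounded data, so trailing digits may differ
slightly.

```python
# Per-run false negative percentages, copied from the table above:
# left column is the baseline (stds.txt), right column is pickurlss.txt.
before = [7.874, 6.299, 9.449, 9.449, 10.236, 5.512, 7.087, 5.556, 7.937, 8.661]
after = [7.874, 6.299, 9.449, 9.449, 10.236, 5.512, 6.299, 5.556, 7.937, 8.661]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def pct_change(old, new):
    """Relative change in percent; negative means the new value is lower."""
    return 100.0 * (new - old) / old


def tally(old_runs, new_runs):
    """Count runs where the new value is lower (won), equal (tied), higher (lost)."""
    won = sum(1 for o, n in zip(old_runs, new_runs) if n < o)
    lost = sum(1 for o, n in zip(old_runs, new_runs) if n > o)
    tied = len(old_runs) - won - lost
    return won, tied, lost


print("won %d, tied %d, lost %d" % tally(before, after))
print("mean fn %% went from %.3f to %.3f (%+.2f%%)"
      % (mean(before), mean(after), pct_change(mean(before), mean(after))))
```

For false negatives a lower number is better, which is why a decrease is
reported as "won".  The same functions can be pointed at any other pair of
columns in the summary.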

I also ran with bigrams enabled.  That helped much more on false negatives
(99 down to 64), at the cost of one extra false positive (5 up to 6):

    stds.txt -> pickbis.txt
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 126 spams against 1080 hams & 1142 spams
    -> <stat> tested 120 hams & 126 spams against 1080 hams & 1142 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
    -> <stat> tested 120 hams & 126 spams against 1080 hams & 1142 spams
    -> <stat> tested 120 hams & 126 spams against 1080 hams & 1142 spams
    -> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams

    false positive percentages
        0.000  0.000  tied          
        0.000  0.000  tied          
        0.000  0.000  tied          
        0.000  0.000  tied          
        1.667  1.667  tied          
        0.833  0.833  tied          
        0.833  0.833  tied          
        0.000  0.833  lost  +(was 0)
        0.000  0.000  tied          
        0.833  0.833  tied          

    won   0 times
    tied  9 times
    lost  1 times

    total unique fp went from 5 to 6 lost   +20.00%
    mean fp % went from 0.416666666667 to 0.5 lost   +20.00%

    false negative percentages
        7.874  6.299  won    -20.00%
        6.299  4.724  won    -25.00%
        9.449  6.299  won    -33.34%
        9.449  5.512  won    -41.67%
        10.236  4.724  won    -53.85%
        5.512  1.575  won    -71.43%
        7.087  5.512  won    -22.22%
        5.556  5.556  tied          
        7.937  7.937  tied          
        8.661  2.362  won    -72.73%

    won   8 times
    tied  2 times
    lost  0 times

    total unique fn went from 99 to 64 won    -35.35%
    mean fn % went from 7.80589926259 to 5.04999375078 won    -35.31%

    ham mean                     ham sdev
       2.11    1.61  -23.70%       12.36   10.88  -11.97%
       3.28    2.85  -13.11%       14.07   12.69   -9.81%
       1.11    1.05   -5.41%        6.75    6.13   -9.19%
       1.13    1.00  -11.50%        5.90    4.72  -20.00%
       3.44    3.19   -7.27%       14.07   14.75   +4.83%
       3.66    3.45   -5.74%       15.31   15.27   -0.26%
       3.68    2.67  -27.45%       13.65   11.70  -14.29%
       1.10    1.85  +68.18%        6.93   10.11  +45.89%
       1.70    1.93  +13.53%        8.80    9.23   +4.89%
       3.49    3.31   -5.16%       14.57   14.97   +2.75%

    ham mean and sdev for all runs
       2.47    2.29   -7.29%       11.83   11.60   -1.94%

    spam mean                    spam sdev
      84.79   86.82   +2.39%       29.71   27.17   -8.55%
      88.72   90.26   +1.74%       26.91   24.30   -9.70%
      83.53   87.45   +4.69%       30.40   26.76  -11.97%
      85.69   88.25   +2.99%       29.57   27.35   -7.51%
      84.47   88.02   +4.20%       30.42   25.64  -15.71%
      89.08   92.22   +3.52%       24.73   21.06  -14.84%
      87.08   91.45   +5.02%       27.80   23.48  -15.54%
      88.44   89.02   +0.66%       25.70   26.08   +1.48%
      87.20   87.78   +0.67%       28.53   28.58   +0.18%
      86.46   90.65   +4.85%       27.85   23.02  -17.34%

    spam mean and sdev for all runs
      86.54   89.19   +3.06%       28.28   25.50   -9.83%

    ham/spam mean difference: 84.07 86.90 +2.83

Skip

-------------- next part --------------
Index: spambayes/Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v
retrieving revision 1.97
diff -c -r1.97 Options.py
*** spambayes/Options.py        30 Dec 2003 16:26:33 -0000      1.97
--- spambayes/Options.py        2 Jan 2004 13:57:56 -0000
***************
*** 145,150 ****
--- 145,155 ----
       """(DEPRECATED) Extract day of the week tokens from the Date: header.""",
       BOOLEAN, RESTORE),
  
+     ("x-pick_apart_urls", "Extract clues about url structure", False,
+      """(EXPERIMENTAL) Note whether url contains non-standard port or
+      user/password elements.""",
+      BOOLEAN, RESTORE),
+ 
      ("replace_nonascii_chars", "Replace non-ascii characters", False,
       """If true, replace high-bit characters (ord(c) >= 128) and control
       characters with question marks.  This allows non-ASCII character
Index: spambayes/tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.27
diff -c -r1.27 tokenizer.py
*** spambayes/tokenizer.py      30 Dec 2003 16:26:33 -0000      1.27
--- spambayes/tokenizer.py      2 Jan 2004 13:57:56 -0000
***************
*** 13,18 ****
--- 13,20 ----
  import time
  import os
  import binascii
+ import urlparse
+ import urllib
  try:
      from sets import Set
  except ImportError:
***************
*** 1012,1025 ****
  
      def tokenize(self, m):
          proto, guts = m.groups()
          tokens = ["proto:" + proto]
          pushclue = tokens.append
  
          # Lose the trailing punctuation for casual embedding, like:
          #     The code is at http://mystuff.org/here?  Didn't resolve.
          # or
          #     I found it at http://mystuff.org/there/.  Thanks!
-         assert guts
          while guts and guts[-1] in '.:?!/':
              guts = guts[:-1]
          for piece in guts.split('/'):
--- 1014,1073 ----
  
      def tokenize(self, m):
          proto, guts = m.groups()
+         assert guts
          tokens = ["proto:" + proto]
          pushclue = tokens.append
  
+         if options["Tokenizer", "x-pick_apart_urls"]:
+             url = proto + "://" + guts
+ 
+             escapes = re.findall(r'%..', guts)
+             # roughly how many %nn escapes are there?
+             if escapes:
+                 pushclue("url:%%%d" % int(log2(len(escapes))))
+             # %nn escapes are usually intentional obfuscation.  Generate a
+             # lot of correlated tokens if the URL contains a lot of them.
+             # The classifier will learn which specific ones are and aren't
+             # spammy.
+             tokens.extend(["url:" + escape for escape in escapes])
+ 
+             # now remove any obfuscation and probe around a bit
+             url = urllib.unquote(url)
+             scheme, netloc, path, params, query, frag = urlparse.urlparse(url)
+ 
+             # one common technique in bogus "please (re-)authorize yourself"
+             # scams is to make it appear as if you're visiting a valid
+             # payment-oriented site like PayPal, CitiBank or eBay, when you
+             # actually aren't.  The company's web server appears as the
+             # beginning of an often long username element in the URL such as
+             # http://www.paypal.com%65%43%99%35@10.0.1.1/iwantyourccinfo
+             # generally with an innocuous-looking fragment of text or a
+             # valid URL as the highlighted link.  Usernames should rarely
+             # appear in URLs (perhaps in a local bookmark you established),
+             # and never in a URL you receive from an unsolicited email or
+             # another website.
+             user_pwd, host_port = urllib.splituser(netloc)
+             if user_pwd is not None:
+                 pushclue("url:has user")
+ 
+             host, port = urllib.splitport(host_port)
+             # web servers listening on non-standard ports are suspicious ...
+             if port is not None:
+                 if (scheme == "http" and port != '80' or
+                     scheme == "https" and port != '443'):
+                     pushclue("url:non-standard %s port" % scheme)
+ 
+             # ... as are web servers associated with raw ip addresses
+             if re.match("(\d+\.?){4,4}$", host) is not None:
+                 pushclue("url:ip addr")
+ 
+             # make sure we later tokenize the unobfuscated url bits
+             proto, guts = url.split("://", 1)
+ 
          # Lose the trailing punctuation for casual embedding, like:
          #     The code is at http://mystuff.org/here?  Didn't resolve.
          # or
          #     I found it at http://mystuff.org/there/.  Thanks!
          while guts and guts[-1] in '.:?!/':
              guts = guts[:-1]
          for piece in guts.split('/'):
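
For anyone who wants to try the URL clues outside the tokenizer, here is a
rough standalone sketch of the same idea in Python 3.  It is not the patch
itself: url_clues is a hypothetical helper, urllib.parse stands in for the
Python 2 urlparse/urllib calls, the netloc is split by hand rather than with
splituser/splitport, and the IP address pattern is deliberately stricter
than the one in the patch.  IPv6 literals are ignored.

```python
import math
import re
from urllib.parse import unquote, urlsplit

# Assumption: a dotted-quad pattern that is stricter than the patch's regex.
IP_ADDR = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def url_clues(proto, guts):
    """Return structural clue tokens for a URL given as (proto, guts).

    Hypothetical standalone helper mirroring the patch's logic; it does
    not produce the ordinary per-piece URL tokens.
    """
    clues = []

    # Roughly how many %nn escapes are there (bucketed by log2)?
    escapes = re.findall(r"%..", guts)
    if escapes:
        clues.append("url:%%%d" % int(math.log2(len(escapes))))
    # One token per escape, so the classifier can learn which are spammy.
    clues.extend("url:" + escape for escape in escapes)

    # Remove the obfuscation before looking at the structure.
    url = unquote(proto + "://" + guts)
    parts = urlsplit(url)

    # A user (or user:password) element in front of the host.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    if at:
        clues.append("url:has user")

    # Web servers on non-standard ports.
    host, _, port = hostport.partition(":")
    if port and ((parts.scheme == "http" and port != "80") or
                 (parts.scheme == "https" and port != "443")):
        clues.append("url:non-standard %s port" % parts.scheme)

    # Raw IP address instead of a host name.
    if IP_ADDR.fullmatch(host):
        clues.append("url:ip addr")

    return clues


# Invented sample URL in the style of the scam described in the patch comment.
for clue in url_clues("http", "www.paypal.com%2Elogin@10.0.1.1:8080/verify"):
    print(clue)
```

The sample URL is made up; a plain URL such as www.python.org/doc should
yield no clues at all.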

