[spambayes-dev] A URL experiment
Skip Montanaro
skip at pobox.com
Fri Jan 2 08:58:22 EST 2004
Happy New Year everyone...
As Tim predicted, mixing his URL-cracking ideas with mine performs better
than either set of ideas in isolation. Using the attached patch, I get this
summary output for a 10x10 timcv run:
stds.txt -> pickurlss.txt
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 126 spams against 1080 hams & 1142 spams
-> <stat> tested 120 hams & 126 spams against 1080 hams & 1142 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 126 spams against 1080 hams & 1142 spams
-> <stat> tested 120 hams & 126 spams against 1080 hams & 1142 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
1.667 1.667 tied
0.833 0.833 tied
0.833 0.833 tied
0.000 0.000 tied
0.000 0.000 tied
0.833 0.833 tied
won 0 times
tied 10 times
lost 0 times
total unique fp went from 5 to 5 tied
mean fp % went from 0.416666666667 to 0.416666666667 tied
false negative percentages
7.874 7.874 tied
6.299 6.299 tied
9.449 9.449 tied
9.449 9.449 tied
10.236 10.236 tied
5.512 5.512 tied
7.087 6.299 won -11.12%
5.556 5.556 tied
7.937 7.937 tied
8.661 8.661 tied
won 1 times
tied 9 times
lost 0 times
total unique fn went from 99 to 98 won -1.01%
mean fn % went from 7.80589926259 to 7.72715910511 won -1.01%
ham mean                     ham sdev
   2.11    2.12   +0.47%       12.36   12.36   +0.00%
   3.28    3.33   +1.52%       14.07   14.13   +0.43%
   1.11    1.13   +1.80%        6.75    6.86   +1.63%
   1.13    1.12   -0.88%        5.90    5.86   -0.68%
   3.44    3.43   -0.29%       14.07   14.06   -0.07%
   3.66    3.65   -0.27%       15.31   15.30   -0.07%
   3.68    3.67   -0.27%       13.65   13.62   -0.22%
   1.10    1.10   +0.00%        6.93    6.93   +0.00%
   1.70    1.78   +4.71%        8.80    9.02   +2.50%
   3.49    3.49   +0.00%       14.57   14.58   +0.07%
ham mean and sdev for all runs
   2.47    2.48   +0.40%       11.83   11.85   +0.17%
spam mean                    spam sdev
  84.79   84.96   +0.20%       29.71   29.56   -0.50%
  88.72   88.85   +0.15%       26.91   26.91   +0.00%
  83.53   83.99   +0.55%       30.40   30.26   -0.46%
  85.69   85.97   +0.33%       29.57   29.60   +0.10%
  84.47   84.59   +0.14%       30.42   30.45   +0.10%
  89.08   89.25   +0.19%       24.73   24.56   -0.69%
  87.08   87.73   +0.75%       27.80   27.05   -2.70%
  88.44   88.48   +0.05%       25.70   25.67   -0.12%
  87.20   87.23   +0.03%       28.53   28.54   +0.04%
  86.46   86.47   +0.01%       27.85   27.88   +0.11%
spam mean and sdev for all runs
  86.54   86.75   +0.24%       28.28   28.17   -0.39%
ham/spam mean difference: 84.07 84.27 +0.20
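I also ran with bigrams enabled. That helped much more on false negatives
(99 down to 64 unique), at the cost of one extra false positive (5 up to 6):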
stds.txt -> pickbis.txt
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 126 spams against 1080 hams & 1142 spams
-> <stat> tested 120 hams & 126 spams against 1080 hams & 1142 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
-> <stat> tested 120 hams & 126 spams against 1080 hams & 1142 spams
-> <stat> tested 120 hams & 126 spams against 1080 hams & 1142 spams
-> <stat> tested 120 hams & 127 spams against 1080 hams & 1141 spams
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
1.667 1.667 tied
0.833 0.833 tied
0.833 0.833 tied
0.000 0.833 lost +(was 0)
0.000 0.000 tied
0.833 0.833 tied
won 0 times
tied 9 times
lost 1 times
total unique fp went from 5 to 6 lost +20.00%
mean fp % went from 0.416666666667 to 0.5 lost +20.00%
false negative percentages
7.874 6.299 won -20.00%
6.299 4.724 won -25.00%
9.449 6.299 won -33.34%
9.449 5.512 won -41.67%
10.236 4.724 won -53.85%
5.512 1.575 won -71.43%
7.087 5.512 won -22.22%
5.556 5.556 tied
7.937 7.937 tied
8.661 2.362 won -72.73%
won 8 times
tied 2 times
lost 0 times
total unique fn went from 99 to 64 won -35.35%
mean fn % went from 7.80589926259 to 5.04999375078 won -35.31%
ham mean                     ham sdev
   2.11    1.61  -23.70%       12.36   10.88  -11.97%
   3.28    2.85  -13.11%       14.07   12.69   -9.81%
   1.11    1.05   -5.41%        6.75    6.13   -9.19%
   1.13    1.00  -11.50%        5.90    4.72  -20.00%
   3.44    3.19   -7.27%       14.07   14.75   +4.83%
   3.66    3.45   -5.74%       15.31   15.27   -0.26%
   3.68    2.67  -27.45%       13.65   11.70  -14.29%
   1.10    1.85  +68.18%        6.93   10.11  +45.89%
   1.70    1.93  +13.53%        8.80    9.23   +4.89%
   3.49    3.31   -5.16%       14.57   14.97   +2.75%
ham mean and sdev for all runs
   2.47    2.29   -7.29%       11.83   11.60   -1.94%
spam mean                    spam sdev
  84.79   86.82   +2.39%       29.71   27.17   -8.55%
  88.72   90.26   +1.74%       26.91   24.30   -9.70%
  83.53   87.45   +4.69%       30.40   26.76  -11.97%
  85.69   88.25   +2.99%       29.57   27.35   -7.51%
  84.47   88.02   +4.20%       30.42   25.64  -15.71%
  89.08   92.22   +3.52%       24.73   21.06  -14.84%
  87.08   91.45   +5.02%       27.80   23.48  -15.54%
  88.44   89.02   +0.66%       25.70   26.08   +1.48%
  87.20   87.78   +0.67%       28.53   28.58   +0.18%
  86.46   90.65   +4.85%       27.85   23.02  -17.34%
spam mean and sdev for all runs
  86.54   89.19   +3.06%       28.28   25.50   -9.83%
ham/spam mean difference: 84.07 86.90 +2.83
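
For anyone checking the arithmetic: the "won"/"lost" percentages appear to be
plain relative changes against the baseline (left-hand) column, and the
ham/spam mean difference is just the spam mean minus the ham mean. A minimal
sketch, where pct_change is a hypothetical helper name and not part of the
test tools:

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent (hypothetical helper)."""
    return (after - before) / before * 100.0

# Bigram run: total unique false negatives went from 99 to 64.
print("%+.2f%%" % pct_change(99, 64))

# Bigram run: mean fn % went from 7.80589926259 to 5.04999375078.
print("%+.2f%%" % pct_change(7.80589926259, 5.04999375078))

# Bigram run: separation between the overall spam mean and ham mean.
print("%.2f" % (89.19 - 2.29))
```

The same helper reproduces the -1.01% figure for the URL-only run (99 to 98).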
Skip
-------------- next part --------------
Index: spambayes/Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v
retrieving revision 1.97
diff -c -r1.97 Options.py
*** spambayes/Options.py 30 Dec 2003 16:26:33 -0000 1.97
--- spambayes/Options.py 2 Jan 2004 13:57:56 -0000
***************
*** 145,150 ****
--- 145,155 ----
       """(DEPRECATED) Extract day of the week tokens from the Date: header.""",
       BOOLEAN, RESTORE),
+     ("x-pick_apart_urls", "Extract clues about url structure", False,
+      """(EXPERIMENTAL) Note whether url contains non-standard port or
+      user/password elements.""",
+      BOOLEAN, RESTORE),
+
      ("replace_nonascii_chars", "Replace non-ascii characters", False,
       """If true, replace high-bit characters (ord(c) >= 128) and control
       characters with question marks. This allows non-ASCII character
Index: spambayes/tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.27
diff -c -r1.27 tokenizer.py
*** spambayes/tokenizer.py 30 Dec 2003 16:26:33 -0000 1.27
--- spambayes/tokenizer.py 2 Jan 2004 13:57:56 -0000
***************
*** 13,18 ****
--- 13,20 ----
  import time
  import os
  import binascii
+ import urlparse
+ import urllib
  try:
      from sets import Set
  except ImportError:
***************
*** 1012,1025 ****
      def tokenize(self, m):
          proto, guts = m.groups()
          tokens = ["proto:" + proto]
          pushclue = tokens.append
          # Lose the trailing punctuation for casual embedding, like:
          #     The code is at http://mystuff.org/here? Didn't resolve.
          # or
          #     I found it at http://mystuff.org/there/. Thanks!
-         assert guts
          while guts and guts[-1] in '.:?!/':
              guts = guts[:-1]
          for piece in guts.split('/'):
--- 1014,1073 ----
      def tokenize(self, m):
          proto, guts = m.groups()
+         assert guts
          tokens = ["proto:" + proto]
          pushclue = tokens.append
+         if options["Tokenizer", "x-pick_apart_urls"]:
+             url = proto + "://" + guts
+
+             escapes = re.findall(r'%..', guts)
+             # roughly how many %nn escapes are there?
+             if escapes:
+                 pushclue("url:%%%d" % int(log2(len(escapes))))
+             # %nn escapes are usually intentional obfuscation. Generate a
+             # lot of correlated tokens if the URL contains a lot of them.
+             # The classifier will learn which specific ones are and aren't
+             # spammy.
+             tokens.extend(["url:" + escape for escape in escapes])
+
+             # now remove any obfuscation and probe around a bit
+             url = urllib.unquote(url)
+             scheme, netloc, path, params, query, frag = urlparse.urlparse(url)
+
+             # one common technique in bogus "please (re-)authorize yourself"
+             # scams is to make it appear as if you're visiting a valid
+             # payment-oriented site like PayPal, CitiBank or eBay, when you
+             # actually aren't. The company's web server appears as the
+             # beginning of an often long username element in the URL such as
+             #     http://www.paypal.com%65%43%99%35@10.0.1.1/iwantyourccinfo
+             # generally with an innocuous-looking fragment of text or a
+             # valid URL as the highlighted link. Usernames should rarely
+             # appear in URLs (perhaps in a local bookmark you established),
+             # and never in a URL you receive from an unsolicited email or
+             # another website.
+             user_pwd, host_port = urllib.splituser(netloc)
+             if user_pwd is not None:
+                 pushclue("url:has user")
+
+             host, port = urllib.splitport(host_port)
+             # web servers listening on non-standard ports are suspicious ...
+             if port is not None:
+                 if (scheme == "http" and port != '80' or
+                     scheme == "https" and port != '443'):
+                     pushclue("url:non-standard %s port" % scheme)
+
+             # ... as are web servers associated with raw ip addresses
+             if re.match("(\d+\.?){4,4}$", host) is not None:
+                 pushclue("url:ip addr")
+
+             # make sure we later tokenize the unobfuscated url bits
+             proto, guts = url.split("://", 1)
+
          # Lose the trailing punctuation for casual embedding, like:
          #     The code is at http://mystuff.org/here? Didn't resolve.
          # or
          #     I found it at http://mystuff.org/there/. Thanks!
          while guts and guts[-1] in '.:?!/':
              guts = guts[:-1]
          for piece in guts.split('/'):
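
To try the clue extraction outside the tokenizer, the logic in the patch can
be approximated as a standalone function. This is only a rough sketch under
several assumptions, not the patch itself: it targets Python 3
(urllib.parse instead of the urlparse/urllib modules used above), the name
url_clues is made up for illustration, the escape pattern is narrowed to hex
digits, and the dotted-quad pattern is deliberately stricter than the one in
the patch (which would also accept a bare run of four digits).

```python
import math
import re
from urllib.parse import unquote, urlsplit

# Default ports per scheme; anything else on these schemes counts as suspicious.
DEFAULT_PORTS = {"http": 80, "https": 443}
# Assumption: a stricter dotted-quad test than the pattern used in the patch.
IP_RE = re.compile(r"(\d{1,3}\.){3}\d{1,3}$")


def url_clues(url):
    """Return URL-structure clue tokens for `url` (hypothetical helper)."""
    clues = []

    # Roughly how many %nn escapes are there, and which ones?
    escapes = re.findall(r"%[0-9A-Fa-f]{2}", url)
    if escapes:
        clues.append("url:%%%d" % int(math.log2(len(escapes))))
        clues.extend("url:" + escape for escape in escapes)

    # Remove the obfuscation, then look at the URL's structure.
    try:
        parts = urlsplit(unquote(url))
        port = parts.port
    except ValueError:
        # Malformed netloc or port: keep whatever clues we already have.
        return clues

    if parts.username is not None:
        clues.append("url:has user")

    default = DEFAULT_PORTS.get(parts.scheme)
    if port is not None and default is not None and port != default:
        clues.append("url:non-standard %s port" % parts.scheme)

    if IP_RE.match(parts.hostname or ""):
        clues.append("url:ip addr")

    return clues


print(url_clues("http://www.paypal.com%2Esecure@10.0.1.1:8080/login"))
print(url_clues("http://mystuff.org/here"))
```

The first call exercises every branch (one escape, a username, a non-default
port and a raw IP address); the second is an ordinary URL and should yield no
clues at all.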