[spambayes-dev] A URL experiment
Skip Montanaro
skip at pobox.com
Tue Dec 30 16:45:55 EST 2003
Tim> Over on the spambayes list yesterday, we were discussing a
Tim> particularly good identity-theft scam spam, purporting to be from
Tim> PayPal. It linked extensively to PayPal's real site, and about the
Tim> only fishy lexical thing was a highly obfuscated href (full of %
Tim> escapes).
Tim> We don't do anything special with % escapes in URLs now. Maybe we
Tim> should. The attached patch does.
I tried a somewhat different approach (patch attached) and got similar
results: all ties at the coarse level (false positives and false negatives),
a slight increase in spam mean, a slight decrease in spam sdev, and no change
to ham at all (*):
stds.txt -> pickurlss.txt
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
false positive percentages
0.000 0.000 tied
0.400 0.400 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
won 0 times
tied 5 times
lost 0 times
total unique fp went from 1 to 1 tied
mean fp % went from 0.08 to 0.08 tied
false negative percentages
3.333 3.333 tied
5.000 5.000 tied
7.333 7.333 tied
5.667 5.667 tied
4.000 4.000 tied
won 0 times
tied 5 times
lost 0 times
total unique fn went from 76 to 76 tied
mean fn % went from 5.06666666667 to 5.06666666667 tied
ham mean                     ham sdev
   1.64    1.64   +0.00%        8.44    8.44   +0.00%
   0.99    0.99   +0.00%        8.29    8.29   +0.00%
   2.82    2.82   +0.00%       12.52   12.52   +0.00%
   1.58    1.58   +0.00%        8.29    8.29   +0.00%
   1.30    1.30   +0.00%        8.04    8.04   +0.00%

ham mean and sdev for all runs
   1.66    1.66   +0.00%        9.30    9.30   +0.00%

spam mean                    spam sdev
  93.80   93.82   +0.02%       19.39   19.35   -0.21%
  90.56   90.58   +0.02%       24.31   24.26   -0.21%
  89.24   89.27   +0.03%       27.03   27.04   +0.04%
  89.27   89.27   +0.00%       25.51   25.50   -0.04%
  92.72   92.74   +0.02%       21.67   21.67   +0.00%

spam mean and sdev for all runs
  91.12   91.14   +0.02%       23.81   23.80   -0.04%

ham/spam mean difference: 89.46 89.48 +0.02
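
As a sanity check on the summary lines, the per-run false negative percentages can be turned back into message counts. The sketch below assumes, as the "tested 250 hams & 300 spams" lines indicate, that each of the five runs scored 300 spam and 250 ham. The variable names are illustrative only and are not part of the test harness.

```python
# Per-run false negative percentages, copied from the output above.
fn_pcts = [3.333, 5.000, 7.333, 5.667, 4.000]
SPAM_PER_RUN = 300  # from "tested 250 hams & 300 spams"
HAM_PER_RUN = 250

# Convert each percentage back to a whole number of missed spams.
fn_counts = [round(p * SPAM_PER_RUN / 100) for p in fn_pcts]
total_fn = sum(fn_counts)
mean_fn_pct = 100 * total_fn / (len(fn_pcts) * SPAM_PER_RUN)

# A single false positive across all ham scored in the five runs.
mean_fp_pct = 100 * 1 / (len(fn_pcts) * HAM_PER_RUN)

print(fn_counts, total_fn)
print(mean_fn_pct, mean_fp_pct)
```

The counts sum to 76, which agrees with "total unique fn", and the two means agree with the reported 5.0667 and 0.08.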
(*) Operational question: Given that my training data is somewhat small at
the moment (roughly 1000-1500 each of ham and spam), would I be better off
testing with fewer, larger sets (e.g., 5 sets w/ 250 msgs each) or with more,
smaller sets (e.g., 10 sets w/ 125 msgs each)?
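
To make the tradeoff in that question concrete, here is a rough sketch of the arithmetic. It does not answer the question. It assumes 1250 ham in total (consistent with testing 250 against 1000 above) and that each run tests on one set and trains on the rest. fold_sizes is a hypothetical helper, not a spambayes function.

```python
def fold_sizes(total, n_sets):
    """Return (train, test) message counts for one run of n-set testing."""
    test = total // n_sets
    train = total - test
    return train, test

for n_sets in (5, 10):
    ham_train, ham_test = fold_sizes(1250, n_sets)
    # Smallest nonzero error rate a single run can report on ham.
    step = 100 / ham_test
    print(n_sets, ham_train, ham_test, round(step, 3))
```

Under these assumptions, 10 sets train on 1125 ham per run instead of 1000, but a single misclassified ham then moves a run's fp rate by 0.8% rather than 0.4%.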
Skip
-------------- next part --------------
Index: spambayes/Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v
retrieving revision 1.97
diff -c -r1.97 Options.py
*** spambayes/Options.py 30 Dec 2003 16:26:33 -0000 1.97
--- spambayes/Options.py 30 Dec 2003 21:42:48 -0000
***************
*** 145,150 ****
--- 145,155 ----
"""(DEPRECATED) Extract day of the week tokens from the Date: header.""",
BOOLEAN, RESTORE),
+ ("x-pick_apart_urls", "Extract clues about url structure", False,
+ """(EXPERIMENTAL) Note whether url contains non-standard port or
+ user/password elements.""",
+ BOOLEAN, RESTORE),
+
("replace_nonascii_chars", "Replace non-ascii characters", False,
"""If true, replace high-bit characters (ord(c) >= 128) and control
characters with question marks. This allows non-ASCII character
Index: spambayes/tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.27
diff -c -r1.27 tokenizer.py
*** spambayes/tokenizer.py 30 Dec 2003 16:26:33 -0000 1.27
--- spambayes/tokenizer.py 30 Dec 2003 21:42:48 -0000
***************
*** 13,18 ****
--- 13,20 ----
  import time
  import os
  import binascii
+ import urlparse
+ import urllib
  try:
      from sets import Set
  except ImportError:
***************
*** 1014,1019 ****
--- 1016,1038 ----
          proto, guts = m.groups()
          tokens = ["proto:" + proto]
          pushclue = tokens.append
+
+         if options["Tokenizer", "x-pick_apart_urls"]:
+             url = proto + "://" + guts
+             num_pcs = url.count("%")
+             if num_pcs:
+                 pushclue("url:%d %%s" % num_pcs)
+             url = urllib.unquote(url)
+             scheme, netloc, path, params, query, frag = urlparse.urlparse(url)
+             user_pwd, host_port = urllib.splituser(netloc)
+             if user_pwd is not None:
+                 pushclue("url:has user")
+             host, port = urllib.splitport(host_port)
+             if port is not None:
+                 if scheme == "http" and port != '80':
+                     pushclue("url:non-standard http port")
+                 elif scheme == "https" and port != '443':
+                     pushclue("url:non-standard https port")

          # Lose the trailing punctuation for casual embedding, like:
          #     The code is at http://mystuff.org/here?  Didn't resolve.
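
For anyone who wants to try the idea outside the tokenizer, here is a minimal standalone sketch of the same clue extraction. It is an approximation, not the patch itself: it assumes Python 3, where urlparse and the urllib helpers used above live in urllib.parse, and it reads the user and port from the urlsplit() result instead of calling splituser() and splitport(). The function name pick_apart_url and the DEFAULT_PORTS table are hypothetical.

```python
from urllib.parse import unquote, urlsplit

# Default ports for the two schemes the patch checks.
DEFAULT_PORTS = {"http": 80, "https": 443}

def pick_apart_url(proto, guts):
    """Return structural clue tokens for the URL proto://guts."""
    clues = []
    url = proto + "://" + guts
    # Count % characters before unquoting, as the patch does.
    num_pcs = url.count("%")
    if num_pcs:
        clues.append("url:%d %%s" % num_pcs)
    url = unquote(url)
    try:
        parts = urlsplit(url)
    except ValueError:
        # Malformed netloc; keep whatever clues we already have.
        return clues
    if parts.username is not None:
        clues.append("url:has user")
    try:
        port = parts.port
    except ValueError:
        # Non-numeric or out-of-range port; treat as absent.
        port = None
    expected = DEFAULT_PORTS.get(parts.scheme)
    if port is not None and expected is not None and port != expected:
        clues.append("url:non-standard %s port" % parts.scheme)
    return clues

print(pick_apart_url("http", "www.paypal.com%40example.invalid:8080/login"))
# prints ['url:1 %s', 'url:has user', 'url:non-standard http port']
```

The example URL is made up. It hides an "@" behind a % escape so that the text before it looks like the host name, which is the kind of obfuscation discussed at the top of this message.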