I've gotten interesting results from this gimmick:
    import re

    url_re = re.compile(r"http://([^\s>'\"\x7f-\xff]+)", re.IGNORECASE)
    urlfield_re = re.compile(r"[;?:@&=+,$.]")

    def tokenize_url(string):
        # Find each http URL; the capture group is everything after "http://".
        for url in url_re.findall(string):
            # Case-normalize, then tag each '/'-separated piece by position.
            for i, piece in enumerate(url.lower().split('/')):
                prefix = "url%d:" % i
                # Split each piece on URL separators such as '=' and '.'.
                for chunk in urlfield_re.split(piece):
                    yield prefix + chunk

    ... (and then do other tokenization) ...
So it splits a case-normalized http thingie via /, tags the first piece "url0:", the second "url1:", and so on. Within each piece, it splits on separators, like '=' and '.'.
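To make the tagging concrete, here's a small worked example. The regexps and tokenize_url are repeated so the snippet runs on its own; the sample string is made up for illustration and isn't from either corpus.

```python
import re

url_re = re.compile(r"http://([^\s>'\"\x7f-\xff]+)", re.IGNORECASE)
urlfield_re = re.compile(r"[;?:@&=+,$.]")

def tokenize_url(string):
    for url in url_re.findall(string):
        for i, piece in enumerate(url.lower().split('/')):
            prefix = "url%d:" % i
            for chunk in urlfield_re.split(piece):
                yield prefix + chunk

# Hypothetical sample text, invented for this illustration.
sample = "See HTTP://www.Python.org/doc/index.html?x=1 for details"
tokens = list(tokenize_url(sample))
print(tokens)
# ['url0:www', 'url0:python', 'url0:org', 'url1:doc',
#  'url2:index', 'url2:html', 'url2:x', 'url2:1']
```

Note that the same word gets a different token depending on which '/'-separated piece it shows up in, so "remove" in the host name and "remove" in the first path component are scored separately.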
Two particular tokens generated this way then made it into the list of 15 words that most often survived to the end of the scoring step:
    url0:python    as a strong non-spam indicator
    url1:remove    as a strong spam indicator
The rest of the tokenization was unchanged, still doing MIME-ignorant splitting on whitespace. Just the http gimmick was added, and that alone cut the false negative rate in half. IOW, there's a *lot* of valuable info in the http thingies! Not being a Web Guy, I'm not sure how to extract the most info from it. If you've got suggestions for a better URL tagging strategy, I'd love to hear them.
Cute: If I tokenize *only* the http thingies, ignoring all other parts of the text, the false positive rate is about 1%. This is because most legit msgs don't have any http thingies, so they get classified correctly as ham (no tokens at all are generated for them). This caught at least one spam in the ham corpus (a bogus "false positive"):
    Data/Ham/Set2/8695.txt
    prob = 0.999997392672
    prob('url0:240') = 0.2
    prob('url1:') = 0.612567
    prob('url0:250') = 0.99
    prob('url0:225') = 0.99
    prob('url0:207') = 0.99
    Sweet XXX!
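For anyone wondering where tokens like 'url0:207' and the empty-looking 'url1:' come from, here's a sketch using an invented numeric-IP URL (an assumption for illustration, not the URL from that message). Each dotted number becomes its own url0 token, and a trailing slash leaves an empty second piece, which shows up as a bare 'url1:'. A message with no http thingie yields no tokens at all.

```python
import re

url_re = re.compile(r"http://([^\s>'\"\x7f-\xff]+)", re.IGNORECASE)
urlfield_re = re.compile(r"[;?:@&=+,$.]")

def tokenize_url(string):
    for url in url_re.findall(string):
        for i, piece in enumerate(url.lower().split('/')):
            prefix = "url%d:" % i
            for chunk in urlfield_re.split(piece):
                yield prefix + chunk

# Hypothetical numeric-IP link with a trailing slash.
ip_tokens = list(tokenize_url('<a href="http://10.20.30.40/">click</a>'))
print(ip_tokens)
# ['url0:10', 'url0:20', 'url0:30', 'url0:40', 'url1:']

# Hypothetical message with no URL: nothing is generated.
no_url_tokens = list(tokenize_url("Just plain text, no links here."))
print(no_url_tokens)
# []
```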
An example of a real false positive was due to /F including this URL:
    prob('url0:132') = 0.99
    prob('url0:telia') = 0.99
so there was significant spam with "132" and "telia" in the first field of an http thingie.
The false negative rate when tokenizing only http thingies zoomed to over 30%. Curiously, the best way for a spam to evade this check is *not* to disguise itself with numeric IPs. Numbers end up looking suspicious. But, e.g., this looks neutral:
prob('url0:com') = 0.658328
and it never saw "shocking-incest" before.