[Python-Dev] Mining URLs for spam detection

Fri, 30 Aug 2002 15:41:51 -0400

I've gotten interesting results from this gimmick:

import re
# Grab everything after "http://" up to whitespace, '>', a quote, or a
# high-bit character.
url_re = re.compile(r"http://([^\s>'\"\x7f-\xff]+)", re.IGNORECASE)
# Separators used to break each '/'-delimited piece into smaller chunks.
urlfield_re = re.compile(r"[;?:@&=+,$.]")

def tokenize_url(string):
    for url in url_re.findall(string):
        for i, piece in enumerate(url.lower().split('/')):
            prefix = "url%d:" % i
            for chunk in urlfield_re.split(piece):
                yield prefix + chunk
    # ... (and then do other tokenization) ...

So it splits a case-normalized http thingie via /, tags the first piece
"url0:", the second "url1:", and so on.  Within each piece, it splits on
separators, like '=' and '.'.
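To make the tagging concrete, here is a small self-contained sketch that runs the same generator over two URLs that show up later in this message. The surrounding sentences fed to it and the variable names (zip_tokens, ip_tokens, no_url_tokens) are made up for illustration; only the regexps and the generator come from the code above.

```python
import re

url_re = re.compile(r"http://([^\s>'\"\x7f-\xff]+)", re.IGNORECASE)
urlfield_re = re.compile(r"[;?:@&=+,$.]")

def tokenize_url(string):
    for url in url_re.findall(string):
        for i, piece in enumerate(url.lower().split('/')):
            prefix = "url%d:" % i
            for chunk in urlfield_re.split(piece):
                yield prefix + chunk

# Illustrative inputs: the URLs are from this message, the wrapper text is not.
zip_tokens = list(tokenize_url(
    "see http://w1.132.telia.com/~u13208596/temp/py15-980706.zip"))
ip_tokens = list(tokenize_url("Sweet XXX! http://207.240.225.250/"))
no_url_tokens = list(tokenize_url("plain text, no links here"))

print(zip_tokens)
# ['url0:w1', 'url0:132', 'url0:telia', 'url0:com',
#  'url1:~u13208596', 'url2:temp', 'url3:py15-980706', 'url3:zip']
print(ip_tokens)
# ['url0:207', 'url0:240', 'url0:225', 'url0:250', 'url1:']
print(no_url_tokens)
# []
```

Note that a trailing '/' yields an empty final piece, hence the bare "url1:" token, and that a message with no http thingie yields no tokens at all.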

Two particular tokens generated this way then made it into the list of 15
words that most often survived to the end of the scoring step:

    url0:python   as a strong non-spam indicator
    url1:remove   as a strong spam indicator

The rest of the tokenization was unchanged, still doing MIME-ignorant
splitting on whitespace.  Just the http gimmick was added, and that alone
cut the false negative rate in half.  IOW, there's a *lot* of valuable info
in the http thingies!  Not being a Web Guy, I'm not sure how to extract the
most info from it.  If you've got suggestions for a better URL tagging
strategy, I'd love to hear them.

Cute:  If I tokenize *only* the http thingies, ignoring all other parts of
the text, the false positive rate is about 1%.  This is because most legit
msgs don't have any http thingies, so they get classified correctly as ham
(no tokens at all are generated for them).   This caught at least one spam
in the ham corpus (a bogus "false positive"):

Data/Ham/Set2/8695.txt
prob = 0.999997392672
prob('url0:240') = 0.2
prob('url1:') = 0.612567
prob('url0:250') = 0.99
prob('url0:225') = 0.99
prob('url0:207') = 0.99
Sweet XXX!

http://207.240.225.250/
II33bp-]
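
This message doesn't spell out how the per-token probabilities get combined into the overall score. As a sketch only, assuming the Graham-style rule (product of the p's divided by that product plus the product of the (1-p)'s), the five numbers listed above do reproduce the reported figure. The function name combine and the dict token_probs are hypothetical, introduced just for this illustration.

```python
def combine(probs):
    """Assumed Graham-style combining: P / (P + Q), where P is the
    product of the p's and Q is the product of the (1 - p)'s."""
    p_spam = 1.0
    p_ham = 1.0
    for p in probs:
        p_spam *= p
        p_ham *= 1.0 - p
    return p_spam / (p_spam + p_ham)

# Per-token probabilities as listed for Data/Ham/Set2/8695.txt.
token_probs = {
    "url0:240": 0.2,
    "url1:": 0.612567,
    "url0:250": 0.99,
    "url0:225": 0.99,
    "url0:207": 0.99,
}

combined = combine(token_probs.values())
print(combined)  # agrees with the reported 0.999997392672 to about 1e-9
```

Three tokens at 0.99 swamp everything else: the lone 0.2 for "url0:240" barely dents the result.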

An example of a real false positive was due to /F including this URL:

    http://w1.132.telia.com/~u13208596/temp/py15-980706.zip

Oddly enough,

    prob('url0:132') = 0.99
    prob('url0:telia') = 0.99

so there was significant spam with "132" and "telia" in the first field of
an http thingie.

The false negative rate when tokenizing only http thingies zoomed to over
30%.  Curiously, the best way for a spam to evade this check is *not* to
disguise itself with numeric IPs.  Numbers end up looking suspicious.  But,
e.g., this looks neutral:

    http://shocking-incest.com

    prob('url0:com') = 0.658328

and it never saw "shocking-incest" before.