[Spambayes] A couple of small tokenizer experiments.

Tim Peters tim.one@comcast.net
Mon Nov 11 23:24:09 2002


[Anthony Baxter, Monday, November 04, 2002 4:51 AM]
> First experiment was to make the URL tokenizer look for the string
> 'mailman' in the URL. If it was found, simply push the clue "url:
> Mailman URL" onto the clue-pile. This was an attempt to remove the
> many many related clues that get bolted onto the occasional spam that
> makes it past Greg to the python.org mailservers. It's something of a
> violation of "stupid beats smart", but I'd noticed that the mailman
> footer from spam via mailman lists was always providing a bunch of
> clues that were making life harder.

Indeed they do.

>
> --- tokenizer.py        1 Nov 2002 16:10:13 -0000       1.60
> +++ tokenizer.py        4 Nov 2002 06:59:37 -0000
> @@ -931,6 +931,11 @@
>          new_text.append(text[i : start])
>          new_text.append(' ')
>
> +        if guts.find('mailman') != -1:
> +            pushclue("url: Mailman URL")
> +            i = end
> +            break

Can you try this again replacing "break" with "continue"?  I can't believe
you intended break here -- it means that the first time we see a Mailman URL
in a msg, we stop looking for embedded URLs period.  Spam could easily
exploit that.
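
Here's a rough sketch of the difference.  It is not the real tokenizer loop:
url_re, crack_urls and the stop_at_mailman flag are all made up for
illustration, and the regexp is far cruder than the real one.

```python
import re

# Hypothetical stand-in for the tokenizer's URL regexp.
url_re = re.compile(r"https?://(\S+)", re.IGNORECASE)

def crack_urls(text, stop_at_mailman=False):
    """Return URL clues; stop_at_mailman=True mimics the posted 'break'."""
    clues = []
    i = 0
    while True:
        m = url_re.search(text, i)
        if m is None:
            break
        guts = m.group(1).lower()
        i = m.end()
        if 'mailman' in guts:
            clues.append("url: Mailman URL")
            if stop_at_mailman:
                break       # every later URL in the msg goes unseen
            continue        # skip this URL's pieces, keep scanning
        # Crude placeholder for the real per-URL tokenization.
        clues.append("url:" + guts.split('/')[0])
    return clues
```

With "break", a spammer only has to put a Mailman-looking URL ahead of the
real payload URL to hide the latter from the classifier.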

>> ham:spam:  11192:1826
>>                   11192:1826

You realize you've got a very high ratio of ham to spam, right?

> ...
> Next I tried tokenizing the To: line.  I parsed it properly, then
> decoded the real name and split the words. I also added a token for
> the RHS and LHS of the email @ sign.

We don't tokenize To: now because it gives good results for bad reasons on
mixed-source corpora.  It would be good to have an option to tokenize it.
It appears that your code also tokenized Cc:; also fine.  I would rather see
the code added to the loop currently cracking "from" lines:

        for field in ('from',):

so that we tokenize all address thingies in a uniform way.  The option would
control the list of field names looped over there (default just from:,
optionally also to: and cc:).
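
Something along these lines is what I have in mind.  This is only a sketch
written against the current standard-library email package:
tokenize_address_headers and the token spellings ("from:name:", "from:lhs:",
"from:rhs:") are invented here, not what the tokenizer actually emits.

```python
import email
from email.utils import getaddresses

def tokenize_address_headers(msg, fields=('from',)):
    """Yield tokens for every address in the named header fields.

    'fields' plays the role of the proposed option: default just from:,
    optionally also to: and cc:.
    """
    for field in fields:
        for name, addr in getaddresses(msg.get_all(field, [])):
            # Words of the real name, if any.
            for word in name.lower().split():
                yield '%s:name:%s' % (field, word)
            # Left- and right-hand sides of the @ sign.
            if '@' in addr:
                lhs, rhs = addr.lower().rsplit('@', 1)
                yield '%s:lhs:%s' % (field, lhs)
                yield '%s:rhs:%s' % (field, rhs)
```

Since every field goes through the same loop body, to: and cc: can't drift
away from how from: is tokenized.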

> ...
> The final test was to decode the Subject header if it's encoded, and
> tokenize that, rather than the encoded form.
>
> --- tokenizer.py        1 Nov 2002 16:10:13 -0000       1.60
> +++ tokenizer.py        4 Nov 2002 09:45:25 -0000
> @@ -1071,6 +1078,10 @@
>          # especially significant in this context.  Experiment showed a small
>          # but real benefit to keeping case intact in this specific context.
>          x = msg.get('subject', '')
> +        # Subject decoding.
> +        x, subjcharset = email.Header.decode_header(x)[0]

Why is this tokenizing only "the first" piece of the Subject line?


> +        if subjcharset is not None:
> +            yield 'subjectcharset:' + subjcharset
>          for w in subject_word_re.findall(x):
>              for t in tokenize_word(w):
>                  yield 'subject:' + t


I changed this to loop over all the Subject parts, and saw some minor good
effects on marginal msgs, so I'll check this one in without further ado.  It
wasn't much of a win for you either, but it's cheap so why not.  In my
personal email "subjectcharset:unknown" shows up a lot for some reason (but
only in spam).
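
For the record, the shape of the loop is roughly as below.  This is a sketch,
not the checked-in code: it is written for today's email.header module (where
decode_header hands back bytes for encoded parts), subject_tokens is a made-up
name, and a plain split() stands in for subject_word_re and tokenize_word.

```python
from email.header import decode_header

def subject_tokens(subject):
    """Yield tokens for every decoded piece of a Subject header."""
    for part, charset in decode_header(subject):
        if charset is not None:
            yield 'subjectcharset:' + charset
        if isinstance(part, bytes):
            try:
                part = part.decode(charset or 'ascii', 'replace')
            except LookupError:
                # Unrecognized charset name: fall back to a lossless decode.
                part = part.decode('latin-1')
        # Placeholder for subject_word_re.findall() + tokenize_word().
        for w in part.split():
            yield 'subject:' + w
```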


> My remaining 6 fns are:
>
> a brazilian spam-ish thing: (*H* 0.633859 *S* 0.20342 = 0.28478)
> ...
> -----------------
> Received: from localhost (localhost.localdomain [127.0.0.1])
>         by localhost.localdomain (8.11.6/8.11.6) with ESMTP id g8RNZhh05864
>         for <anthony@localhost>; Sat, 28 Sep 2002 09:35:44 +1000
> Received: from mail.interlink.com.au [203.9.111.130]
>         by localhost with POP3 (fetchmail-5.9.0)
>         for anthony@localhost (single-drop); Sat, 28 Sep 2002 09:35:44 +1000 (EST)
> Received: from mediterraneo.rjnet.com.br (root@[200.152.115.30])
>         by valdez.interlink.com.au (8.11.6/8.11.2) with ESMTP id g8RNZJc28230
>         for <anthony@interlink.com.au>; Sat, 28 Sep 2002 09:35:20 +1000
> Received: from locutus.rjnet.com.br (root@locutus.rjnet.com.br [200.222.31.10])
>         by mediterraneo.rjnet.com.br (8.11.4/8.11.4) with ESMTP id g8RNNc801901;
>         Fri, 27 Sep 2002 20:23:38 -0300
> Received: from localhost ([200.222.39.21])
>         by locutus.rjnet.com.br (8.11.2/8.11.2) with ESMTP id g8RMqEN00464;
>         Fri, 27 Sep 2002 19:52:14 -0300

> DATA
> -----------------
> I plan to try something like tokenizing the oldest three received
> lines (to hopefully avoid the previous issues with mail.python.org
> blowing numbers to hell) to see if that will help this one.

Did you try that yet?  My slow reply isn't for lack of interest; it's just
that I'm 244 msgs behind on this mailing list alone now <wink/sigh>.
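
In case it helps, picking out the oldest Received lines could be sketched
like so.  Each relay prepends its own header, so the oldest ones are the last
in the message.  The function name, the "received:" token prefix and the
host-ish regexp are all assumptions for illustration, and the regexp will
also pick up dotted junk such as sendmail version numbers.

```python
import email
import re

# Crude, hypothetical pattern for dotted names and IP addresses.
hostish_re = re.compile(r'[a-z0-9-]+(?:\.[a-z0-9-]+)+')

def oldest_received_tokens(msg, count=3):
    """Yield tokens from only the 'count' oldest Received headers."""
    received = msg.get_all('received', [])
    for line in received[-count:]:
        for tok in hostish_re.findall(line.lower()):
            yield 'received:' + tok
```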

> The "iron citadel" python-list spam
> (*H* 0.999999, *S* 0.038123 = 0.01906)

DAMNED good spam!

> A base64d MP3 spam sent via zope-dev
> (*H* 0.993904, *S* 0.187868 = 0.0969820429397)
> which got a bunch of hammy clues from "Subject: [Zope-dev] Re: ofpa" and
> also the various mailman type clues (although that's better with the
> first patch, above)
>
> Someone spamming Linux CDs via a list at 4thought
> (*H* 1, *S* 0.207177 = 0.103588442478)
>
> A short porn spam sent via python-list
> (*H* 0.817004, *S* 0.618399 = 0.400697521022)
>
> A wierd german spam for some sort of expert systems (in english).
> (*H* 0.997132, *S* 0.84965 = 0.426259133645)

It's Weird that you have cutoffs arranged such that a number near .40 isn't
Unsure for you.  That may (or may not) be related to the lopsidedness of
your data (> 6 ham per spam).
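
To spell out the arithmetic: the final numbers you quote are consistent with
(S - H + 1) / 2, e.g. (0.20342 - 0.633859 + 1) / 2 = 0.28478 for the
Brazilian one.  A small sketch of that plus a three-way cutoff follows; the
function names and the cutoff values 0.20 and 0.90 are picked for
illustration only, since your actual settings aren't in your msg.

```python
def combined_score(H, S):
    """Fold the ham and spam indicators into one score in [0, 1]."""
    return (S - H + 1.0) / 2.0

def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Three-way decision; the default cutoffs are illustrative only."""
    if score < ham_cutoff:
        return 'ham'
    if score >= spam_cutoff:
        return 'spam'
    return 'unsure'
```

With cutoffs like those, the spam scoring about .43 would land in Unsure;
it can only count as a false negative if ham_cutoff sits above its score.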



