[Spambayes] A couple of small tokenizer experiments.
Anthony Baxter
anthony@interlink.com.au
Tue Nov 12 01:36:28 2002
>>> Tim Peters
> Can you try this again replacing "break" with "continue"? I can't believe
> you intended break here -- it means that the first time we see a Mailman URL
> in a msg, we stop looking for embedded URLs period. Spam could easily
> exploit that.
Woopsie. I knew that :)
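
To spell out the difference Tim is pointing at, here is a minimal sketch rather than the actual tokenizer code. The two regexes, the function name and the "url:" token prefix are all made up for illustration; only the break-versus-continue behaviour is the point.

```python
import re

# Hypothetical patterns, for illustration only; the real tokenizer's
# URL handling and its test for "this is a Mailman URL" may differ.
URL = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)
MAILMAN_URL = re.compile(r"/mailman/(listinfo|options)/", re.IGNORECASE)


def embedded_url_tokens(text):
    """Yield one token per embedded URL, skipping Mailman URLs."""
    for match in URL.finditer(text):
        url = match.group(0)
        if MAILMAN_URL.search(url):
            # 'continue' skips just this URL and keeps scanning.
            # A 'break' here would end the loop, so every URL after
            # the first Mailman one would go untokenized.
            continue
        yield "url:" + url.lower()
```

With 'continue', a spammy URL that follows a Mailman footer link still produces a token; with 'break' it would be silently dropped.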
> >> ham:spam: 11192:1826
> >> 11192:1826
>
> You realize you've got a very high ratio of ham to spam, right?
*nod* It's my full personal test corpus. There's another 600 spam
that haven't been dropped in. I'm re-running tests at the moment
with smaller amounts.
> We don't tokenize To: now because it gives good results for bad reasons on
> mixed-source corpora. It would be good to have an option to tokenize it.
> It appears that your code also tokenized Cc:; also fine. I would rather see
> the code added to the loop currently cracking "from" lines:
I've done this now, and am testing it before checking it in.
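
Roughly what "add it to the loop cracking from lines" could look like, as a sketch under assumptions: the function name, the 'fields' parameter standing in for the option, and the token format are hypothetical, not what is in the tokenizer today.

```python
from email import message_from_string
from email.utils import getaddresses


def address_header_tokens(msg, fields=("from",)):
    """Yield name/address tokens for each address header in 'fields'.

    'fields' stands in for the option discussed above: pass
    ("from", "to", "cc") to tokenize To: and Cc: as well.
    The token format is an assumption made for this sketch.
    """
    for field in fields:
        # get_all returns every occurrence of the header, or [] if absent.
        values = msg.get_all(field, [])
        for name, addr in getaddresses(values):
            for word in name.lower().split():
                yield "%s:name:%s" % (field, word)
            if addr:
                yield "%s:addr:%s" % (field, addr.lower())
```

Leaving To: and Cc: out of the default keeps the current behaviour for mixed-source corpora, where those headers give good results for bad reasons.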
> Why is this tokenizing only "the first" piece of the Subject line?
Thinko.
> I changed this to loop over all the Subject parts, and saw some minor good
> effects on marginal msgs, so I'll check this one in without further ado. It
> wasn't much of a win for you either, but it's cheap so why not. In my
> personal email "subjectcharset:unknown" shows up a lot for some reason (but
> only in spam).
Hm. Dunno about that - Barry might know under what circumstances
the email package gives 'unknown' as a charset. I can't see how that
could happen.
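
For reference, looping over every Subject part rather than just the first might look something like the sketch below. It uses email.header.decode_header; the function name, the "subject:" prefix and the latin-1 fallback are assumptions of mine, and only the "subjectcharset:" prefix comes from the discussion above. It does not explain where 'unknown' comes from, it just shows that whatever charset label a part carries ends up as a token.

```python
from email.header import decode_header


def subject_tokens(raw_subject):
    """Yield charset and word tokens for every part of a Subject header."""
    for part, charset in decode_header(raw_subject):
        if charset is not None:
            yield "subjectcharset:" + charset.lower()
        if isinstance(part, bytes):
            try:
                part = part.decode(charset or "ascii", "replace")
            except LookupError:
                # The charset label is not a codec Python knows about;
                # fall back to latin-1, which can decode any byte string.
                part = part.decode("latin-1")
        for word in part.lower().split():
            yield "subject:" + word
```

Emitting the charset token before decoding means a bogus or unrecognised label still shows up as a clue even when the text itself has to go through the fallback.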
> > I plan to try something like tokenizing the oldest three received
> > lines (to hopefully avoid the previous issues with mail.python.org
> > blowing numbers to hell) to see if that will help this one.
> Did you try that yet? I'm not replying in a timely fashion, but not because
> I'm not interested; it's just because I'm 244 msgs behind on this mailing
> list alone now <wink/sigh>.
Not yet, no. It's on the stack.
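
The idea, sketched under assumptions (nothing like this is checked in; the function name, the word regex and the "received:" prefix are invented here): each relay prepends its own Received: line, so the oldest hops are the last ones in the header block, and slicing from the end skips the lines added by the final list server.

```python
import re
from email import message_from_string

# Hypothetical word pattern: host-name-ish runs of letters, digits, dots, hyphens.
WORD = re.compile(r"[a-z0-9][a-z0-9.\-]*")


def oldest_received_tokens(msg, count=3):
    """Yield tokens from the 'count' oldest Received: lines only."""
    received = msg.get_all("received", [])
    # Headers come back in message order, newest hop first,
    # so the oldest hops are at the end of the list.
    for line in received[-count:]:
        for word in WORD.findall(line.lower()):
            yield "received:" + word
```

Whether three is the right number is untested; a message that travelled fewer than three hops before reaching the list would still leak the list server's line.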
> > A base64d MP3 spam sent via zope-dev
> > (*H* 0.993904, *S* 0.187868 = 0.0969820429397)
> > which got a bunch of hammy clues from "Subject: [Zope-dev] Re: ofpa" and
> > also the various mailman type clues (although that's better with the
> > first patch, above)
I'm going to try a patch to try and strip out mailing list [titles] at
some point, too.
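
Something along these lines, as a first guess only (the regex, the 40-character cap and the function name are assumptions, and this has not been run against any corpus):

```python
import re

# Hypothetical pattern: a short bracketed tag such as "[Zope-dev]",
# plus any whitespace that follows it.
LIST_TAG = re.compile(r"\[[^\[\]]{1,40}\]\s*")


def strip_list_tags(subject):
    """Remove bracketed mailing-list tags from a Subject line."""
    return LIST_TAG.sub("", subject).strip()
```

This removes any short bracketed text, not only list titles, so it would need checking that it doesn't throw away useful spam clues along with the hammy ones.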
Anthony