[Spambayes] RE: About my Anti Phishing suggestions for Spambayes

Tony Meyer tameyer at ihug.co.nz
Wed Jan 19 02:02:04 CET 2005


> I am sorry that I had kind of "bugged" you when I did
> not get an immediate response from you about my Phishing
> suggestion. 

I hope it didn't come out like I was complaining or trying to tell you off
or anything.  I was just trying to explain the delay.

> I am grateful that you did respond even though you still
> had 268 items on your to do list.

Down to 259 now...some are hopelessly out of date, unfortunately.  That's one
reason it is better to report bugs (rather than ask questions) via the
SourceForge system, since at least those are tracked and will eventually be
dealt with.
I wish I could find the time to really answer all the mail that needs
answering, but I just don't have it available.  I hate working through the
backlog and finding problem-solving sessions that I never managed to
complete.

> Anyway, this has 2 very high spam tokens of URL:157 and URL:202 with
> ham 0 and spam 5.  But I noticed that all 3 of the PayPal hidden
> URLs had different IP addresses.  I read somewhere that the Phishers
> only use a given IP address for a few days (I assume they are then shut
> down by the authorities).  If I understand you correctly, you are
> thinking that the URL:NNN tokens will get marked as Spam, but it will
> take quite some time for that to happen since there are 256 different
> (maybe eventually 65536) allowed values.

My understanding of the workings of IP addresses is limited, but doesn't the
fact that spammers (particularly phishers, probably) change IP addresses so
quickly mean that splitting the address is better?  If they then switch from
(e.g.) 130.123.238.51 to 130.123.238.141, at least some of the tokens will
still be of use.  I'm guessing that the addresses will be at least somewhat
similar (those two are machines on my network here), which might not be
true, but it's possible.
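
To make the splitting idea concrete, here is a rough sketch.  It is not
actual SpamBayes tokenizer code: the function name is made up for
illustration, the "URL:NNN" token format simply follows the suggestion quoted
above, and the "/login" paths are invented.

```python
import re
from urllib.parse import urlparse

# Matches a dotted-quad host such as 130.123.238.51
DOTTED_QUAD = re.compile(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})")


def split_ip_tokens(url):
    """Return one 'URL:NNN' token per octet if the URL's host is an IPv4
    address, otherwise an empty list.  (Hypothetical helper, for
    illustration only.)"""
    host = urlparse(url).hostname or ""
    match = DOTTED_QUAD.fullmatch(host)
    if match is None:
        return []
    octets = match.groups()
    # Reject things like 999.1.1.1 that look like an address but are not.
    if any(int(octet) > 255 for octet in octets):
        return []
    return ["URL:" + octet for octet in octets]


old = split_ip_tokens("http://130.123.238.51/login")
new = split_ip_tokens("http://130.123.238.141/login")
# Tokens that a classifier trained on the old address would still recognise.
shared = sorted(set(old) & set(new))
print(shared)
```

With the two example addresses, three of the four tokens are shared, so
training on the first address would still carry over to the second.  The
cost is that each token is only one of 256 possible values, so any single
token says much less than the full address would.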

> I do understand that the URL: NNN idea needs to be tested.  I doubt
> that I could do such a test, so I guess it will just have to be left
> for some time in the far or near future for some Spambayes developer to
> try.

One of the main problems here is that in the early SpamBayes days there was
an intense amount of testing of different tokenization and classification
ideas, by lots of different people (which helps rule out things that just
suit a particular corpus).  That sort of testing is somewhat rare these days, so
it's uncommon for an idea to get testing results from more than one person
(or maybe two people).  If it does well with that much, then it might get
added as an experimental option, but that's a big if.

It wouldn't take much for me to code the simpler versions of this idea (the
lookup variations might be trickier, although probably not, but they would
take longer to test).  But since I don't experience this problem in my own
mail, I'm highly unlikely to get promising testing results.  I could add in
mail like the examples you gave me to my own mail, but I'd rather avoid a
mixed corpus.

Anyway, it is on the list in a way (you could add it to the ideas at
<http://entrian.com/sbwiki> if you like) and will probably be tried at some
point.  Feel free to suggest more ideas, too!

=Tony.Meyer

-- 
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.


