[Spambayes] More messin' around - common email prefixes
Skip Montanaro
skip at pobox.com
Sun Dec 8 04:27:55 EST 2002
I modified the tokenizer to generate tokens related to common prefixes in
email addresses. One observation several people have made is that some
spammers send out email to clumps of alphabetically similar addresses. One
spam I received recently was sent to
To: <itinerart@videotron.ca>
Cc: <itinerant@skyful.com>, <itinerant@netillusions.net>,
<itineraries@musi-cal.com>, <itinerario@rullet.leidenuniv.nl>,
<itinerance@sorengo.com>
I fooled around a bit generating tokens that take into account the length of
the common prefix and the number of recipients. The score in each token is
the product of the length of the common prefix and the number of recipients,
integer-divided by 10. In the above case I score it a '4' ((6 * 7) // 10:
six recipients, seven-character common prefix "itinera"). I only generate
the token if there is more than one recipient and a non-zero common prefix.
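The scoring rule above can be sketched roughly as follows. This is only an
illustration, not the actual tokenizer change: the function name
prefix_token is made up, and comparing whole lowercased addresses (rather
than just the part before the '@') is an assumption.

```python
import os.path


def prefix_token(addresses):
    """Return a 'pfxlen:N' token for a list of recipient addresses, or None.

    Hypothetical helper: compares whole lowercased address strings, which
    is an assumption about how the prefix is measured.
    """
    addrs = [a.strip().lower() for a in addresses if a.strip()]
    if len(addrs) < 2:
        # Only score messages with more than one recipient.
        return None
    # commonprefix works character by character on a list of strings.
    prefix = os.path.commonprefix(addrs)
    if not prefix:
        # No shared leading characters, so no token.
        return None
    score = (len(prefix) * len(addrs)) // 10
    return "pfxlen:%d" % score


recipients = [
    "itinerart@videotron.ca",
    "itinerant@skyful.com",
    "itinerant@netillusions.net",
    "itineraries@musi-cal.com",
    "itinerario@rullet.leidenuniv.nl",
    "itinerance@sorengo.com",
]
print(prefix_token(recipients))
```

For the six recipients in the spam above, the shared prefix is "itinera"
(7 characters), so the sketch yields pfxlen:4.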
Here's the distribution of tokens in my database (13144 hams, 6662 spams):
('pfxlen:0', (18, 209))
('pfxlen:1', (48, 32))
('pfxlen:2', (42, 10))
('pfxlen:3', (24, 2))
('pfxlen:4', (23, 0))
('pfxlen:5', (16, 0))
('pfxlen:6', (16, 0))
('pfxlen:7', (11, 0))
('pfxlen:8', (6, 0))
('pfxlen:9', (4, 0))
('pfxlen:10', (5, 0))
('pfxlen:11', (1, 0))
('pfxlen:12', (1, 0))
('pfxlen:14', (1, 0))
('pfxlen:17', (1, 0))
('pfxlen:18', (1, 0))
('pfxlen:19', (1, 0))
('pfxlen:24', (1, 0))
('pfxlen:28', (1, 0))
Not too surprisingly, higher scores are more strongly associated with spam
than with ham. This distribution suggests to me that perhaps I should squash
the scores into two distinct tokens, one for scores of 0 or 1, and one for
all higher scores.
I'll try that out in a bit.
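A minimal sketch of that squashing, assuming the integer score has already
been computed. The function name and the two token names are placeholders
I made up for illustration, not names used in the tokenizer.

```python
def squash_score(score):
    """Collapse a prefix score into one of two tokens.

    Scores of 0 or 1 share one token; all higher scores share the other.
    Both token names here are hypothetical.
    """
    if score <= 1:
        return "pfxlen:low"
    return "pfxlen:high"


for s in (0, 1, 2, 28):
    print(s, squash_score(s))
```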
Skip