[Spambayes] Watch out for digests...
tim.one at comcast.net
Sun Dec 14 22:09:04 EST 2003
> I guess my operational definitions of "synthetic" and "natural"
> tokens are in order:
> "natural tokens" are those which derive simply by splitting the
> message body on whitespace boundaries.
> "synthetic tokens" are those which are not "natural tokens".
OK. Now I've forgotten why you drew the distinction to begin with <0.9 wink>.
[about busting apart IP addrs]
> I don't know. I agree those look backwards (that's my mail server,
> BTW). OTOH, given the fairly random assignment of IP networks, I
> doubt it makes much sense for the above IP address to be stripped of
> more than the last two octets ("received:220.127.116.11",
> "received:199.249.165" and "received:199.249"). "received:199",
> where 199 is the first octet, not the last, almost certainly means
> nothing. If it's spammy or hammy, it's just by sheer coincidence.
In that case, the database will learn it; since it can't generate more than
126 legitimate "Class A" tokens total, it's a trivial database burden.
OTOH, for someone in the DOD, it may be valuable to know that email came
from a DOD Class A network. On the third hand, spammers often forge
Received headers, and I doubt most do research to forge sensible IPs. IOW,
the system learns what does and doesn't work, in both directions, provided
only that it's shown potentially interesting stuff.
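To make the byte-prefix idea concrete, here is a minimal sketch of generating one token per leading-octet prefix of an address found in a Received header. It is not the actual Spambayes tokenizer code: the function name ip_prefix_tokens is hypothetical, the "received:" tag just mirrors the tokens quoted above, and 192.0.2.10 is a documentation-range address used only for illustration.

```python
def ip_prefix_tokens(ip, tag="received:"):
    """Return one token per leading-octet prefix of a dotted-quad address.

    Hypothetical helper for illustration; the classifier is left to learn
    which prefixes (if any) carry signal.
    """
    octets = ip.split(".")
    # Prefixes of length 1, 2, 3 and 4 octets, shortest first.
    return [tag + ".".join(octets[:n]) for n in range(1, len(octets) + 1)]


tokens = ip_prefix_tokens("192.0.2.10")
# tokens == ["received:192", "received:192.0",
#            "received:192.0.2", "received:192.0.2.10"]
```

The one-octet token costs almost nothing in the database, so emitting it and letting training decide whether it matters is the cheap option.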
> Again, no more general than the first two octets (a class B network).
> Class A networks are very rare (for obvious reasons):
They're rarer than that now -- that's over 4 years old, and lots of those
have been busted up. Since current practice is to assign a range of initial
bits instead of initial bytes, maybe we should generate all *bit* prefixes
instead. That would sure test whether correlation is our friend <wink>.
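A rough sketch of what "all bit prefixes" could look like, using the standard-library ipaddress module to mask off host bits at every prefix length from 1 to 32. The function name ip_bit_prefix_tokens and the CIDR-style token spelling are assumptions made for illustration, not anything Spambayes does.

```python
import ipaddress


def ip_bit_prefix_tokens(ip, tag="received:"):
    """Return 32 tokens, one per bit-prefix length of an IPv4 address.

    Hypothetical helper: each token is the enclosing network in CIDR
    notation, e.g. "received:192.0.0.0/8" for the first 8 bits.
    """
    tokens = []
    for nbits in range(1, 33):
        # strict=False zeroes the host bits instead of raising ValueError.
        net = ipaddress.ip_network(f"{ip}/{nbits}", strict=False)
        tokens.append(tag + str(net))
    return tokens


tokens = ip_bit_prefix_tokens("192.0.2.10")
# 32 tokens per address; tokens[7] is the 8-bit prefix and
# tokens[23] the 24-bit prefix.
```

That is 32 heavily correlated tokens per address instead of 4, which is exactly why it would be a good stress test of how the classifier copes with correlated evidence.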