[Spambayes] Watch out for digests...

Tim Peters tim.one at comcast.net
Sun Dec 14 22:09:04 EST 2003

[Skip Montanaro]
> I guess my operational definitions of "synthetic" and "natural"
> tokens are in order:
>     "natural tokens" are those which derive simply by splitting the
>     message body on whitespace boundaries.
>     "synthetic tokens" are those which are not "natural tokens".

OK.  Now I've forgotten why you drew the distinction to begin with <0.9 wink>.

[about busting apart IP addrs]
> I don't know.  I agree those look backwards (that's my mail server,
> BTW). OTOH, given the fairly random assignment of IP networks, I
> doubt it makes much sense for the above IP address to be stripped of
> more than the last two octets ("received:",
> "received:199.249.165" and "received:199.249").  "received:199",
> where 199 is the first octet, not the last, almost certainly means
> nothing.  If it's spammy or hammy, it's just by sheer coincidence.

In that case, the database will learn it; since it can't generate more than
126 legitimate "Class A" tokens total, it's a trivial database burden.
OTOH, for someone in the DOD, it may be valuable to know that email came
from a DOD Class A network.  On the third hand, spammers often forge
Received headers, and I doubt most do research to forge sensible IPs.  IOW,
the system learns what does and doesn't work, in both directions, provided
only that it's shown potentially interesting stuff.
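
For concreteness, here is a rough sketch of generating one token per
leading-octet prefix and letting training decide which ones matter.  It is
not the actual Spambayes tokenizer code: the name ip_prefix_tokens, the tag
parameter, and the example address 192.0.2.17 (a documentation-only address)
are assumptions made for illustration.

```python
import ipaddress


def ip_prefix_tokens(addr, tag="received:"):
    """Return one token per leading-octet prefix of an IPv4 address.

    Hypothetical helper for illustration only; the tag prefix mirrors
    the "received:" tokens quoted in this thread.
    """
    try:
        # Validate the input; anything that isn't a dotted quad is skipped.
        ip = ipaddress.IPv4Address(addr)
    except ValueError:
        return []
    octets = str(ip).split(".")
    # Prefixes of 1, 2, 3 and 4 octets, shortest first.
    return [tag + ".".join(octets[:n]) for n in range(1, 5)]


for token in ip_prefix_tokens("192.0.2.17"):
    print(token)
```

The one-octet token is the cheap one argued about above: it can only take a
small, fixed number of values, so keeping it costs almost nothing even if it
turns out to carry no signal.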

> ...
> Again, no more general than the first two octets (a class B network).
> Class A networks are very rare (for obvious reasons):
>     http://euclid.math.brandeis.edu/turtschi/whois/neta1.html

They're rarer than that now -- that's over 4 years old, and lots of those
have been busted up.  Since current practice is to assign a range of initial
bits instead of initial bytes, maybe we should generate all *bit* prefixes
instead.  That would sure test whether correlation is our friend <wink>.
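
A sketch of what "all bit prefixes" might look like, assuming each prefix is
written as a CIDR-style network.  Nothing here is existing Spambayes code:
the name ip_bit_prefix_tokens, the tag parameter, the CIDR spelling of the
tokens, and the example address 192.0.2.17 are all assumptions.

```python
import ipaddress


def ip_bit_prefix_tokens(addr, tag="received:"):
    """Return one token per leading-bit prefix (1 to 32 bits) of an IPv4 address.

    Hypothetical helper for illustration only.  Each token names the
    network obtained by keeping the first n bits and zeroing the rest.
    """
    try:
        ipaddress.IPv4Address(addr)  # reject anything that isn't IPv4
    except ValueError:
        return []
    tokens = []
    for nbits in range(1, 33):
        # strict=False masks off the host bits instead of raising.
        net = ipaddress.IPv4Network(f"{addr}/{nbits}", strict=False)
        tokens.append(tag + str(net))
    return tokens


tokens = ip_bit_prefix_tokens("192.0.2.17")
print(len(tokens))
print(tokens[7], tokens[23])
```

That is 32 tokens per address instead of 4, and each token's range contains
the next one's, so the tokens are about as strongly correlated as tokens can
be.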
