[Spambayes] But will SpamBayes stop...

Tim Peters tim.one at comcast.net
Tue Jun 24 22:38:15 EDT 2003

[Meyer, Tony]
> Note that if it is a message with a URL, then there is a potential
> solution for this with the URL-following code (testtools/urlslurper.py).
> There wasn't much in the way of testing results (mine only?) and some
> question about the approach, so it's never been added to the main code,
> but it could be, if this does become more common.

I agree that was an interesting idea.  My classifiers do such a good job of
sparing me spam, though, that I have no personal motivation to test any
ideas anymore; and since testing stopped being part of my day job, it would
have to come out of spare time, of which there's approximately none.  I
still study high- and low-scoring Unsures on the wrong end, and check in
changes when they reveal a clear flaw in the tokenizer.  That seems to lead
to two checkins per year <wink>.

Note that the OP said the text was part of the image.

> If it is actually an image, I was thinking the other day about adding
> some tokens based on images.  Even relatively simple things like
> checksums can be used to distinguish messages, so there might be a way
> to tokenize the content of the image without taking too long.

I don't follow.  Extracting text from, say, .jpeg or .gif or .png files
isn't a matter of checksumming.  Perhaps the idea is to checksum the entire
image, synthesizing a token from that in order to catch duplicate images
across spams?
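
If the whole-image-checksum reading is the intended one, a minimal sketch
might look like the following.  It is not SpamBayes code: the function
names, the "image-sha1:" token prefix, and the choice of SHA-1 are all
assumptions made for illustration, and it uses only the standard library's
email and hashlib modules.

```python
import hashlib
from email.message import Message
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def image_checksum_tokens(msg: Message):
    """Yield one synthetic token per image part in msg.

    The "image-sha1:" prefix is a hypothetical token format, not one
    the real tokenizer emits.
    """
    for part in msg.walk():
        if part.get_content_maintype() != "image":
            continue
        # decode=True undoes the base64/quoted-printable transfer encoding,
        # so the checksum covers the raw image bytes.
        payload = part.get_payload(decode=True)
        if not payload:
            continue
        yield "image-sha1:" + hashlib.sha1(payload).hexdigest()


def demo_message(image_bytes: bytes) -> Message:
    """Build a small multipart message carrying one (fake) GIF attachment."""
    msg = MIMEMultipart()
    msg.attach(MIMEText("see attached"))
    # _subtype is given explicitly so no image-type sniffing is needed.
    msg.attach(MIMEImage(image_bytes, _subtype="gif"))
    return msg


if __name__ == "__main__":
    for token in image_checksum_tokens(demo_message(b"GIF89a-not-a-real-image")):
        print(token)
```

Two messages carrying byte-identical images would produce the same token,
which is all this catches: changing a single byte of the image changes the
checksum, and none of it reads the text inside the image.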

> (I then gave up on the idea because I have too much to do already and
> extremely few examples of this sort of spam, and because I don't have
> *any* examples of ham like this).

I bet you'll find more examples of this sort of spam than you suspect you
have if you dig into your *correctly* classified spam msgs.  You'll find
examples of ham like this if you become Asian <wink>:  legit commercial
Asian ham sometimes just includes a URL because they can't rely on
American-written browsers to display their character sets correctly.

