[Spambayes] But will SpamBayes stop...
tim.one at comcast.net
Tue Jun 24 22:38:15 EDT 2003
> Note that if it is a message with a URL, then there is a potential
> solution for this with the url following code (testtools/urlslurper.py).
> There wasn't much in the way of testing results (mine only?) and some
> question about the approach, so it's never been added to the main code,
> but it could be, if this does become more common.
I agree that was an interesting idea. My classifiers do such a good job of
sparing me spam, though, that I have no personal motivation to test any
ideas anymore; and since testing stopped being part of my day job, it would
have to come out of spare time, of which there's approximately none. I
still study high- and low-scoring Unsures on the wrong end, and check in
changes when they reveal a clear flaw in the tokenizer. That seems to lead
to two checkins per year <wink>.
Note that the OP said the text was part of the image.
> If it is actually an image, I was thinking the other day about adding
> some tokens based on images. Even relatively simple things like
> checksums can be used to distinguish messages, so there might be a way
> to tokenize the content of the image without taking too long.
I don't follow. Extracting text from, say, .jpeg or .gif or .png files
isn't a matter of checksumming. Perhaps the idea is to checksum the entire
image, synthesizing a token from that in order to catch duplicate images?
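
To make the whole-image checksum reading concrete, here is a minimal
sketch. It is not SpamBayes code: the function name image_checksum_tokens
and the "image-sha256:" token prefix are made up for illustration, and only
the Python standard library is assumed. It walks a parsed message and emits
one synthetic token per image part.

```python
import hashlib
from email.message import Message


def image_checksum_tokens(msg: Message):
    """Yield one synthetic token per image part in msg.

    The token is a checksum of the decoded image bytes, so two messages
    carrying a byte-identical image produce the same token.
    (Hypothetical helper; the token format is an assumption.)
    """
    for part in msg.walk():
        # Only consider parts whose Content-Type is image/*.
        if part.get_content_maintype() != "image":
            continue
        # decode=True undoes the Content-Transfer-Encoding (e.g. base64)
        # and returns bytes, or None for multipart containers.
        payload = part.get_payload(decode=True)
        if not payload:
            continue
        yield "image-sha256:" + hashlib.sha256(payload).hexdigest()
```

A token like this only matches exact duplicates: change a single byte of
the image and the checksum is different. It also says nothing about any
text rendered inside the image.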
> (I then gave up on the idea because I have too much to do already and
> extremely few examples of this sort of spam, and because I don't have
> *any* examples of ham like this).
I bet you'll find more examples of this sort of spam than you suspect you
have if you dig into your *correctly* classified spam msgs. You'll find
examples of ham like this if you become Asian <wink>: legit commercial
Asian ham sometimes just includes a URL because they can't rely on
American-written browsers to display their character sets correctly.