[Spambayes] test sets?
Tim Peters
tim.one@comcast.net
Sun, 08 Sep 2002 20:00:55 -0400
[Tim, to Anthony]
>> Do you know whether src added additional power, or did you do both at
>> once?
[Anthony Baxter]
> Both at once. I added it because <iframe src=cid:foofoofoo> is such a
> killer detector of spam/viruses, also because I got a bunch of email
> spam that was just
>
> <img src="bozo.bozo.kr/img34532.jpg">
> <img src="bozo.bozo.kr/img34512.jpg">
> <img src="bozo.bozo.kr/img34237.jpg">
> <img src="bozo.bozo.kr/img34914.jpg">
I added src= tokenization, and the result was identical f-n and f-p rates
across all 20 of my "standard runs": 0 differences. I suspect this is
because img tags *really* look more like
<img src="http://bozo.bozo.kr/img34914.jpg">
and our tokenizer was already picking up http thingies regardless of their
context.
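
To illustrate why src= tokenization can add nothing once URLs are already picked up regardless of context, here is a minimal sketch. The regexes, the function names (url_tokens, src_tokens), and the "proto:"/"url:" token prefixes are all assumptions made up for this example; they are not the actual Spambayes tokenizer code.

```python
import re

# Hypothetical patterns for illustration only, not the real tokenizer.
# Matches scheme-qualified URLs anywhere in the text, inside tags or not.
URL_RE = re.compile(r"(https?|ftp)://([^\s<>\"']+)", re.IGNORECASE)
# Matches the value of a src= attribute, quoted or unquoted.
SRC_RE = re.compile(r"""\bsrc\s*=\s*["']?([^\s"'>]+)""", re.IGNORECASE)


def split_pieces(value):
    """Split a URL remainder on '/' into lowercase 'url:' tokens."""
    return {"url:" + piece.lower() for piece in value.split("/") if piece}


def url_tokens(text):
    """Tokens from any scheme-qualified URL, regardless of context."""
    tokens = set()
    for scheme, rest in URL_RE.findall(text):
        tokens.add("proto:" + scheme.lower())
        tokens |= split_pieces(rest)
    return tokens


def src_tokens(text):
    """Tokens from src= attribute values only."""
    tokens = set()
    for value in SRC_RE.findall(text):
        # Drop a leading scheme so the pieces line up with url_tokens().
        value = re.sub(r"^[a-z]+://", "", value, flags=re.IGNORECASE)
        tokens |= split_pieces(value)
    return tokens


with_scheme = '<img src="http://bozo.bozo.kr/img34914.jpg">'
without_scheme = '<img src="bozo.bozo.kr/img34532.jpg">'

# With http:// present, src= tokenization yields nothing the URL pass lacks.
print(src_tokens(with_scheme) - url_tokens(with_scheme))
# Without a scheme, only the src= pass sees the host and file name.
print(url_tokens(without_scheme), src_tokens(without_scheme))
```

Under these assumptions, src= tokens only carry new information when the attribute value has no http:// prefix, as in the scheme-less examples quoted above or a cid: reference.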
> I'm already stripping out HTML tags - it was producing far far too
> many false positives with my corpus with them in. Without the src/hrefs
> these spams were pretty much null and void.
I'd like to strip them too <wink>; but see earlier recent msgs about that.