[Spambayes] test sets?

Tim Peters tim.one@comcast.net
Sun, 08 Sep 2002 20:00:55 -0400


[Tim, to Anthony]
>> Do you know whether src added additional power, or did you do both at
>> once?

[Anthony Baxter]
> Both at once. I added it because <iframe src=cid:foofoofoo> is such a
> killer detector of spam/viruses, also because I got a bunch of email
> spam that was just
>
>     <img src="bozo.bozo.kr/img34532.jpg">
>     <img src="bozo.bozo.kr/img34512.jpg">
>     <img src="bozo.bozo.kr/img34237.jpg">
>     <img src="bozo.bozo.kr/img34914.jpg">

I added src= tokenization, and the result was identical f-n and f-p rates
across all 20 of my "standard runs":  0 differences.  I suspect this is
because img tags *really* look more like

     <img src="http://bozo.bozo.kr/img34914.jpg">

and our tokenizer was already picking up http thingies regardless of their
context.
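
To illustrate that guess, here is a minimal sketch. It is not the actual
tokenizer code: the regexps, the "url:" token prefix, and the function names
url_tokens and src_hosts are all made up for this example. It assumes a
context-free URL scanner that only fires on scheme-prefixed URLs, and shows
that src= parsing adds nothing new when the scheme is present, but would
matter for a bare src value like the ones quoted above.

```python
import re

# Hypothetical patterns for illustration only; not the project's tokenizer.
URL_RE = re.compile(r"(https?|ftp)://([^\s<>\"']+)", re.IGNORECASE)
SRC_RE = re.compile(r"\bsrc\s*=\s*[\"']?([^\s<>\"']+)", re.IGNORECASE)


def url_tokens(text):
    """Tokens for every scheme-prefixed URL, regardless of surrounding markup."""
    tokens = set()
    for _scheme, rest in URL_RE.findall(text):
        host = rest.split("/", 1)[0].lower()
        tokens.add("url:" + host)
    return tokens


def src_hosts(text):
    """Host part of each src= attribute value, with or without a scheme."""
    hosts = set()
    for value in SRC_RE.findall(text):
        # Drop a leading "scheme://" if there is one.
        value = re.sub(r"^[a-z]+://", "", value, flags=re.IGNORECASE)
        hosts.add(value.split("/", 1)[0].lower())
    return hosts


with_scheme = '<img src="http://bozo.bozo.kr/img34914.jpg">'
without_scheme = '<img src="bozo.bozo.kr/img34914.jpg">'

print(url_tokens(with_scheme))     # {'url:bozo.bozo.kr'}
print(url_tokens(without_scheme))  # set()
print(src_hosts(without_scheme))   # {'bozo.bozo.kr'}
```

In the first case the generic URL scan already yields the host, so a separate
src= token carries no extra information; only the scheme-less form would give
src= parsing something new to find.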

> I'm already stripping out HTML tags - it was producing far far too
> many false positives with my corpus with them in. Without the src/hrefs
> these spams were pretty much null and void.

I'd like to strip them too <wink>; but see earlier recent msgs about that.