Ham or Spam? (was RE: [Spambayes] RE: Central Limit Theorem??!! :))
Charles Cazabon
python-spambayes@discworld.dyndns.org
Fri, 27 Sep 2002 11:34:03 -0600
Tim Peters <tim.one@comcast.net> wrote:
> [Charles Cazabon]
> > I'd be most curious as to how ham HTML messages vs. spam HTML
> > messages compare with the above scheme if you longer strip HTML tags.
>
> "No longer", right?
Yup -- typing too fast for accuracy :). The more typos I make, the more
interested I am in the subject.
> > I realize you don't have unlimited time for testing, but it might be
> > useful if HTML spam messages rate as "high likelihood, high
> > confidence" while HTML ham is "high likelihood, lower confidence" ...
>
> I can run tests in the background easily enough, but this is something I
> can't test at all: there is almost no HTML ham in c.l.py traffic.
[...]
> If anyone else can test this, be my guest. In the absence of volunteers,
> I'll appoint Charles <wink>.
I can accumulate ham containing HTML easily enough (many of the mailing lists
I'm on get multipart/alternative messages from newbies all the time, and
html-only the odd time), but I'm afraid the machine I'd have to run this stuff
on (the mailserver) isn't up to the task -- it's an older machine, 64MB RAM
and no room for expansion. As soon as it's driven into swap, I may as well
not be running the tests at all.
Alternatively, I suppose I could try to put together a corpus for someone with
greater hardware resources to test. To have reasonable confidence in the
results, do you need a full 2000 messages each of ham and spam?
Charles
--
-----------------------------------------------------------------------
Charles Cazabon <python-spambayes@discworld.dyndns.org>
GPL'ed software available at: http://www.qcc.ca/~charlesc/software/
-----------------------------------------------------------------------