Ham or Spam? (was RE: [Spambayes] RE: Central Limit Theorem??!! :))

Charles Cazabon python-spambayes@discworld.dyndns.org
Fri, 27 Sep 2002 11:34:03 -0600


Tim Peters <tim.one@comcast.net> wrote:
> [Charles Cazabon]
> > I'd be most curious as to how ham HTML messages vs. spam HTML
> > messages compare with the above scheme if you longer strip HTML tags.
> 
> "No longer", right?

Yup -- typing too fast for accuracy :).  The more typos I make, the more
interested I am in the subject.

> > I realize you don't have unlimited time for testing, but it might be
> > useful if HTML spam messages rate as "high likelihood, high
> > confidence" while HTML ham is "high likelihood, lower confidence" ...
> 
> I can run tests in the background easily enough, but this is something I
> can't test at all:  there is almost no HTML ham in c.l.py traffic.
[...]
> If anyone else can test this, be my guest.  In the absence of volunteers,
> I'll appoint Charles <wink>.

I can accumulate ham containing HTML easily enough (many of the mailing lists
I'm on get multipart/alternative messages from newbies all the time, and
html-only the odd time), but I'm afraid the machine I'd have to run this stuff
on (the mailserver) isn't up to the task -- it's an older machine, 64MB RAM
and no room for expansion.  As soon as it's driven into swap, I may as well
not be running the tests at all.

Alternatively, I suppose I could try to put together a corpus for someone with
greater hardware resources to test.  To have reasonable confidence in the
results, do you need a full 2000 messages each of ham and spam?
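
For what it's worth, pulling the HTML-bearing messages out of a list archive
could look roughly like the sketch below. It assumes the mail sits in an mbox
file and uses only the standard library; the function names, the paths, and
the 2000-message cap are placeholders of mine, not anything from the spambayes
code.

```python
import mailbox
from email.message import Message


def has_html_part(msg: Message) -> bool:
    """Return True if any MIME part of the message is text/html."""
    return any(part.get_content_type() == "text/html" for part in msg.walk())


def collect_html_messages(src_path: str, dst_path: str, limit: int = 2000) -> int:
    """Copy up to `limit` HTML-bearing messages from one mbox to another.

    Both paths are hypothetical; `limit` is an assumed cap, not a known
    requirement of the test setup.
    """
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)
    copied = 0
    try:
        for msg in src:
            if copied >= limit:
                break
            # Catches both multipart/alternative and HTML-only messages.
            if has_html_part(msg):
                dst.add(msg)
                copied += 1
        dst.flush()
    finally:
        src.close()
        dst.close()
    return copied
```

Messages are read one at a time, so this should stay well clear of swap even
on a small machine; the resulting mbox can then be shipped to whoever runs
the tests.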

Charles
-- 
-----------------------------------------------------------------------
Charles Cazabon                 <python-spambayes@discworld.dyndns.org>
GPL'ed software available at:     http://www.qcc.ca/~charlesc/software/
-----------------------------------------------------------------------