<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">

  <title></title>

</head>

<body bgcolor="#ffffff" text="#333399">

Skip, I understand your concern - too much data is sometimes worse than

not enough. And I'm not a mathematician by any stretch of the

imagination. Programmer, yes, hardware designer, yes, PC guru, yes (in

another life), Geek of the week 5 times running, yes, but not

mathematician.<br>

<br>

But if I start using a tool that looks for certain words, patterns,

phrases, and so on in messages in order to identify similar messages in

the future, then it's really counter-intuitive to say "don't train it

so much, you'll break it." Why not?<br>

<br>

The problem I've run into (every day now) is that no matter how much or

how little training I do, SB/TB doesn't work the way everyone tells me a

bayesian filter is supposed to. And when I try to give examples, people

say "oh, that's not the way it works!". What way does it work, then?

I'm either training it too much, or not enough!<br>

<br>

If I look at the scores given by Spambayes and Thunderbayes to words

like penis, viagra, etc, they are ridiculously low compared to words

like "sell" or "friend" or "the" - when those 'bad' words don't occur -

<i>ever - </i>in good messages! Never, ever, ever. So the technology

should be able to look at those words and say "hey, this word only ever

appears in bad messages, so I'm going to weight it like hell and mark

it as bad!", without me needing to train it. That didn't happen.<br>

<br>

By way of example (and this probably says as much about my lack of

mathematical knowledge as anything else!):<br>

<br>

I had a "penis" list of messages. These messages all contained the word

"penis". I had 1,000 of these messages, which typically consisted of

20-150 words each, with a lexical word base of about 150 words. Then, I

had a list of identical messages, with the word "penis" taken out. So I

reset SB (completely), then trained it on the bad messages as spam, and

the "edited" messages as ham. When I fed it a new message, identical to

another of the previously trained bad messages, it scored it as 8%. How

in the name of the Oxford Abridged Dictionary can that be calculated as

right?<br>

<br>

I admit this isn't the way SB should be used. But it couldn't handle

such a fundamentally simple case: out of 150 words, 149 are common to

both sets, and the one "uncommon" word - which appeared 1,000 times,

only in bad messages - was weighted at about 1 chance in 16 of being

"bad". Meanwhile the word "yesterday" was weighted at 22%. Those are the

kind of numbers that make me say "this tool doesn't work". I thought

more training would fix the problems, but it doesn't. That's my mistake!<br>

<br>
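To put numbers on what I expected, here's a rough sketch in Python of a

per-word score using the Robinson-style adjustment that is usually

described for these filters. The function name, the message counts and

the <i>s</i> and <i>x</i> values are my own assumptions for illustration, not

anything pulled out of SB's code:<br>

<pre>
```python
def token_spam_prob(spam_count, ham_count, n_spam, n_ham, s=0.45, x=0.5):
    """Per-word spam probability with a Robinson-style prior adjustment.

    s (prior strength) and x (score for an unseen word) are assumed values.
    """
    # Fraction of spam and of ham messages that contain the word.
    spam_ratio = spam_count / n_spam if n_spam else 0.0
    ham_ratio = ham_count / n_ham if n_ham else 0.0
    total = spam_ratio + ham_ratio
    if total == 0:
        return x  # never seen anywhere: fall back to the neutral prior
    raw = spam_ratio / total
    # Blend the raw estimate with the prior; more sightings means less prior.
    n = spam_count + ham_count
    return (s * x + n * raw) / (s + n)


# Assumed counts: 1,000 spam and 1,000 ham messages trained.
print(token_spam_prob(1000, 0, 1000, 1000))    # word seen only in spam
print(token_spam_prob(400, 400, 1000, 1000))   # word seen equally in both
```
</pre>

With those assumed counts, a word seen 1,000 times in spam and never in

ham comes out above 0.99, and a word split evenly between the two sets

lands on 0.5. That's the behaviour I was expecting, not 1 chance in 16.<br>

<br>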

OK, there are other considerations, such as the mail headers, character

formatting, misspellings, character sets, and so on - but there is no

easy (or difficult, for that matter) way to configure ANY bayesian filter

I've EVER seen to work the way I need it. I can't seem to tell it to

ignore headers - and I know for a fact that header data is taken into

account when training - but since so much of the header data (taken as

individual words) is <b>identical</b>, where's the tool to let me tell

SB not to score "x-Mozilla-Status" or "Envelope-to" in every single

damn piece of email I get? But if I look at the training dbs (assuming

I can find them and see what's in them), I find those same terms are

weighted/scored just as high as "bad" words!<br>

<br>
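What I'm asking for isn't complicated. Here's a minimal sketch in Python

of the kind of filter I mean, run over the per-word scores before a

message is classified. The stop list, the function name and the 0.1

cutoff are all hypothetical; this is not an option I've found in SB or

TB, just what I'd like to be able to configure:<br>

<pre>
```python
# Hypothetical stop list of header words that turn up in every message.
IGNORED_HEADER_TOKENS = frozenset({"x-mozilla-status", "envelope-to"})


def informative_tokens(token_probs, ignored=IGNORED_HEADER_TOKENS, min_strength=0.1):
    """Return only the words worth scoring.

    token_probs maps each word to its spam probability (0.0 to 1.0).
    """
    kept = {}
    for token, prob in token_probs.items():
        if token.lower() in ignored:
            continue  # stop-listed header word
        if abs(prob - 0.5) >= min_strength:
            kept[token] = prob  # far enough from neutral to matter
    return kept


probs = {"penis": 0.99, "X-Mozilla-Status": 0.9, "yesterday": 0.52, "the": 0.5}
print(informative_tokens(probs))
```
</pre>

With those made-up scores, only "penis" survives: the header word is

dropped by name no matter what it scored, and the near-neutral words are

dropped for sitting too close to 0.5.<br>

<br>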

So what I'm narked off about (and don't misunderstand me, I'm

absolutely frustrated as hell at wasting close to 100 hours with

installing, training, resetting, uninstalling, reinstalling, retraining,

and then rescuing email from the junk pile and simultaneously manually

putting spam into the training folder or the junk folder, and repeat)

is the fact that the <b>technology </b>doesn't work the way it's

supposed to. Not for me, anyway. And I'm a poster boy for nerds,

believe me.<br>

<br>

Maybe the problem is SB is trying to be all things to all users. But

what I've learned from the last 5 months is that the SB and TB tools

are not easy or friendly or configurable. They're difficult to get

going, difficult to maintain, unable to be truly configured to specific

needs (that are actually common across every smtp/pop email client on

every OS), and they don't like not being trained almost as much as they

hate being overtrained!<br>

<br>

The one time I tried to "get into" a spam table (after exporting it to

XML and re-importing it with some words weighted more heavily), it

completely broke TB - not because of the weighting I used, but

because the XML import filter doesn't actually import valid XML, and

the export XML filter doesn't export valid XML.<br>

<br>

Yep, I'm sticking to what works for me until I have enough free time to

try and add something to the bayes community. I really do want to try

and improve the bayes filter technology, because it should work better.<br>

<br>

Thomas Hruska wrote:

<blockquote cite="mid:475AD7B7.4070708@cubiclesoft.com" type="cite"><a class="moz-txt-link-abbreviated" href="mailto:skip@pobox.com">skip@pobox.com</a>

wrote:

  <br>

  <blockquote type="cite">&nbsp;&nbsp;&nbsp; Pete&gt; Way too many false negatives

(still running at around 7%, after

    <br>

&nbsp;&nbsp;&nbsp; Pete&gt; 13,000+ spam training messages and 50,000+ good training

    <br>

&nbsp;&nbsp;&nbsp; Pete&gt; messages), <br>

Way too large a database.&nbsp; Train on just mistakes and unsures.&nbsp; If

you've

    <br>

trained on over 60,000 messages you must be training on everything you

    <br>

receive.

    <br>

    <br>

Good luck with K9.&nbsp; Sounds like it's doing the trick.

    <br>

    <br>

Skip

    <br>

  </blockquote>

  <br>

This only proves that Spambayes needs an autobalancing ham/spam feature

built in by default.&nbsp; Users train on everything in the hopes of

eliminating all spam from the in-box.&nbsp; Also, by _your_ logic, the

default training mechanism in Spambayes should be to NOT train on spam.

&nbsp;VERY counter-intuitive.

  <br>

  <br>

In terms of usability, Spambayes is clearly designed "by geeks, for

geeks" but since this tool has appeared in major computing magazines

that _users_ read, the tool needs to change to fit the mindset of those

who will actually use the product.&nbsp; What Pete said has crossed my mind

quite frequently while using the tool.&nbsp; Your focus is on "training

database size" rather than the user's actual complaint:&nbsp; That the

product is not _usable_.&nbsp; I can use and understand the product only

because I'm a geek.&nbsp; However, it needs to be significantly simplified

so a user can use it.

  <br>

  <br>

Maybe your goal is only to cater to geeks.&nbsp; If that's the case, you

need to state it somewhere at the top of your homepage and drop the

support for the Outlook add-in - at which point I too will probably

stop using the tool because there is no hope for it...ever.&nbsp; Users will

not take the time to learn to use the tool how you want them to use

it.&nbsp; If they see spam, they are going to train it as spam no matter how

large their training database gets.&nbsp; That is how users think.&nbsp;

Developing software is more about psychology than code:&nbsp; Study the user

and code accordingly.

  <br>

  <br>

Sorry for the rant.&nbsp; I've been feeling the same way as Pete and wanted

to put what he said into a little different perspective - perhaps one

that you'd understand better.&nbsp; You completely ignored Pete's very

lengthy e-mail on what it means to be a user of Spambayes from his

perspective and instantly focused on the one sentence that is useless

to him but is "comfortable" for you.&nbsp; I'm hoping this helps craft an

improved tool rather than write Pete, me, and other users like us off

as "annoying".&nbsp; I know that you are going to be upset when you read

this but I don't care if you hate me as long as you end up going back

and pondering Pete's e-mail.&nbsp; His words, from just that one e-mail, are

capable of guiding how Spambayes should be developed for the next 5

years.

  <br>

  <br>

</blockquote>

<br>

<div class="moz-signature">-- <br>


<p style="margin-bottom: 0cm;"><br>

</p>

<p style="margin-bottom: 0cm;"><br>

</p>

<p style="margin-bottom: 0cm;" align="left"><font size="1"><b>Peter Naus</b></font></p>

<p style="margin-bottom: 0cm;" align="left"><font color="#008000"><font

 size="2"><i>Audio

Engineering Manager</i></font></font></p>

<p style="margin-bottom: 0cm;" align="left"><font color="#008080"><font

 face="Microsoft Sans Serif"><font size="4"><b>Audiography</b></font></font></font></p>

<p style="margin-bottom: 0cm;" align="left"><font face="Arial"><font

 size="1">20

Churinga Avenue, Mitcham. Victoria. 3132. Australia.</font></font></p>

<p style="margin-bottom: 0cm;" align="left"><font color="#ff0000"><font

 face="Arial"><font size="1"><b>Freecall:&nbsp;1300

78 4576</b></font></font></font></p>

<p style="margin-bottom: 0cm;" align="left"><b><font color="#ff0000"><font

 face="Arial"><font size="1">Phone:</font></font></font><font

 color="#ff0000">

</font><font color="#ff0000"><font face="Arial"><font size="1">+613

8802-4562</font></font></font></b></p>

<p style="margin-bottom: 0cm;" align="left"><font color="#ff0000"><font

 face="ari"><font size="1"><b>e-mail:</b></font></font></font><font

 color="#006666"><font face="Courier New"><font size="1">&nbsp;</font></font></font><a

 href="mailto:support@audiography.com.au"><font face="Courier New"><font

 size="1">support@audiography.com.au</font></font></a></p>

<p style="margin-bottom: 0cm;" align="left"><font color="#ff0000"><font

 face="Courier New"><font size="1"><b>web:</b></font></font></font><font

 face="Courier New"><font size="1">

<a href="http://www.audiography.com.au/">http://www.audiography.com.au</a></font></font></p>

<p style="margin-bottom: 0cm;"><br>

</p>

</div>

</body>

</html>