Don't count words multiple times, and you'll probably get fewer false positives. That's the main reason I don't do it -- because it magnifies the effect of some random word like water happening to have a big spam probability. (Incidentally, why so high? In my db it's only 0.3930784.) --pg

Tim Peters wrote:
FYI. After cleaning the blatant spam identified by the classifier out of my ham corpus, and replacing it with new random msgs from Barry's corpus, the reported false positive rate fell to about 0.2% (averaging 8 per each batch of 4000 ham test messages). This seems remarkable given that it's ignoring headers, and just splitting the raw text on whitespace in total ignorance of HTML & MIME etc.
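Concretely, the tokenization described is nothing fancier than a whitespace split; a minimal sketch (my own illustration, not the harness code):

```python
def tokenize(text):
    # Raw text split on whitespace -- no header, MIME, or HTML parsing.
    return text.split()

print(tokenize("Click here for FREE stuff!"))
# -> ['Click', 'here', 'for', 'FREE', 'stuff!']

# The reported rate: 8 false positives per batch of 4000 ham messages.
print(8 / 4000)  # -> 0.002, i.e. 0.2%
```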
'FREE' (all caps) moved into the ranks of best spam indicators. The false negative rate got reduced by a small amount, but I doubt it's a statistically significant reduction (I'll compute that stuff later; I'm looking for Big Things now).
Some of these false positives are almost certainly spam, and at least one is almost certainly a virus: these are msgs that are 100% base64-encoded, or maximally obfuscated quoted-printable. That could almost certainly be fixed by, e.g., decoding encoded text.
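One way to do that decoding, using the stdlib email package (a sketch of the idea; this is my choice of tooling, not necessarily what the classifier would use):

```python
from email import message_from_string

def decoded_text(raw_msg):
    """Decode base64/quoted-printable text parts so the tokenizer
    sees real words instead of encoding noise."""
    msg = message_from_string(raw_msg)
    chunks = []
    for part in msg.walk():
        if part.get_content_maintype() != 'text':
            continue
        payload = part.get_payload(decode=True)  # undoes the transfer encoding
        if payload is not None:
            chunks.append(payload.decode('latin-1', 'replace'))
    return '\n'.join(chunks)

# A 100% base64-encoded message is opaque to a whitespace tokenizer ...
raw = ('Content-Type: text/plain\n'
       'Content-Transfer-Encoding: base64\n'
       '\n'
       'RlJFRSBzdHVmZiBpbnNpZGU=\n')
# ... but decodes to ordinary text.
print(decoded_text(raw))  # -> FREE stuff inside
```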
The other false positives seem harder to deal with:
+ Brief HTML msgs from newbies. I doubt the headers will help these get through, as they're generally first-time posters, and their msgs aren't replies to earlier msgs. There's little positive content, while all elements of raw HTML have high "it's spam" probability.
Example:
"""
--------------=_4D4800B7C99C4331D7B8
Content-Description: filename="text1.txt"
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Is there a version of Python with Prolog Extension?? Where can I find it if there is?
Thanks, Luis.
P.S. Could you please reply to the sender too.
--------------=_4D4800B7C99C4331D7B8
Content-Description: filename="text1.html"
Content-Type: text/html
Content-Transfer-Encoding: quoted-printable
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<TITLE>Prolog Extension</TITLE>
<META NAME=3D"GENERATOR" CONTENT=3D"StarOffice/5.1 (Linux)">
<META NAME=3D"CREATED" CONTENT=3D"19991127;12040200">
<META NAME=3D"CHANGEDBY" CONTENT=3D"Luis Cortes">
<META NAME=3D"CHANGED" CONTENT=3D"19991127;12044700">
</HEAD>
<BODY>
<PRE>Is there a version of Python with Prolog Extension?? Where can I find it if there is?
Thanks, Luis.
P.S. Could you please reply to the sender too.</PRE> </BODY> </HTML>
--------------=_4D4800B7C99C4331D7B8--"""
Here's how it got scored:
prob = 0.999958816093
prob('<META') = 0.957529
prob('<META') = 0.957529
prob('<META') = 0.957529
prob('<BODY>') = 0.979284
prob('Prolog') = 0.01
prob('<HEAD>') = 0.97989
prob('Thanks,') = 0.0337316
prob('Prolog') = 0.01
prob('Python') = 0.01
prob('NAME=3D"GENERATOR"') = 0.99
prob('<HTML>') = 0.99
prob('</HTML>') = 0.989494
prob('</BODY>') = 0.987429
prob('Thanks,') = 0.0337316
prob('Python') = 0.01
Note that '<META' gets penalized 3 times. More on that later.
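For reference, the combined score is consistent with a Graham-style combining rule applied to the 15 listed probabilities; a sketch under that assumption (not the classifier's actual code):

```python
from math import prod

# The fifteen "most interesting" token probabilities reported above,
# including '<META' counted three times.
probs = [0.957529, 0.957529, 0.957529, 0.979284, 0.01, 0.97989,
         0.0337316, 0.01, 0.01, 0.99, 0.99, 0.989494, 0.987429,
         0.0337316, 0.01]

def combine(probs):
    # Graham-style combination: P = prod(p) / (prod(p) + prod(1-p)).
    # A handful of extreme probabilities swamps the few hammy ones.
    p = prod(probs)
    q = prod(1.0 - x for x in probs)
    return p / (p + q)

print(combine(probs))  # close to the reported 0.999958816093
```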
+ Msgs talking *about* HTML, and including HTML in examples. This one may be troublesome, but there are mercifully few of them.
+ Brief msgs with obnoxious employer-generated signatures. Example:
"""
Hi there,
I am looking for you recommendations on training courses available in the UK on Python. Can you help?
Thanks,
Vickie Mills IS Training Analyst
Tel: 0131 245 1127 Fax: 0131 245 1550 E-mail: vickie_mills@standardlife.com
For more information on Standard Life, visit our website http://www.standardlife.com/ The Standard Life Assurance Company, Standard Life House, 30 Lothian Road, Edinburgh EH1 2DH, is registered in Scotland (No SZ4) and regulated by the Personal Investment Authority. Tel: 0131 225 2552 - calls may be recorded or monitored. This confidential e-mail is for the addressee only. If received in error, do not retain/copy/disclose it without our consent and please return it to us. We virus scan all e-mails but are not responsible for any damage caused by a virus or alteration by a third party after it is sent. """
The scoring:
prob = 0.98654879055
prob('our') = 0.928936
prob('sent.') = 0.939891
prob('Tel:') = 0.0620155
prob('Thanks,') = 0.0337316
prob('received') = 0.940256
prob('Tel:') = 0.0620155
prob('Hi') = 0.0533333
prob('help?') = 0.01
prob('Personal') = 0.970976
prob('regulated') = 0.99
prob('Road,') = 0.01
prob('Training') = 0.99
prob('e-mails') = 0.987542
prob('Python.') = 0.01
prob('Investment') = 0.99
The brief human-written part is fine, but the longer boilerplate sig is indistinguishable from spam.
+ The occasional non-Python conference announcement(!). These are long, so I'll skip an example. In effect, it's automated bulk email trying to sell you a conference, so it's prone to use the language and artifacts of advertising. Here's typical scoring, for the TOOLS Europe '99 conference announcement:
prob = 0.983583974285
prob('THE') = 0.983584
prob('Object') = 0.01
prob('Bell') = 0.01
prob('Object-Oriented') = 0.01
prob('**************************************************************') = 0.99
prob('Bertrand') = 0.01
prob('Rational') = 0.01
prob('object-oriented') = 0.01
prob('CONTACT') = 0.99
prob('**************************************************************') = 0.99
prob('innovative') = 0.99
prob('**************************************************************') = 0.99
prob('Olivier') = 0.01
prob('VISIT') = 0.99
prob('OUR') = 0.99
Note the repeated penalty for the lines of asterisks. That segues into the next one:
+ Artifacts of the algorithm counting multiple instances of "a word" multiple times. These are baffling at first sight! The two clearest examples:
"""
Can you create and use new files with dbhash.open()?
Yes. But if I run db_dump on these files, it says "unexpected file type or format", regardless which db_dump version I use (2.0.77, 3.0.55, 3.1.17)
It may be that db_dump isn't compatible with version 1.85 database files. I can't remember. I seem to recall that there was an option to build 1.85 versions of db_dump and db_load. Check the configure options for BerkeleyDB to find out. (Also, while you are there, make sure that BerkeleyDB was built the same on both of your platforms...)
Try running db_verify (one of the utilities built when you compiled DB) on the file and see what it tells you.
There is no db_verify among my Berkeley DB utilities.
There should have been a bunch of them built when you compiled DB. I've got these: