[Python-Dev] The first trustworthy <wink> GBayes results
Paul Graham
pg@archub.org
28 Aug 2002 19:47:44 -0000
Don't count words multiple times, and you'll probably
get fewer false positives. That's the main reason I
don't do it-- because it magnifies the effect of some
random word like water happening to have a big spam
probability. (Incidentally, why so high? In my db it's
only 0.3930784.) --pg
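
A minimal sketch of the point above, in Python 3.8+. It assumes the per-message score is the usual naive-Bayes combination prod(p) / (prod(p) + prod(1 - p)) over the selected clues; that assumption does reproduce the first score quoted below from the 15 probabilities listed for it. The names combine and first_occurrences are made up for illustration and are not taken from the GBayes code.

```python
from math import prod


def combine(probs):
    """Combine per-token spam probabilities into one message score."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


def first_occurrences(scored_tokens):
    """Keep only the first (token, prob) pair for each distinct token."""
    seen = set()
    unique = []
    for token, p in scored_tokens:
        if token not in seen:
            seen.add(token)
            unique.append((token, p))
    return unique


# The 15 clues listed below for the short HTML message, repeats included.
html_msg = [
    ('<META', 0.957529), ('<META', 0.957529), ('<META', 0.957529),
    ('<BODY>', 0.979284), ('Prolog', 0.01), ('<HEAD>', 0.97989),
    ('Thanks,', 0.0337316), ('Prolog', 0.01), ('Python', 0.01),
    ('NAME=3D"GENERATOR"', 0.99), ('<HTML>', 0.99), ('</HTML>', 0.989494),
    ('</BODY>', 0.987429), ('Thanks,', 0.0337316), ('Python', 0.01),
]

print(round(combine([p for _, p in html_msg]), 6))  # 0.999959
print(len(first_occurrences(html_msg)))             # 10 distinct tokens
```

Only 10 of the 15 slots hold distinct tokens. With repeats dropped, the five freed slots would go to the next most extreme distinct words in the message, which the listing does not show, so the sketch does not try to predict the deduplicated score.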
--Tim Peters wrote:
> FYI. After cleaning the blatant spam identified by the classifier out of my
> ham corpus, and replacing it with new random msgs from Barry's corpus, the
> reported false positive rate fell to about 0.2% (averaging 8 per batch
> of 4000 ham test messages). This seems remarkable given that it's ignoring
> headers, and just splitting the raw text on whitespace in total ignorance of
> HTML & MIME etc.
>
> 'FREE' (all caps) moved into the ranks of best spam indicators. The false
> negative rate got reduced by a small amount, but I doubt it's a
> statistically significant reduction (I'll compute that stuff later; I'm
> looking for Big Things now).
>
> Some of these false positives are almost certainly spam, and at least one is
> almost certainly a virus: these are msgs that are 100% base64-encoded, or
> maximally obfuscated quoted-printable. That could almost certainly be fixed
> by, e.g., decoding encoded text.
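
A rough sketch of that decoding step, using only the Python standard library (quopri and base64). The helper name decode_body is hypothetical, and it assumes the Content-Transfer-Encoding value has already been pulled out of the part's headers; a real tokenizer would more likely let the email package walk the MIME structure.

```python
import base64
import quopri


def decode_body(payload: bytes, transfer_encoding: str) -> bytes:
    """Undo the transfer encoding of one MIME part before tokenizing."""
    enc = transfer_encoding.strip().lower()
    if enc == "quoted-printable":
        return quopri.decodestring(payload)
    if enc == "base64":
        return base64.b64decode(payload)
    # 7bit, 8bit, binary, or anything unrecognized: leave untouched.
    return payload


line = b'<META NAME=3D"GENERATOR" CONTENT=3D"StarOffice/5.1 (Linux)">'
print(decode_body(line, "quoted-printable").split())
```

After decoding, the token reads NAME="GENERATOR" rather than NAME=3D"GENERATOR", so the same attribute is counted as the same word whether or not the sender's mailer used quoted-printable.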
>
> The other false positives seem harder to deal with:
>
> + Brief HTML msgs from newbies. I doubt the headers will help these
> get through, as they're generally first-time posters, and aren't
> replies to earlier msgs. There's little positive content, while
> all elements of raw HTML have high "it's spam" probability.
>
> Example:
>
> """
> --------------=_4D4800B7C99C4331D7B8
> Content-Description: filename="text1.txt"
> Content-Type: text/plain; charset=ISO-8859-1
> Content-Transfer-Encoding: quoted-printable
>
> Is there a version of Python with Prolog Extension??
> Where can I find it if there is?
>
> Thanks,
> Luis.
>
> P.S. Could you please reply to the sender too.
>
>
> --------------=_4D4800B7C99C4331D7B8
> Content-Description: filename="text1.html"
> Content-Type: text/html
> Content-Transfer-Encoding: quoted-printable
>
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
> <HTML>
> <HEAD>
> <TITLE>Prolog Extension</TITLE>
> <META NAME=3D"GENERATOR" CONTENT=3D"StarOffice/5.1 (Linux)">
> <META NAME=3D"CREATED" CONTENT=3D"19991127;12040200">
> <META NAME=3D"CHANGEDBY" CONTENT=3D"Luis Cortes">
> <META NAME=3D"CHANGED" CONTENT=3D"19991127;12044700">
> </HEAD>
> <BODY>
> <PRE>Is there a version of Python with Prolog Extension??
> Where can I find it if there is?
>
> Thanks,
> Luis.
>
> P.S. Could you please reply to the sender too.</PRE>
> </BODY>
> </HTML>
>
> --------------=_4D4800B7C99C4331D7B8--
> """
>
> Here's how it got scored:
>
> prob = 0.999958816093
> prob('<META') = 0.957529
> prob('<META') = 0.957529
> prob('<META') = 0.957529
> prob('<BODY>') = 0.979284
> prob('Prolog') = 0.01
> prob('<HEAD>') = 0.97989
> prob('Thanks,') = 0.0337316
> prob('Prolog') = 0.01
> prob('Python') = 0.01
> prob('NAME=3D"GENERATOR"') = 0.99
> prob('<HTML>') = 0.99
> prob('</HTML>') = 0.989494
> prob('</BODY>') = 0.987429
> prob('Thanks,') = 0.0337316
> prob('Python') = 0.01
>
> Note that '<META' gets penalized 3 times. More on that later.
>
> + Msgs talking *about* HTML, and including HTML in examples. This one
> may be troublesome, but there are mercifully few of them.
>
> + Brief msgs with obnoxious employer-generated signatures. Example:
>
> """
> Hi there,
>
> I am looking for you recommendations on training courses available in the UK
> on Python. Can you help?
>
> Thanks,
>
> Vickie Mills
> IS Training Analyst
>
> Tel: 0131 245 1127
> Fax: 0131 245 1550
> E-mail: vickie_mills@standardlife.com
>
> For more information on Standard Life, visit our website
> http://www.standardlife.com/ The Standard Life Assurance Company, Standard
> Life House, 30 Lothian Road, Edinburgh EH1 2DH, is registered in Scotland
> (No SZ4) and regulated by the Personal Investment Authority. Tel: 0131 225
> 2552 - calls may be recorded or monitored. This confidential e-mail is for
> the addressee only. If received in error, do not retain/copy/disclose it
> without our consent and please return it to us. We virus scan all e-mails
> but are not responsible for any damage caused by a virus or alteration by a
> third party after it is sent.
> """
>
> The scoring:
>
> prob = 0.98654879055
> prob('our') = 0.928936
> prob('sent.') = 0.939891
> prob('Tel:') = 0.0620155
> prob('Thanks,') = 0.0337316
> prob('received') = 0.940256
> prob('Tel:') = 0.0620155
> prob('Hi') = 0.0533333
> prob('help?') = 0.01
> prob('Personal') = 0.970976
> prob('regulated') = 0.99
> prob('Road,') = 0.01
> prob('Training') = 0.99
> prob('e-mails') = 0.987542
> prob('Python.') = 0.01
> prob('Investment') = 0.99
>
> The brief human-written part is fine, but the longer boilerplate sig is
> indistinguishable from spam.
>
> + The occasional non-Python conference announcement(!). These are
> long, so I'll skip an example. In effect, it's automated bulk email
> trying to sell you a conference, so is prone to use the language and
> artifacts of advertising. Here's typical scoring, for the TOOLS
> Europe '99 conference announcement:
>
> prob = 0.983583974285
> prob('THE') = 0.983584
> prob('Object') = 0.01
> prob('Bell') = 0.01
> prob('Object-Oriented') = 0.01
> prob('**************************************************************') = 0.99
> prob('Bertrand') = 0.01
> prob('Rational') = 0.01
> prob('object-oriented') = 0.01
> prob('CONTACT') = 0.99
> prob('**************************************************************') = 0.99
> prob('innovative') = 0.99
> prob('**************************************************************') = 0.99
> prob('Olivier') = 0.01
> prob('VISIT') = 0.99
> prob('OUR') = 0.99
>
> Note the repeated penalty for the lines of asterisks. That segues into the
> next one:
>
> + Artifacts of the algorithm counting multiple instances of "a word"
> multiple times. These are baffling at first sight! The two clearest
> examples:
>
> """
> > > Can you create and use new files with dbhash.open()?
> >
> > Yes. But if I run db_dump on these files, it says "unexpected file type
> > or format", regardless which db_dump version I use (2.0.77, 3.0.55,
> > 3.1.17)
> >
>
> It may be that db_dump isn't compatible with version 1.85 databse files. I
> can't remember. I seem to recall that there was an option to build 1.85
> versions of db_dump and db_load. Check the configure options for
> BerkeleyDB to find out. (Also, while you are there, make sure that
> BerkeleyDB was built the same on both of your platforms...)
>
>
> >
> > > Try running db_verify (one of the utilities built
> > > when you compiled DB) on the file and see what it tells you.
> >
> > There is no db_verify among my Berkeley DB utilities.
>
> There should have been a bunch of them built when you compiled DB. I've got
> these:
>