[Python-Dev] The first trustworthy <wink> GBayes results

Paul Graham pg@archub.org
28 Aug 2002 19:47:44 -0000


Don't count words multiple times, and you'll probably
get fewer false positives.  That's the main reason I
don't do it-- because it magnifies the effect of some 
random word like water happening to have a big spam
probability. (Incidentally, why so high?  In my db it's 
only 0.3930784.)  --pg
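
To make the suggestion concrete, here is a minimal sketch of clue selection that counts each distinct token once. The function name, the 15-clue limit, and the 0.4 default for unseen words are assumptions for illustration, not a description of anyone's actual code.

```python
def most_extreme_clues(tokens, spam_prob, max_clues=15, unknown=0.4, dedupe=True):
    """Return the (token, prob) pairs farthest from the neutral 0.5.

    spam_prob maps token -> estimated spam probability; tokens missing
    from it get the assumed `unknown` value.
    """
    if dedupe:
        # dict.fromkeys drops repeats while keeping first-seen order
        tokens = list(dict.fromkeys(tokens))
    scored = [(tok, spam_prob.get(tok, unknown)) for tok in tokens]
    # sort by distance from 0.5, most extreme first (the sort is stable)
    scored.sort(key=lambda clue: abs(clue[1] - 0.5), reverse=True)
    return scored[:max_clues]


# Toy data: three of the probabilities quoted below, nothing more.
probs = {"<META": 0.957529, "Prolog": 0.01, "Python": 0.01}
tokens = ["<META", "<META", "<META", "Prolog", "Python"]

once = most_extreme_clues(tokens, probs)
every = most_extreme_clues(tokens, probs, dedupe=False)
```

With dedupe=True a repeated token fills one clue slot instead of several, so a single dubious word cannot crowd out the other evidence.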

--Tim Peters wrote:
> FYI.  After cleaning the blatant spam identified by the classifier out of my
> ham corpus, and replacing it with new random msgs from Barry's corpus, the
> reported false positive rate fell to about 0.2% (averaging 8 per batch
> of 4000 ham test messages).  This seems remarkable given that it's ignoring
> headers, and just splitting the raw text on whitespace in total ignorance of
> HTML & MIME etc.
> 
> 'FREE' (all caps) moved into the ranks of best spam indicators.  The false
> negative rate got reduced by a small amount, but I doubt it's a
> statistically significant reduction (I'll compute that stuff later; I'm
> looking for Big Things now).
> 
> Some of these false positives are almost certainly spam, and at least one is
> almost certainly a virus:  these are msgs that are 100% base64-encoded, or
> maximally obfuscated quoted-printable.  That could almost certainly be fixed
> by, e.g., decoding encoded text.
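
One way to do that decoding, sketched with the standard library's email package. The helper name and the Latin-1 fallback are assumptions, and this is only an illustration, not the tokenizer under test.

```python
import base64
import email
from email import policy


def decoded_text_parts(raw_bytes):
    """Yield the decoded text of every text/* part in a raw message."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        try:
            # undoes base64 / quoted-printable and applies the charset
            yield part.get_content()
        except LookupError:
            # unknown charset label: fall back to Latin-1 (an assumption)
            payload = part.get_payload(decode=True) or b""
            yield payload.decode("latin-1")


# A tiny base64-encoded message built for the example.
body = base64.b64encode(b"FREE money")
raw = (
    b"Content-Type: text/plain; charset=us-ascii\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n" + body + b"\r\n"
)

tokens = [tok for text in decoded_text_parts(raw) for tok in text.split()]
```

Splitting the decoded text on whitespace then sees 'FREE' and 'money' rather than one opaque base64 blob.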
> 
> The other false positives seem harder to deal with:
> 
> + Brief HTML msgs from newbies.  I doubt the headers will help these
>   get through, as they're generally first-time posters, and aren't
>   replies to earlier msgs.  There's little positive content, while
>   all elements of raw HTML have high "it's spam" probability.
> 
> Example:
> 
> """
> --------------=_4D4800B7C99C4331D7B8
> Content-Description: filename="text1.txt"
> Content-Type: text/plain; charset=ISO-8859-1
> Content-Transfer-Encoding: quoted-printable
> 
> Is there a version of Python with Prolog Extension??
> Where can I find it if there is?
> 
> Thanks,
> Luis.
> 
> P.S. Could you please reply to the sender too.
> 
> 
> --------------=_4D4800B7C99C4331D7B8
> Content-Description: filename="text1.html"
> Content-Type: text/html
> Content-Transfer-Encoding: quoted-printable
> 
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
> <HTML>
> <HEAD>
>         <TITLE>Prolog Extension</TITLE>
>         <META NAME=3D"GENERATOR" CONTENT=3D"StarOffice/5.1 (Linux)">
>         <META NAME=3D"CREATED" CONTENT=3D"19991127;12040200">
>         <META NAME=3D"CHANGEDBY" CONTENT=3D"Luis Cortes">
>         <META NAME=3D"CHANGED" CONTENT=3D"19991127;12044700">
> </HEAD>
> <BODY>
> <PRE>Is there a version of Python with Prolog Extension??
> Where can I find it if there is?
> 
> Thanks,
> Luis.
> 
> P.S. Could you please reply to the sender too.</PRE>
> </BODY>
> </HTML>
> 
> --------------=_4D4800B7C99C4331D7B8--
> """
> 
> Here's how it got scored:
> 
> prob = 0.999958816093
> prob('<META') = 0.957529
> prob('<META') = 0.957529
> prob('<META') = 0.957529
> prob('<BODY>') = 0.979284
> prob('Prolog') = 0.01
> prob('<HEAD>') = 0.97989
> prob('Thanks,') = 0.0337316
> prob('Prolog') = 0.01
> prob('Python') = 0.01
> prob('NAME=3D"GENERATOR"') = 0.99
> prob('<HTML>') = 0.99
> prob('</HTML>') = 0.989494
> prob('</BODY>') = 0.987429
> prob('Thanks,') = 0.0337316
> prob('Python') = 0.01
> 
> Note that '<META' gets penalized 3 times.  More on that later.
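
As a worked check on the scoring above, here is a sketch that combines the 15 listed clues. It assumes the combining rule from "A Plan for Spam" (product of the probabilities over that product plus the product of their complements); the function name is made up for the example.

```python
from math import prod


def combine(probs):
    """Combine per-word spam probabilities into one message score."""
    spam = prod(probs)                   # evidence for spam
    ham = prod(1.0 - p for p in probs)   # evidence for ham
    return spam / (spam + ham)


# The 15 clues listed above, in order, with '<META' appearing three times.
clues = [
    0.957529, 0.957529, 0.957529,  # '<META' x3
    0.979284,                      # '<BODY>'
    0.01,                          # 'Prolog'
    0.97989,                       # '<HEAD>'
    0.0337316,                     # 'Thanks,'
    0.01,                          # 'Prolog'
    0.01,                          # 'Python'
    0.99,                          # 'NAME=3D"GENERATOR"'
    0.99,                          # '<HTML>'
    0.989494,                      # '</HTML>'
    0.987429,                      # '</BODY>'
    0.0337316,                     # 'Thanks,'
    0.01,                          # 'Python'
]

score = combine(clues)
print(score)
```

Under that assumption the result comes out very close to the reported prob, and each extra copy of '<META' multiplies the spam-to-ham odds by roughly 22.5.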
> 
> + Msgs talking *about* HTML, and including HTML in examples.  This one
>   may be troublesome, but there are mercifully few of them.
> 
> + Brief msgs with obnoxious employer-generated signatures.  Example:
> 
> """
> Hi there,
> 
> I am looking for you recommendations on training courses available in the UK
> on Python.  Can you help?
> 
> Thanks,
> 
> Vickie Mills
> IS Training Analyst
> 
> Tel:    0131 245 1127
> Fax:    0131 245 1550
> E-mail:    vickie_mills@standardlife.com
> 
> For more information on Standard Life, visit our website
> http://www.standardlife.com/   The Standard Life Assurance Company, Standard
> Life House, 30 Lothian Road, Edinburgh EH1 2DH, is registered in Scotland
> (No SZ4) and regulated by the Personal Investment Authority.  Tel: 0131 225
> 2552 - calls may be recorded or monitored.  This confidential e-mail is for
> the addressee only.  If received in error, do not retain/copy/disclose it
> without our consent and please return it to us.  We virus scan all e-mails
> but are not responsible for any damage caused by a virus or alteration by a
> third party after it is sent.
> """
> 
> The scoring:
> 
> prob = 0.98654879055
> prob('our') = 0.928936
> prob('sent.') = 0.939891
> prob('Tel:') = 0.0620155
> prob('Thanks,') = 0.0337316
> prob('received') = 0.940256
> prob('Tel:') = 0.0620155
> prob('Hi') = 0.0533333
> prob('help?') = 0.01
> prob('Personal') = 0.970976
> prob('regulated') = 0.99
> prob('Road,') = 0.01
> prob('Training') = 0.99
> prob('e-mails') = 0.987542
> prob('Python.') = 0.01
> prob('Investment') = 0.99
> 
> The brief human-written part is fine, but the longer boilerplate sig is
> indistinguishable from spam.
> 
> + The occasional non-Python conference announcement(!).  These are
>   long, so I'll skip an example.  In effect, it's automated bulk email
>   trying to sell you a conference, so is prone to use the language and
>   artifacts of advertising.  Here's typical scoring, for the TOOLS
>   Europe '99 conference announcement:
> 
> prob = 0.983583974285
> prob('THE') = 0.983584
> prob('Object') = 0.01
> prob('Bell') = 0.01
> prob('Object-Oriented') = 0.01
> prob('**************************************************************') = 0.99
> prob('Bertrand') = 0.01
> prob('Rational') = 0.01
> prob('object-oriented') = 0.01
> prob('CONTACT') = 0.99
> prob('**************************************************************') = 0.99
> prob('innovative') = 0.99
> prob('**************************************************************') = 0.99
> prob('Olivier') = 0.01
> prob('VISIT') = 0.99
> prob('OUR') = 0.99
> 
> Note the repeated penalty for the lines of asterisks.  That segues into the
> next one:
> 
> + Artifacts of the algorithm counting multiple instances of "a word"
>   multiple times.  These are baffling at first sight!  The two clearest
>   examples:
> 
> """
> > > Can you create and use new files with dbhash.open()?
> >
> > Yes. But if I run db_dump on these files, it says "unexpected file type
> > or format", regardless which db_dump version I use (2.0.77, 3.0.55,
> > 3.1.17)
> >
> 
> It may be that db_dump isn't compatible with version 1.85 databse files.  I
> can't remember.  I seem to recall that there was an option to build 1.85
> versions of db_dump and db_load.  Check the configure options for
> BerkeleyDB to find out.  (Also, while you are there, make sure that
> BerkeleyDB was built the same on both of your platforms...)
> 
> 
> >
> > >  Try running db_verify (one of the utilities built
> > > when you compiled DB) on the file and see what it tells you.
> >
> > There is no db_verify among my Berkeley DB utilities.
> 
> There should have been a bunch of them built when you compiled DB.  I've got
> these:
>