[Spambayes] Training from scratch

Richie Hindle richie@entrian.com
Tue Nov 19 23:59:26 2002


I started a new database from scratch yesterday morning at work, and
trained it via the web interface as the messages arrived.  Courtesy of the
shiny new pop3graph.py (as yet uncommitted), this is how it behaved over
the first 36 hours:


   . - Number of messages over time
   * - Number of correctly classified messages over time


 |                                                 . 99
 |                                                .
 |                                               .
 |                                              .
 |                                             .
 |                                            .
 |                                           .
 |                                          .
 |                                         .
 |                                        .
 |                                       .
 |                                      .
 |                                     .           * 74
 |                                    .           *
 |                                   .           *
 |                                  .          **
 |                                 .          *
 |                                .          *
 |                               .          *
 |                              .          *
 |                             .          *
 |                            .          *
 |                           .         **
 |                          .        **
 |                         .        *
 |                        .        *
 |                       .       **
 |                      .       *
 |                     .       *
 |                    .       *
 |                   .       *
 |                  .      **
 |                 .     **
 |                .     *
 |               .     *
 |              .     *
 |             .    **
 |            .    *
 |           .    *
 |          .   **
 |         .   *
 |        .   *
 |       .   *
 |      .   *
 |     .  **
 |    . **
 |   ***
 |  *
 | *
 ___________________________________________________


(The graph should really plot the derivative of the second line as well, but
 you can see that it very quickly became close to parallel with the total.)
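For anyone wanting to reproduce the two lines and the missing derivative, here is a minimal sketch. It is not the code from pop3graph.py (which isn't committed yet); the function names and the list-of-booleans input are assumptions made purely for illustration. The windowed accuracy is a rough stand-in for the slope of the "correct" line relative to the total.

```python
def cumulative_counts(outcomes):
    """Return running totals for the two plotted lines.

    outcomes: iterable of booleans, True when a message was classified
    correctly (hypothetical input format, not pop3graph.py's).
    """
    totals, corrects = [], []
    correct = 0
    for n, ok in enumerate(outcomes, start=1):
        correct += bool(ok)
        totals.append(n)          # the '.' line
        corrects.append(correct)  # the '*' line
    return totals, corrects


def windowed_accuracy(outcomes, window=10):
    """Fraction correct in each consecutive block of `window` messages.

    A value near 1.0 means the '*' line is running parallel to the
    '.' line over that block.
    """
    outcomes = list(outcomes)
    rates = []
    for start in range(0, len(outcomes), window):
        block = outcomes[start:start + window]
        rates.append(sum(block) / len(block))
    return rates
```

Taken over the whole run, the end points of the graph give 74 correct out of 99, or about 75%; the windowed figure is what would show the later messages doing much better than that average.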

This is utterly unscientific, I know, but very encouraging.  Not one of the
misclassifications was a false positive (FP)!  That's probably because most
of the early messages I trained it on were hams.  If I'm right, this could
be worth bearing in mind when thinking about training strategies: since FPs
are more damaging than false negatives (FNs), maybe people should be
encouraged (forced?) to train on a bunch of hams before any spams.
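
One way the "hams first" idea could be enforced is sketched below. This is only an illustration: the HamFirstTrainer class, the train callable, and the min_hams threshold of 20 are all hypothetical and not part of the Spambayes code. Spams that arrive too early are held back and trained once enough hams have been seen.

```python
class HamFirstTrainer:
    """Defer spam training until a minimum number of hams are trained.

    `train` is any callable taking (message, is_spam); it stands in for
    whatever the real training call is (an assumption for illustration).
    """

    def __init__(self, train, min_hams=20):
        self.train = train
        self.min_hams = min_hams  # arbitrary threshold, not a tested value
        self.hams_trained = 0
        self.deferred_spam = []

    def add(self, message, is_spam):
        if is_spam and self.hams_trained < self.min_hams:
            # Too early for spam: hold it until enough hams are trained.
            self.deferred_spam.append(message)
            return
        self.train(message, is_spam)
        if not is_spam:
            self.hams_trained += 1
            if self.hams_trained >= self.min_hams:
                self._flush()

    def _flush(self):
        # Train any held-back spams, in arrival order.
        pending, self.deferred_spam = self.deferred_spam, []
        for message in pending:
            self.train(message, True)
```

Whether a fixed threshold like this actually reduces FPs would need testing; the sketch only shows the ordering.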

-- 
Richie Hindle
richie@entrian.com



