[spambayes-dev] Another tweak to try - asciify_subject

Skip Montanaro skip at pobox.com
Mon Nov 10 17:49:43 EST 2003


We're all familiar with the recent attempts to foil spam filters by adding
Latin-1 accents to message subjects (and sometimes to message bodies):

    We cän makë it lönger now

The attached context diff maps subjects through a "latscii" codec I wrote
which does little more than strip accents.  (It also maps various symbols to
reasonable ASCII equivalents, like mapping '¡' -> '!'.)  This showed a small
improvement in false negatives for me (one run won and nine tied on the timcv
meter, n == 10, 500 messages per bucket) and no change in false positives:

    false positive percentages
	0.600  0.600  tied          
	0.000  0.000  tied          
	0.200  0.200  tied          
	0.400  0.400  tied          
	0.000  0.000  tied          
	0.800  0.800  tied          
	0.200  0.200  tied          
	0.800  0.800  tied          
	0.200  0.200  tied          
	0.400  0.400  tied          

    won   0 times
    tied 10 times
    lost  0 times

    total unique fp went from 18 to 18 tied          
    mean fp % went from 0.36 to 0.36 tied          

    false negative percentages
	2.200  2.200  tied          
	1.000  1.000  tied          
	2.200  2.000  won     -9.09%
	3.000  3.000  tied          
	1.600  1.600  tied          
	2.000  2.000  tied          
	1.000  1.000  tied          
	2.000  2.000  tied          
	1.600  1.600  tied          
	1.400  1.400  tied          

    won   1 times
    tied  9 times
    lost  0 times

    total unique fn went from 90 to 89 won     -1.11%
    mean fn % went from 1.8 to 1.78 won     -1.11%

    ham mean                     ham sdev
       4.92    4.94   +0.41%       14.95   14.98   +0.20%
       5.14    5.16   +0.39%       15.47   15.48   +0.06%
       4.89    4.90   +0.20%       14.51   14.53   +0.14%
       5.31    5.34   +0.56%       15.80   15.85   +0.32%
       4.61    4.62   +0.22%       14.80   14.83   +0.20%
       5.71    5.75   +0.70%       17.21   17.28   +0.41%
       4.32    4.33   +0.23%       13.45   13.50   +0.37%
       4.83    4.85   +0.41%       14.83   14.87   +0.27%
       4.38    4.38   +0.00%       13.97   14.02   +0.36%
       5.96    5.97   +0.17%       17.38   17.40   +0.12%

    ham mean and sdev for all runs
       5.01    5.02   +0.20%       15.29   15.33   +0.26%

    spam mean                    spam sdev
      90.76   90.84   +0.09%       19.66   19.58   -0.41%
      91.16   91.23   +0.08%       17.64   17.57   -0.40%
      91.25   91.29   +0.04%       18.84   18.79   -0.27%
      88.31   88.36   +0.06%       22.55   22.49   -0.27%
      90.54   90.62   +0.09%       18.50   18.42   -0.43%
      91.64   91.68   +0.04%       17.75   17.69   -0.34%
      91.19   91.33   +0.15%       17.82   17.71   -0.62%
      91.66   91.69   +0.03%       18.76   18.74   -0.11%
      91.31   91.39   +0.09%       17.97   17.85   -0.67%
      91.87   91.96   +0.10%       17.07   16.97   -0.59%

    spam mean and sdev for all runs
      90.97   91.04   +0.08%       18.74   18.66   -0.43%

    ham/spam mean difference: 85.96 86.02 +0.06
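
To put the single "won" line in perspective, the arithmetic can be checked
by hand.  The sketch below assumes the percentage column is the plain
relative change, (after - before) / before, which matches the printed
figures, and that each bucket holds 500 spam.  The percent_change helper is
my own name for illustration, not something taken from the test scripts.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent (assumed formula)."""
    return (after - before) / before * 100.0


# Assumption: 500 spam per bucket, as in the run described above.
MESSAGES_PER_BUCKET = 500

# The one run that won: false negative rate 2.200% -> 2.000%.
print(round(percent_change(2.2, 2.0), 2))

# Total unique false negatives: 90 -> 89.
print(round(percent_change(90, 89), 2))

# How many messages a 0.2 percentage point drop is in one bucket.
print(round((2.2 - 2.0) / 100 * MESSAGES_PER_BUCKET))
```

In other words, the whole improvement is a single spam that moved out of the
false negative column.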

If you test this out, it will have no effect unless your training databases
contain messages which use this trick.  When I first ran it, I
hadn't factored in any recent messages and saw nothing.  After I ran
splitndirs.py over my current small (153 spam, 102 ham) training databases,
then ran rebal -n 300 followed by rebal -n 500 to stir the pot a bit, I saw
the above changes.
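
For anyone who wants to see the idea without applying the diff, here is a
minimal sketch of accent stripping.  It is not the latscii codec itself: it
leans on Unicode decomposition from the standard unicodedata module instead
of a hand-written table, and it is written for a current Python 3.  The
asciify name and the SYMBOL_MAP table are assumptions for illustration; only
the '¡' -> '!' mapping comes from the description above.

```python
import unicodedata

# Illustrative symbol table.  Only the first entry is mentioned in this
# message; the second is a guess at a "reasonable ASCII equivalent".
SYMBOL_MAP = {
    "¡": "!",
    "¿": "?",  # assumption, not taken from the codec
}


def asciify(text):
    """Strip accents and map a few symbols to ASCII (sketch only)."""
    out = []
    for ch in text:
        if ch in SYMBOL_MAP:
            out.append(SYMBOL_MAP[ch])
            continue
        # NFKD splits an accented letter into the base letter plus one or
        # more combining marks; dropping the marks leaves the base letter.
        decomposed = unicodedata.normalize("NFKD", ch)
        out.append("".join(c for c in decomposed
                           if not unicodedata.combining(c)))
    # Characters with no decomposition and no table entry pass through
    # unchanged.
    return "".join(out)


print(asciify("We cän makë it lönger now"))
```

Run over the sample subject at the top of this message, it prints
"We can make it longer now", so the tokenizer sees the same words it would
see in an unobfuscated subject.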

While I was at it, I wrote a simple Makefile to run the cross-validation
tests.  This should speed things up in the common case where your training
database and your base.ini file don't change (cutting processing time
approximately in half).  Use it like so:

    make BASE=std TRIAL=ascii

A plain 

    make

assumes your base and trial option files are std.ini and trial.ini,
respectively.

Skip

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Makefile
Type: application/octet-stream
Size: 737 bytes
Desc: Makefile for running cross validations
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031110/9d95bb33/Makefile.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sb.diff
Type: application/octet-stream
Size: 6759 bytes
Desc: asciify_subject tweak
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031110/9d95bb33/sb.obj

