[spambayes-dev] Another tweak to try - asciify_subject
Skip Montanaro
skip at pobox.com
Mon Nov 10 17:49:43 EST 2003
We're all familiar with the recent attempts to foil spam filters by adding
Latin-1 accents to message subjects (and sometimes to message bodies):
    We cän makë it lönger now
The attached context diff maps subjects through a "latscii" codec I wrote
which does little more than strip accents. (It also maps various symbols to
reasonable ASCII equivalents, like mapping '¡' -> '!'.) This showed a small
improvement in false negatives for me (one winning run out of 10 with timcv,
n == 10, 500 messages per bucket; 90 unique fn down to 89) and no change in
false positives:
false positive percentages
    0.600  0.600  tied
    0.000  0.000  tied
    0.200  0.200  tied
    0.400  0.400  tied
    0.000  0.000  tied
    0.800  0.800  tied
    0.200  0.200  tied
    0.800  0.800  tied
    0.200  0.200  tied
    0.400  0.400  tied
won   0 times
tied 10 times
lost  0 times
total unique fp went from 18 to 18 tied
mean fp % went from 0.36 to 0.36 tied
false negative percentages
    2.200  2.200  tied
    1.000  1.000  tied
    2.200  2.000  won    -9.09%
    3.000  3.000  tied
    1.600  1.600  tied
    2.000  2.000  tied
    1.000  1.000  tied
    2.000  2.000  tied
    1.600  1.600  tied
    1.400  1.400  tied
won   1 times
tied  9 times
lost  0 times
total unique fn went from 90 to 89 won    -1.11%
mean fn % went from 1.8 to 1.78 won    -1.11%
ham mean                      ham sdev
     4.92   4.94  +0.41%      14.95  14.98  +0.20%
     5.14   5.16  +0.39%      15.47  15.48  +0.06%
     4.89   4.90  +0.20%      14.51  14.53  +0.14%
     5.31   5.34  +0.56%      15.80  15.85  +0.32%
     4.61   4.62  +0.22%      14.80  14.83  +0.20%
     5.71   5.75  +0.70%      17.21  17.28  +0.41%
     4.32   4.33  +0.23%      13.45  13.50  +0.37%
     4.83   4.85  +0.41%      14.83  14.87  +0.27%
     4.38   4.38  +0.00%      13.97  14.02  +0.36%
     5.96   5.97  +0.17%      17.38  17.40  +0.12%
ham mean and sdev for all runs
     5.01   5.02  +0.20%      15.29  15.33  +0.26%
spam mean                     spam sdev
    90.76  90.84  +0.09%      19.66  19.58  -0.41%
    91.16  91.23  +0.08%      17.64  17.57  -0.40%
    91.25  91.29  +0.04%      18.84  18.79  -0.27%
    88.31  88.36  +0.06%      22.55  22.49  -0.27%
    90.54  90.62  +0.09%      18.50  18.42  -0.43%
    91.64  91.68  +0.04%      17.75  17.69  -0.34%
    91.19  91.33  +0.15%      17.82  17.71  -0.62%
    91.66  91.69  +0.03%      18.76  18.74  -0.11%
    91.31  91.39  +0.09%      17.97  17.85  -0.67%
    91.87  91.96  +0.10%      17.07  16.97  -0.59%
spam mean and sdev for all runs
    90.97  91.04  +0.08%      18.74  18.66  -0.43%
ham/spam mean difference: 85.96 86.02 +0.06
If you test this out, it will have no effect if you don't have any messages
in your training databases which use this trick. When I first ran it, I
hadn't factored in any recent messages and saw nothing. After I ran
splitndirs.py over my current small (153 spam, 102 ham) training databases,
then ran rebal -n 300 followed by rebal -n 500 to stir the pot a bit, I saw
the above changes.
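For anyone reading without the attachment, the accent-stripping idea can be sketched roughly as below. This is not the latscii codec from the diff (its contents are not shown here); it is a stand-in that uses the standard unicodedata module to decompose accented letters and drop the combining marks. The SYMBOL_MAP table and the asciify() name are assumptions made for illustration only.

```python
import unicodedata

# Hypothetical symbol table; the mappings in the real latscii codec
# are not shown in this post, apart from the inverted exclamation mark.
SYMBOL_MAP = {
    "\u00a1": "!",  # inverted exclamation mark -> '!'
    "\u00bf": "?",  # inverted question mark -> '?' (assumed, not from the post)
}


def asciify(text):
    """Strip accents and map a few symbols to ASCII look-alikes."""
    out = []
    for ch in text:
        if ch in SYMBOL_MAP:
            out.append(SYMBOL_MAP[ch])
            continue
        # NFKD splits an accented letter into its base letter followed by
        # combining marks; keep everything except the combining marks.
        for part in unicodedata.normalize("NFKD", ch):
            if not unicodedata.combining(part):
                out.append(part)
    return "".join(out)


print(asciify("We cän makë it lönger now"))
```

Characters with no decomposition (for example 'ß' or 'ø') pass through unchanged in this sketch, so a real codec would probably need explicit table entries for them.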
While I was at it, I wrote a simple Makefile to run the cross validation
tests. This should speed things up in the common case where your training
database and your base.ini file don't change (cutting processing time
approximately in half). Use it like so:
    make BASE=std TRIAL=ascii
A plain
    make
assumes your base and trial option files are std.ini and trial.ini,
respectively.
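The time saving comes from make's usual rule: a result is rebuilt only when it is missing or older than something it depends on. Since the Makefile itself was scrubbed from the archive, the following is only a sketch of that rule in Python, not a transcription of the attachment. The is_stale() helper and the idea of a std.txt results file are assumptions for illustration.

```python
import os


def is_stale(target, sources):
    """Return True if target is missing or older than any of its sources.

    This mirrors how make decides whether a target must be rebuilt.
    """
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)


# Hypothetical usage: rerun the base cross validation only when std.ini
# has changed since std.txt (an assumed results file) was written.
# if is_stale("std.txt", ["std.ini"]):
#     ...rerun the base test and save its output to std.txt...
```

With a check like this, only the trial run has to be repeated when just the trial options change, which is where the roughly halved processing time would come from.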
Skip
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Makefile
Type: application/octet-stream
Size: 737 bytes
Desc: Makefile for running cross validations
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031110/9d95bb33/Makefile.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sb.diff
Type: application/octet-stream
Size: 6759 bytes
Desc: asciify_subject tweak
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031110/9d95bb33/sb.obj