[Spambayes] Tokenizing numbers and money

Rob Hooft rob@hooft.net
Wed Oct 16 20:40:45 2002


Tim Peters wrote:

> You can try it, although it fights the "stupid beats smart" meta-rule.  It's
> easy to think of examples in the other direction too.  For example, I get an
> electronic order receipt with an order number, and a few days later get a
> shipping confirmation referencing the same number.  If I trained on the
> order receipt between times, that "senseless number" is certainly going to
> help the shipping confirmation score low.
> 
> 
>>How about something like tokens for
>>
>>    num:float     (e.g. 3624.2)
>>    num:int       (e.g. 3629)
>>    num:intpair   (e.g. 439,443)
>>    num:$1        (for amounts between $0.00 and $9.99)
>>    num:$10       (for amounts between $10 and $99.99)
>>    num:$100      (for amounts between $100 and $999.99)
>>    num:$1000     (for amounts between $1k and $10k)
>>    num:$huge     (for amounts >$10k)
>>
>>Each of these might have "logarithm suffixes"? Is this unrealistic?
> 
> 
> It's realistic to try it, but more expensive than the tokenization we do now
> (we do nothing at all for "words" of under 13 chars now except determine
> their length; the split-on-whitespace business goes at C speed).

More expensive, but I haven't noticed the extra cost yet. First results:
it makes no difference.

cv5: original code
cv8: with "num:XXX" tokens for simple numerics
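
For readers who want to try something similar, here is a minimal sketch of
a "num:XXX" tokenizer step. It is not the code used for cv8. The class names
are borrowed from the token list further down, but the regular expressions,
the helper names numeric_token and tokenize, and the reading of the numeric
suffix as the length of the original word are all assumptions.

```python
import re

# Hypothetical patterns: the class names mirror the token list in this post,
# but every regex here is a guess at what each class might mean.
_PATTERNS = [
    ("intpair",   re.compile(r"\d+,\d+")),
    ("exclmoney", re.compile(r"\$\d[\d,]*(\.\d+)?!+")),   # e.g. $100!
    ("fracmoney", re.compile(r"\$\d[\d,]*\.\d+")),        # e.g. $9.99
    ("money",     re.compile(r"\$\d[\d,]*")),             # e.g. $100
    ("expfloat",  re.compile(r"[-+]?\d*\.?\d+[eE][-+]?\d+")),
    ("signfloat", re.compile(r"[-+]\d*\.\d+")),
    ("float",     re.compile(r"\d*\.\d+")),
    ("signint",   re.compile(r"[-+]\d+")),
    ("int",       re.compile(r"\d+")),
]


def numeric_token(word):
    """Return a 'num:...' token for a simple numeric word, else None."""
    for name, pattern in _PATTERNS:
        if pattern.fullmatch(word):
            if name == "intpair":
                return "num:intpair"
            # Assumption: the suffix is the length of the original word.
            return "num:%s%d" % (name, len(word))
    return None


def tokenize(text):
    """Split on whitespace, replacing simple numerics by their class token."""
    for word in text.split():
        yield numeric_token(word) or word
```

With these guesses, "3629" becomes num:int4 and "$9.99" becomes
num:fracmoney5, while ordinary words pass through unchanged.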

amigo[109]spambayes% grep -A1 'all runs' cv5.txt
-> <stat> Ham scores for all runs: 16000 items; mean 0.59; sdev 4.96
-> <stat> min -1.22125e-13; median 1.3603e-11; max 100
--
-> <stat> Spam scores for all runs: 5800 items; mean 99.02; sdev 5.86
-> <stat> min 6.85483e-09; median 100; max 100
amigo[110]spambayes% grep -A1 'all runs' cv8.txt
-> <stat> Ham scores for all runs: 16000 items; mean 0.60; sdev 5.00
-> <stat> min -1.44329e-13; median 2.66842e-11; max 100
--
-> <stat> Spam scores for all runs: 5800 items; mean 99.04; sdev 5.74
-> <stat> min 7.69111e-09; median 100; max 100


cv8 now has the following tokens:

prob   nham nspam token
0.0082   27    0 num:float8
0.0088   25    0 num:signfloat6
0.0122   18    0 num:signfloat5
0.0137   16    0 num:signfloat4
0.0138  657    9 num:signint3
0.0167   13    0 num:signfloat7
0.0197   11    0 num:int12
0.0266    8    0 num:float10
0.0266    8    0 num:signfloat8
0.0302    7    0 num:signfloat9
0.0302    7    0 num:signint6
0.0413    5    0 num:signfloat10
0.0506    4    0 num:signfloat11
0.0868  265   25 num:signint5
0.0911   12    1 num:float9
0.1539  111   20 num:int10
0.1552    1    0 num:expfloat10
0.1552    1    0 num:float12
0.1552    1    0 num:signint9
0.1566   71   13 num:float7
0.1654   11    2 num:signint4
0.2085  255   67 num:int7
0.2248    4    1 num:float11
0.2656   64   23 num:int9
0.2935  164   68 num:float5
0.3138  431  197 num:float3
0.3196  194   91 num:float4
0.3550  151   83 num:int6
0.3596 1900 1067 num:int4
0.4041   65   44 num:int8
0.4255 1218  902 num:int3
0.4369  687  533 num:int5
0.4399   65   51 num:float6
0.4471    5    4 num:signint11
0.7432    4   12 num:int11
0.7752    1    4 num:signint8
0.8133  127  554 num:intpair
0.8356    1    6 num:signint10
0.9082    0    2 num:money12
0.9180    9  103 num:money5
0.9383   24  368 num:money4
0.9587    0    5 num:exclmoney12
0.9587    0    5 num:money9
0.9599    2   53 num:fracmoney9
0.9700   13  428 num:money3
0.9730    1   44 num:money10
0.9734    0    8 num:fracmoney4
0.9762    0    9 num:exclmoney4
0.9785    0   10 num:exclmoney11
0.9785    0   10 num:exclmoney9
0.9788    4  195 num:money8
0.9794    1   58 num:exclmoney8
0.9796    4  203 num:money6
0.9833    0   13 num:money11
0.9863    0   16 num:exclmoney5
0.9900    4  417 num:fracmoney6
0.9904    0   23 num:exclmoney10
0.9912    2  249 num:fracmoney7
0.9920    2  274 num:fracmoney5
0.9933    0   33 num:fracmoney8
0.9937    0   35 num:exclmoney6
0.9954    1  262 num:money7
0.9956    0   51 num:exclmoney7
0.9974    0   86 num:fracmoney11
0.9975    0   89 num:fracmoney10
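
As a reading aid for the prob column: the sketch below shows one way such
values can be computed from the nham and nspam counts, using a
Robinson-style smoothed estimate. The post does not say how prob was
computed. The constants s=0.45 and x=0.5 and the equal ham and spam
training totals are assumptions, chosen because they happen to reproduce
several rows of the list above. The function name and the totals of 1000
are hypothetical.

```python
def adjusted_spam_prob(nham, nspam, total_ham, total_spam, s=0.45, x=0.5):
    """Smoothed spam probability for a token seen in nham hams and nspam spams.

    s is the weight given to the prior guess x for rarely seen tokens.
    Both defaults are assumptions, not values stated in the post.
    The token must have been seen at least once (nham + nspam > 0).
    """
    ham_ratio = nham / total_ham
    spam_ratio = nspam / total_spam
    p = spam_ratio / (ham_ratio + spam_ratio)   # raw estimate
    n = nham + nspam
    return (s * x + n * p) / (s + n)            # pull towards x when n is small


# Hypothetical equal training totals; only their ratio matters here.
for nham, nspam in [(27, 0), (1, 0), (1900, 1067), (127, 554)]:
    print(nham, nspam, round(adjusted_spam_prob(nham, nspam, 1000, 1000), 4))
```

Under these assumptions the four example rows come out as 0.0082, 0.1552,
0.3596 and 0.8133, matching num:float8, num:expfloat10, num:int4 and
num:intpair in the list.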

Dead end? Or is the reduction in the number of tokens significant?

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/