[Spambayes] New tokenization of the Subject line

papaDoc papaDoc@videotron.ca
Mon, 07 Oct 2002 08:17:02 -0400


Hi,

>[Remi Ricard]
>  
>
>>I tried something again.
>>
>>Most of the mail from the groups I subscribe to has [spambayes] or
>>[freesco] in its subject, i.e. "[" and "]".
>>I decided to keep these as part of the word, so my words from a subject
>>line like: Re: [Spambayes] Moving closer to Gary's ideal
>>will be
>>Re:
>>[Spambayes]
>>Moving
>>closer
>>to
>>Gary's
>>ideal
>>    
>>
>
>Two things about that:
>
>1. It's not a precise enough description to know exactly what you
>   did.  On a list with programmers, don't be afraid to show code <wink>.
>
>2. Do you think it's more likely that a spam would have "freesco"
>   than "[freesco]" in its Subject line?  Not bloody likely <wink>.
>   That is, you couldn't have picked worse examples for selling the
>   idea that this *might* help.  Indeed, that may be why it didn't
>   help.
>
>
>It's usually more fruitful to stare at mistakes made by the system, and then
>see if there's something about them in common that the tokenizer isn't
>presenting in a usable way (very clear example:  we throw away uuencoded
>pieces entirely; very muddy example:  we throw away info about how many
>times a word appears in a msg).
>

OK, here is the code.
I changed this:
subject_word_re = re.compile(r"[\w\x80-\xff$.%]+")
punctuation_run_re = re.compile(r'\W+')
to this:
subject_word_re = re.compile(r"[\w\x80-\xff\[\]$.%]+")
punctuation_run_re = re.compile(r'\W^\[^\]+')
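To see what the two pairs of patterns actually produce, here is a minimal sketch that runs them with findall() on the example subject. It only exercises the regexes and assumes nothing about the rest of the tokenizer. One thing to watch: outside a character class, "^" is a start-of-string anchor, not a negation, so I would expect the new punctuation_run_re never to match anything. The alt_punct_re pattern is my own guess at what was intended (runs of characters that are neither word characters nor brackets); it is not from the post.

```python
import re

# Patterns before the change, as quoted in this message.
old_word_re = re.compile(r"[\w\x80-\xff$.%]+")
old_punct_re = re.compile(r"\W+")

# Patterns after the change, as quoted in this message.
new_word_re = re.compile(r"[\w\x80-\xff\[\]$.%]+")
new_punct_re = re.compile(r"\W^\[^\]+")

# Hypothetical alternative (an assumption, not from the post): runs of
# characters that are neither word characters nor square brackets.
alt_punct_re = re.compile(r"[^\w\[\]]+")

subject = "Re: [Spambayes] Moving closer to Gary's ideal"

old_words = old_word_re.findall(subject)
new_words = new_word_re.findall(subject)
old_runs = old_punct_re.findall(subject)
new_runs = new_punct_re.findall(subject)
alt_runs = alt_punct_re.findall(subject)

for name, pieces in [("old words", old_words), ("new words", new_words),
                     ("old runs", old_runs), ("new runs", new_runs),
                     ("alt runs", alt_runs)]:
    print(name, pieces)
```

With the new word pattern "[Spambayes]" comes out as a single piece, while the new punctuation pattern returns an empty list, so the change also drops every punctuation-run piece from the subject.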

I did that because I found prob(subject: '[') 0.0012345 and
prob(subject: ']') 0.0012345,
and I usually have a '[' or ']' in the subject only when it contains
"[someword_from_a_mailing_list]". So
instead of having '[', 'someword_from_a_mailing_list' and ']' as three
tokens, why not use
[someword_from_a_mailing_list] as one token?


It is more likely that a ham will have [freesco] in its subject than just
freesco (in my case, at least), and I don't think
a spam will have freesco in its subject at all. (This is a clean
mailing list, he he... but it is still possible.)

And I don't want a spam with a subject like "[[[[[[New free porn
site]]]]]]" to have its '[' and ']' count
as ham.
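As a rough check of that point, the sketch below runs the changed subject_word_re over a mailing-list style subject and over the spam subject. The helper subject_words() and the ham subject are made up for illustration, and this only exercises the regex, not any later processing the tokenizer applies to each word.

```python
import re

# Subject word pattern after the change, as given earlier in this message.
subject_word_re = re.compile(r"[\w\x80-\xff\[\]$.%]+")


def subject_words(subject):
    """Return the word-like pieces of a Subject line (hypothetical helper)."""
    return subject_word_re.findall(subject)


# Made-up ham subject from a mailing list that tags its subjects.
ham_words = subject_words("[freesco] boot floppy question")

# The spam subject from the paragraph above.
spam_words = subject_words("[[[[[[New free porn site]]]]]]")

print(ham_words)
print(spam_words)
```

The brackets in the spam subject stay glued to the neighbouring words, so it yields pieces like "[[[[[[New" and "site]]]]]]" and never a bare '[' or ']', while the ham subject yields "[freesco]" as one piece.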

papaDoc


P.S. Thanks for the statistical explanation of my results :-)