[Spambayes] (no subject)

Tony Meyer tameyer at ihug.co.nz
Wed May 24 23:44:46 CEST 2006


>> The source distribution includes sb_dbexpimp.py, which can convert
>> the database to/from a CSV file.
>
> It sounds like I should change the code, compile it, and so on.

No, I didn't say anything about changing the code.  You simply  
install Python, download the SpamBayes source, and run a command-line  
tool.

> But I have no such big skills in programming. Is there another
> way to do it?

Not one that doesn't involve downloading Python & using the command  
line.  It seems pretty unlikely that there's anyone that would need  
to see the raw token counts who is unable to use command-line tools.
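
Once you have an exported file, looking at the counts takes only a few lines of Python.  This is a rough sketch, not part of SpamBayes: it assumes the export is a CSV file whose first row is the total ham and spam message counts and whose remaining rows are token, ham count, spam count.  Check your own export before relying on that layout.  The function name and the sample totals (120 and 80) are made up for illustration.

```python
import csv
import io

def read_token_counts(fileobj):
    """Read an exported token database (layout is an assumption, see above)."""
    reader = csv.reader(fileobj)
    # Assumed first row: total ham messages, total spam messages.
    nham, nspam = (int(value) for value in next(reader))
    counts = {}
    for row in reader:
        if len(row) != 3:
            continue  # skip anything that is not token,ham,spam
        token, ham, spam = row
        counts[token] = (int(ham), int(spam))
    return nham, nspam, counts

# Hypothetical sample data; in practice pass open("your_export.csv", newline="").
sample = "120,80\nauf,9,0\nbei,6,0\nein,4,0\n"
nham, nspam, counts = read_token_counts(io.StringIO(sample))
for token, (ham, spam) in sorted(counts.items()):
    print(token, ham, spam)
```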

> Here are the spam clues for one of these emails.
> This email had only the subject "Test Mail" and nothing else in the body
> (but the email provider automatically adds an advertisement to the end
> of the email).

What are you trying to achieve here?  Unless you've trained on a lot
of spam like this (i.e. "Test" or "Mail" in the subject and an ad-only
body), SpamBayes isn't going to classify it as spam.  (And, in fact,
you have trained a message with the subject "Test Mail" as *ham*!)

There are a few spammy clues in the message - particularly the URLs -
but there are many more ham clues, particularly:

> token                               spamprob         #ham  #spam
> 'auf'                               0.0238095           9      0
> 'bei'                               0.0348837           6      0
> 'noch'                              0.0348837           6      0
> 'ein'                               0.0505618           4      0
> 'gruss'                             0.0918367           2      0

These have never been seen in spam, and were in the body of the  
message, so were presumably in the advertisement.  If the  
advertisement doesn't change with each message, that means that  
you've never trained a message with the advertisement as spam - so  
SpamBayes is absolutely correct in classifying the message as ham.
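
The spamprob column above can be reproduced by hand.  The sketch below assumes the default settings (an unknown-word strength of 0.45 and an unknown-word probability of 0.5) and is a simplified version of the calculation, not the actual SpamBayes code.  The totals of 100 ham and 100 spam messages are made up; when a token has never been seen in spam they don't affect the result.

```python
# Assumed defaults for the "unknown word" adjustment.
S = 0.45  # strength given to the prior
X = 0.5   # probability assigned to a never-seen token

def spamprob(hamcount, spamcount, nham, nspam):
    """Adjusted spam probability for one token (simplified sketch)."""
    n = hamcount + spamcount
    if n == 0:
        return X  # never seen at all: fall back to the prior
    hamratio = hamcount / float(nham or 1)
    spamratio = spamcount / float(nspam or 1)
    raw = spamratio / (hamratio + spamratio)
    # Pull the raw estimate towards X; the pull weakens as n grows.
    return (S * X + n * raw) / (S + n)

# Hypothetical totals; the ham counts are the ones from the table above.
for hamcount in (9, 6, 4, 2, 1):
    print(hamcount, round(spamprob(hamcount, 0, nham=100, nspam=100), 7))
```

With no spam sightings the raw estimate is 0, so the result is simply 0.225 / (0.45 + #ham), which is why every token seen in exactly one ham message shares the same value.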

(As an aside: SpamBayes was created, for the most part, by English
speakers.  The process should still work in other whitespace-delimited
languages, but there may be a few issues.  For example,
SpamBayes ignores any tokens that are shorter than 3 characters -
which includes 'worthless' English words like "a", "be", "to", "my",
and so on.  However, many of these words are longer in German, so
perhaps performance would be better with a minimum length of 4 (or maybe
too much useful information would be lost then).  It would need
experimentation to know for sure.)
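
To illustrate the length cut-off, here is a toy whitespace tokenizer with a configurable minimum length.  It is only a sketch: the real tokenizer does far more (headers, URLs, very long words and so on), and the function name and the min_len parameter are made up for this example.  The sample text just strings together a few tokens from the clue list.

```python
def simple_tokens(text, min_len=3):
    """Split on whitespace and drop tokens shorter than min_len."""
    return [word for word in text.lower().split() if len(word) >= min_len]

# Hypothetical sample built from tokens in the clue list.
sample = "Gruss auf bei noch ein schnell"

print(simple_tokens(sample, min_len=3))
print(simple_tokens(sample, min_len=4))
```

Note that with a minimum of 4, 'auf', 'bei' and 'ein' would no longer be tokens at all, which is the kind of trade-off that would need testing.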

> 'allein?'                           0.155172            1      0
> 'aufnehmen'                         0.155172            1      0
> 'beliebteste'                       0.155172            1      0
> 'bye'                               0.155172            1      0
> 'date!'                             0.155172            1      0
> 'kontakt'                           0.155172            1      0
> 'messagegruss'                      0.155172            1      0
> 'schnell'                           0.155172            1      0
> 'singles'                           0.155172            1      0
> 'subject:Test Mail'                 0.155172            1      0
> 'url:114845986687261'               0.155172            1      0
> 'url:11512'                         0.155172            1      0
> 'url:singles'                       0.155172            1      0
> 'warten'                            0.155172            1      0

These have all been seen in a single ham message and no spam.  They  
are enough (with the others) to counter the few URL spam clues.
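
To see how the ham clues outweigh a few spam clues, here is a rough sketch of chi-squared combining, which as far as I know is what SpamBayes uses by default.  It is simplified (no underflow handling, no limit on the number of clues).  The ham probabilities are the ones listed above; the three spam clues at 0.9 are made-up stand-ins for the URL clues, since their real values aren't shown here.

```python
import math

def chi2q(x2, v):
    """Probability that a chi-squared value with v (even) degrees of
    freedom is at least x2."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5
    spam_evidence = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    ham_evidence = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (spam_evidence - ham_evidence + 1.0) / 2.0

# Ham clues from the lists above (14 tokens share the value 0.155172).
ham_clues = [0.0238095, 0.0348837, 0.0348837, 0.0505618, 0.0918367] + [0.155172] * 14
# Hypothetical spam clues standing in for the URL tokens.
spam_clues = [0.9, 0.9, 0.9]

print(combine(ham_clues + spam_clues))
```

A score near 0 means ham and a score near 1 means spam; with this many strong ham clues, a handful of spammy ones barely moves the result.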

Does this make things any clearer?  (I'm still not really sure
what you are trying to do.)

=Tony.Meyer

-- 
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.



