[Spambayes] (no subject)
tameyer at ihug.co.nz
Wed May 24 23:44:46 CEST 2006
>> The source distribution includes sb_dbexpimp.py, which can convert
>> the database to/from a CSV file.
> It sounds like I should change the code and compile it and so on.
No, I didn't say anything about changing the code. You simply
install Python, download the SpamBayes source, and run a command-line script.
> But I have no such big skills in programming. Is there another
> way to do it?
Not one that doesn't involve downloading Python & using the command
line. It seems pretty unlikely that there's anyone that would need
to see the raw token counts who is unable to use command-line tools.
> Here are the spam clues for one of these emails.
> This email had only the subject "Test Mail" and nothing else in the body
> (but the email provider automatically adds an advertisement to the end
> of the
What are you trying to achieve here? Unless you've trained on a lot
of spam like this (i.e. "Test" or "Mail" in the subject and an ad-only
body), SpamBayes isn't going to classify it as spam. (And, in fact,
you have trained a message with the subject "Test Mail" as *ham*!)
There are a few spammy clues in the message, particularly the URLs,
but there are many more ham clues, particularly:
> token spamprob #ham #spam
> 'auf' 0.0238095 9 0
> 'bei' 0.0348837 6 0
> 'noch' 0.0348837 6 0
> 'ein' 0.0505618 4 0
> 'gruss' 0.0918367 2 0
These have never been seen in spam, and were in the body of the
message, so were presumably in the advertisement. If the
advertisement doesn't change with each message, that means that
you've never trained a message with the advertisement as spam - so
SpamBayes is absolutely correct in classifying the message as ham.
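The spamprob figures quoted above are consistent with the adjustment SpamBayes applies to tokens it has seen only a few times. Here is a minimal sketch of that calculation, assuming the default option values of 0.45 for the unknown-word strength and 0.5 for the unknown-word probability. The training totals `nham` and `nspam` are hypothetical; they make no difference for a token that has never been seen in spam.

```python
# Sketch of how a token's spamprob follows from its ham/spam counts.
# Assumptions (not stated in the message above): the SpamBayes defaults
# S = 0.45 (unknown-word strength) and X = 0.5 (unknown-word probability),
# plus hypothetical training totals nham and nspam.

S = 0.45
X = 0.5

def spamprob(hamcount, spamcount, nham=100, nspam=100):
    """Estimate the spam probability of a single token."""
    n = hamcount + spamcount
    if n == 0:
        return X  # never seen at all: fall back to the neutral value
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    # Raw estimate from the training counts alone.
    p = spamratio / (hamratio + spamratio)
    # Pull the raw estimate towards X; the pull weakens as n grows.
    return (S * X + n * p) / (S + n)

# Ham counts taken from the clue table; none of these were seen in spam.
for token, ham in [("auf", 9), ("bei", 6), ("ein", 4), ("gruss", 2), ("bye", 1)]:
    print(f"{token:8} {spamprob(ham, 0):.6g}")
```

With zero spam sightings the raw estimate is 0, so the result is 0.225 / (0.45 + n): a token seen in ham once lands at about 0.155, and one seen nine times at about 0.024.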
(As an aside: SpamBayes was created, for the most part, by English
speakers. The process should still work in other whitespace-delimited
languages, but there may be a few issues. For example,
SpamBayes ignores any tokens that are fewer than 3 characters long,
which includes 'worthless' English words like "a", "be", "to", "my",
and so on. However, many of the equivalent words are longer in German, so
perhaps performance would be better with a minimum length of 4 (or maybe
too much useful information would be lost then). It would need
experimentation to know for sure.)
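To get a feel for what changing the limit would do, here is a rough sketch of a whitespace tokenizer with an adjustable minimum length. This is not the SpamBayes tokenizer; `tokenize` and `min_len` are hypothetical names used only for illustration, and the sample reuses words from this message.

```python
# Sketch for experimenting with the minimum token length.
# NOT the real SpamBayes tokenizer: tokenize() and min_len are
# hypothetical names, and only whitespace splitting is modelled.

def tokenize(text, min_len=3):
    """Split on whitespace, lowercase, and drop tokens shorter than min_len."""
    return [word for word in text.lower().split() if len(word) >= min_len]

sample = "a be to my auf bei noch ein gruss"

print(tokenize(sample, min_len=3))  # keeps 'auf', 'bei', 'noch', 'ein', 'gruss'
print(tokenize(sample, min_len=4))  # keeps only 'noch', 'gruss'
```

Raising the minimum to 4 drops 'auf', 'bei' and 'ein', which happen to be among the strongest ham clues in this message, so the trade-off is not obvious.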
> 'allein?' 0.155172 1 0
> 'aufnehmen' 0.155172 1 0
> 'beliebteste' 0.155172 1 0
> 'bye' 0.155172 1 0
> 'date!' 0.155172 1 0
> 'kontakt' 0.155172 1 0
> 'messagegruss' 0.155172 1 0
> 'schnell' 0.155172 1 0
> 'singles' 0.155172 1 0
> 'subject:Test Mail' 0.155172 1 0
> 'url:114845986687261' 0.155172 1 0
> 'url:11512' 0.155172 1 0
> 'url:singles' 0.155172 1 0
> 'warten' 0.155172 1 0
These have all been seen in a single ham message and no spam. They
are enough (with the others) to counter the few URL spam clues.
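To illustrate how many weak ham clues can outweigh a few strong spam clues, here is a simplified sketch of chi-squared combining, the scheme SpamBayes uses to merge per-token probabilities into one score. The ham values are the ones listed in this message. The spam values are hypothetical, since the actual URL clue probabilities are not shown here. The sketch also leaves out details of the real classifier, such as limiting the number of clues used.

```python
import math

def chi2Q(x2, v):
    """Probability that a chi-squared variable with v (even) degrees of
    freedom is at least x2."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs):
    """Combine per-token spamprobs into a single score between 0 and 1."""
    n = len(probs)
    if n == 0:
        return 0.5
    ln_s = sum(math.log(1.0 - p) for p in probs)
    ln_h = sum(math.log(p) for p in probs)
    s = 1.0 - chi2Q(-2.0 * ln_s, 2 * n)  # evidence for spam
    h = 1.0 - chi2Q(-2.0 * ln_h, 2 * n)  # evidence for ham
    return (s - h + 1.0) / 2.0

# Ham clues as listed in this message (14 tokens share the last value).
ham_clues = [0.0238095, 0.0348837, 0.0348837, 0.0505618, 0.0918367] + [0.155172] * 14
# HYPOTHETICAL spam clues: the real URL clue values are not given here.
spam_clues = [0.9, 0.9, 0.9]

print(combine(ham_clues + spam_clues))
```

Under these assumptions the combined score comes out far below 0.2, which is the default ham cutoff, so the message is classified as ham even though the three spam clues on their own would score as spam.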
Does this make things any more clear? (I'm still not really sure
what you are trying to do).
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.