[Spambayes] Problems with classifying as spam

Ocean Ocean at cobaltnight.com
Thu Feb 4 16:18:32 CET 2010




Okay, so in Tokenizer.py, I made this change:


---------------------------------------

#  Orig lines
# subject_word_re = re.compile(r"[\w\x80-\xff$.%]+")
# punctuation_run_re = re.compile(r'\W+')

subject_word_re = re.compile(r"[a-zA-Z0-9\x80-\xff$.%]+")
punctuation_run_re = re.compile(r'[^a-zA-Z0-9]+')

---------------------------------------
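
For anyone following along, here is a minimal sketch of what the two pairs of patterns do to the subject line from the example quoted below. It only runs findall() on the raw string; it is not the actual tokenizer code path, and the variable names are made up for illustration. It assumes Python 3 str patterns, which behave the same as the original byte-string patterns for this ASCII-only subject.

```python
import re

# Subject line from the example message quoted in this thread.
subject = "***Discount_Viagra_VXPL_Percocet*_Adderall****"

# Original patterns: \w includes the underscore, so "_" glues words together.
orig_word_re = re.compile(r"[\w\x80-\xff$.%]+")
orig_punct_re = re.compile(r"\W+")

# Modified patterns: the underscore is no longer treated as a word character.
new_word_re = re.compile(r"[a-zA-Z0-9\x80-\xff$.%]+")
new_punct_re = re.compile(r"[^a-zA-Z0-9]+")

orig_words = orig_word_re.findall(subject)
orig_punct = orig_punct_re.findall(subject)
new_words = new_word_re.findall(subject)
new_punct = new_punct_re.findall(subject)

print(orig_words)  # ['Discount_Viagra_VXPL_Percocet', '_Adderall']
print(orig_punct)  # ['***', '*', '****']
print(new_words)   # ['Discount', 'Viagra', 'VXPL', 'Percocet', 'Adderall']
print(new_punct)   # ['***', '_', '_', '_', '*_', '****']
```

Because "_" belongs to \w, the original word pattern never splits on it, while the asterisk runs are picked up by the punctuation pattern, which matches the 'subject:****' clue reported in the quoted message.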


	And now, those tokens are properly showing up as Subject tokens.
However, they're not showing up as body tokens (from the link text).

	I haven't yet figured out where the link text (as opposed to the URL
itself) is being processed, though.  So if someone could point me in the
right direction, I would appreciate it.



	That being said, if the rest of that text isn't from the web site
being linked to, then it must be hidden in the HTML part of the message.
Given how often hidden HTML text is used to thwart anti-spam apps, I'm
wondering whether the HTML portion should be skipped when a message has both
a text part and an HTML part.
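
To make the idea concrete, here is a rough sketch using only the standard library email package. It is not SpamBayes code: parts_to_scan() is a hypothetical helper, and the sample message is made up. It assumes the policy "use the text/plain parts when there are any, otherwise fall back to text/html".

```python
from email import message_from_string


def parts_to_scan(msg):
    """Hypothetical helper: prefer text/plain parts, else fall back to text/html."""
    text_parts = [p for p in msg.walk() if p.get_content_type() == "text/plain"]
    html_parts = [p for p in msg.walk() if p.get_content_type() == "text/html"]
    return text_parts or html_parts


# Made-up multipart/alternative message with filler hidden in the HTML part.
raw = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="BOUND"\n'
    "\n"
    "--BOUND\n"
    "Content-Type: text/plain\n"
    "\n"
    "visible text\n"
    "--BOUND\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>visible text</p><span style='display:none'>hidden filler</span>\n"
    "--BOUND--\n"
)

msg = message_from_string(raw)
chosen = parts_to_scan(msg)
chosen_types = [p.get_content_type() for p in chosen]
print(chosen_types)
```

The obvious trade-off is that a spammer can then put junk in the text part and the real pitch in the HTML part, so skipping HTML outright may lose useful clues.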






> 
> As an example, I received this email:
> 
> ------------------------------
> 
> Subject: ***Discount_Viagra_VXPL_Percocet*_Adderall****
> 
> Body:
> 
> <URL Link>***Discount_Viagra_VXPL_Percocet*_Adderall****!
> <Links to:> http://kashertqdum17.com/
> 
> ------------------------------
> 
> 
>         That's it.  The only text in the body of the message is that URL
> link. 
> 
> 
> There are two issues I see showing up:
> 
> 
> 1.  The subject and link text aren't being parsed properly.  Nowhere in the
> spam clues are the words "viagra", "percocet", or "adderall" showing up.
> The spam token involving the subject is "'subject:****'".  So, not only is
> SpamBayes not treating the underscores as word separators, but it's not even
> getting to the words, because it looks like it's choking on the
> asterisks.
> 
>


