[Spambayes-checkins] spambayes/spambayes tokenizer.py,1.9,1.10

Tim Peters tim_one at users.sourceforge.net
Mon May 19 19:49:26 EDT 2003


Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv8882/spambayes

Modified Files:
	tokenizer.py 
Log Message:
I dug into a small collection of Unsures that looked like blatant spam,
and discovered they were all using this kind of trick:

    Wr<!$FS|i|R3$s80sA >inkle Reduc<!$FS|i|R3$s80sA >tion

That is, disguising words by inserting HTML nonsense tags.  The tokenizer
replaced each tag with a blank, yielding the pretty useless tokens "Wr",
"inkle", "Reduc" and "tion".  Tags are now removed outright instead, so the
pieces rejoin into real words.  We previously fixed a similar problem
involving embedded HTML comments.  I should have fixed this one then too.

Cute:  these all scored at the high end of my Unsure range anyway.  Now
they're all solidly spam.
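
A rough sketch of the difference, for illustration only.  The name tag_re
and its pattern are assumptions standing in for the tokenizer's html_re,
whose actual definition isn't shown in this diff, and plain str.split()
stands in for the real tokenizing step:

```python
import re

# Hypothetical, simplified stand-in for the tokenizer's html_re.
# The real pattern is not part of this diff and is likely stricter.
tag_re = re.compile(r"<[^>]*>")

text = "Wr<!$FS|i|R3$s80sA >inkle Reduc<!$FS|i|R3$s80sA >tion"

# Old behavior: each tag becomes a blank, so the word fragments stay apart.
old_tokens = tag_re.sub(" ", text).split()

# New behavior: each tag is deleted, so the fragments rejoin.
new_tokens = tag_re.sub("", text).split()

print(old_tokens)
print(new_tokens)
```

With this simplified pattern, the first list holds the four fragments and
the second holds the two whole words the spammer was trying to hide.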


Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** tokenizer.py	24 Apr 2003 07:43:50 -0000	1.9
--- tokenizer.py	20 May 2003 01:49:23 -0000	1.10
***************
*** 914,918 ****
                  # No matching end - act as if the open
                  # tag did not exist.
!                 pushretained(text[start:])                
                  break
              dummy, i = m.span()
--- 914,918 ----
                  # No matching end - act as if the open
                  # tag did not exist.
!                 pushretained(text[start:])
                  break
              dummy, i = m.span()
***************
*** 1376,1380 ****
              # Remove HTML/XML tags.  Also &nbsp;.
              text = text.replace('&nbsp;', ' ')
!             text = html_re.sub(' ', text)
  
              # Tokenize everything in the body.
--- 1376,1385 ----
              # Remove HTML/XML tags.  Also &nbsp;.
              text = text.replace('&nbsp;', ' ')
!             # It's important to eliminate HTML tags rather than, e.g.,
!             # replace them with a blank (as this code used to do), else
!             # simple tricks like
!             #    Wr<!$FS|i|R3$s80sA >inkle Reduc<!$FS|i|R3$s80sA >tion
!             # can be used to disguise words.
!             text = html_re.sub('', text)
  
              # Tokenize everything in the body.




