[Spambayes-checkins] spambayes/spambayes tokenizer.py,1.11,1.12

Skip Montanaro montanaro at users.sourceforge.net
Wed Jun 18 09:09:01 EDT 2003


Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv8764

Modified Files:
	tokenizer.py 
Log Message:
<comment>...</comment> is a Microsoft alternative spelling of <!-- ... -->.
I saw a message which used

    Via<comment>6q5r7</comment>gra

to hide Viagra.  The usual HTML tag stripping didn't remove the nonsense
token.


Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** tokenizer.py	20 May 2003 15:07:34 -0000	1.11
--- tokenizer.py	18 Jun 2003 15:08:59 -0000	1.12
***************
*** 1005,1010 ****
  class CommentStripper(Stripper):
      def __init__(self):
!         Stripper.__init__(self, re.compile(r"<!--").search,
!                                 re.compile(r"-->").search)
  
  crack_html_comment = CommentStripper().analyze
--- 1005,1011 ----
  class CommentStripper(Stripper):
      def __init__(self):
!         Stripper.__init__(self,
!                           re.compile(r"<!--|<\s*comment\s*[^>]*>").search,
!                           re.compile(r"-->|</comment>").search)
  
  crack_html_comment = CommentStripper().analyze
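
To see the effect of the two new patterns in isolation, here is a minimal
standalone sketch. It is not the Stripper class from tokenizer.py:
strip_comments is a hypothetical helper written for this illustration, and
only the two regular expressions are copied from revision 1.12. How it
treats an unterminated comment is a choice made for the sketch, since the
diff does not show what Stripper does in that case.

```python
import re

# Start/end patterns copied from revision 1.12 of tokenizer.py.
COMMENT_START = re.compile(r"<!--|<\s*comment\s*[^>]*>")
COMMENT_END = re.compile(r"-->|</comment>")


def strip_comments(text):
    """Remove each start..end comment span from text.

    Hypothetical helper for illustration only; not part of SpamBayes.
    """
    pieces = []
    pos = 0
    while True:
        start = COMMENT_START.search(text, pos)
        if start is None:
            # No more comment openers: keep the remaining text.
            pieces.append(text[pos:])
            break
        # Keep the text in front of the opener.
        pieces.append(text[pos:start.start()])
        end = COMMENT_END.search(text, start.end())
        if end is None:
            # Unterminated comment: this sketch drops only the opener
            # and keeps scanning after it (an assumption of the sketch).
            pos = start.end()
        else:
            # Skip everything up to and including the closer.
            pos = end.end()
    return "".join(pieces)


print(strip_comments("Via<comment>6q5r7</comment>gra"))
```

With the obfuscated word from the log message as input, the two halves are
rejoined once the comment span is dropped, so a later tokenizing pass sees
the word whole instead of the nonsense token.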




