[Spambayes-checkins] spambayes/spambayes tokenizer.py,1.11,1.12
Skip Montanaro
montanaro at users.sourceforge.net
Wed Jun 18 09:09:01 EDT 2003
Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv8764
Modified Files:
tokenizer.py
Log Message:
<comment>...</comment> is a Microsoft alternative spelling of <!-- ... -->.
Saw a message which used
Via<comment>6q5r7</comment>gra
to hide Viagra. The usual HTML tag stripping didn't remove the nonsense
token.
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** tokenizer.py 20 May 2003 15:07:34 -0000 1.11
--- tokenizer.py 18 Jun 2003 15:08:59 -0000 1.12
***************
*** 1005,1010 ****
  class CommentStripper(Stripper):
      def __init__(self):
!         Stripper.__init__(self, re.compile(r"<!--").search,
!                                 re.compile(r"-->").search)
  crack_html_comment = CommentStripper().analyze
--- 1005,1011 ----
  class CommentStripper(Stripper):
      def __init__(self):
!         Stripper.__init__(self,
!                           re.compile(r"<!--|<\s*comment\s*[^>]*>").search,
!                           re.compile(r"-->|</comment>").search)
  crack_html_comment = CommentStripper().analyze
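
For illustration only, here is a minimal sketch of what the widened patterns
in revision 1.12 match. The two regexes are copied from the diff; the
strip_comments helper is hypothetical and is not the Stripper class from
tokenizer.py, whose handling of unterminated comments may differ from the
assumption made here.

```python
import re

# Start/end patterns copied from revision 1.12 in the diff.
START = re.compile(r"<!--|<\s*comment\s*[^>]*>")
END = re.compile(r"-->|</comment>")


def strip_comments(text):
    """Hypothetical helper: drop each span from a start match to the next end match."""
    pieces = []
    pos = 0
    while True:
        start = START.search(text, pos)
        if not start:
            # No further comment openers: keep the remaining text.
            pieces.append(text[pos:])
            break
        pieces.append(text[pos:start.start()])
        end = END.search(text, start.end())
        if not end:
            # Assumption for this sketch: an unterminated comment is left as-is.
            pieces.append(text[start.start():])
            break
        pos = end.end()
    return "".join(pieces)


print(strip_comments("Via<comment>6q5r7</comment>gra"))  # -> Viagra
print(strip_comments("Via<!-- 6q5r7 -->gra"))            # -> Viagra
```

With only the revision 1.11 patterns (<!-- and -->), the first input would
pass through unchanged, leaving the 6q5r7 text in place after tag stripping.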