[Spambayes-checkins] spambayes tokenizer.py,1.57,1.58

Tim Peters tim_one@users.sourceforge.net
Thu Oct 31 06:42:51 2002


Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv30231

Modified Files:
	tokenizer.py 
Log Message:
A new mini-phase of body tokenization scours HTML for common virus clues,
variations of

    <script    </script
    <iframe    </iframe
    src=cid:
    height=0   width=0

I'm seeing a lot of this in my personal email lately, and it usually ends
up in my Unsure folder because the msgs have almost nothing in them
except for a bit of triggering HTML.  Adding this stuff almost always
scores them as solid spam now, and had no effect on my c.l.py test
(no change in FP or FN rates, insignificant improvement in Unsure rate):

filename:       cv    tcap
ham:spam:  20000:14000  20000:14000
fp total:        2       2
fp %:         0.01    0.01
fn total:        0       0
fn %:         0.00    0.00
unsure t:       97      96
unsure %:     0.29    0.28
real cost:  $39.40  $39.20
best cost:  $26.80  $26.80
h mean:       0.26    0.27
h sdev:       2.89    2.90
s mean:      99.94   99.94
s sdev:       1.44    1.44
mean diff:   99.68   99.67
k:           23.02   22.97
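
For illustration, here is a minimal sketch of what the new scan yields on a
made-up message body.  The regexp and generator are copied from the diff
below; the sample string and the list comprehension standing in for the
tokenizer's yield loop are assumptions, not part of the checkin.  The
sample is lowercase on the assumption that the body has already been
case-normalized, since the regexp is case-sensitive.

```python
import re

# Regexp copied from the diff below: script/iframe open or close tags,
# cid: image sources, and zero height/width attributes.
virus_re = re.compile(r"""
    < /? \s* (?: script | iframe) \b
|   \b src= ['"]? cid:
|   \b (?: height | width) = ['"]? 0
""", re.VERBOSE)

def find_html_virus_clues(text):
    for bingo in virus_re.findall(text):
        yield bingo

# Hypothetical body, invented for this example.
sample = '<iframe src=cid:abc height=0 width=0></iframe>'

# Stand-in for the tokenizer loop that prefixes each clue with "virus:".
tokens = ["virus:%s" % t for t in find_html_virus_clues(sample)]
print(tokens)
```

Each match becomes its own token, so a nearly empty msg like this one
still hands the classifier five distinct clues.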


Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.57
retrieving revision 1.58
diff -C2 -d -r1.57 -r1.58
*** tokenizer.py	29 Oct 2002 03:43:58 -0000	1.57
--- tokenizer.py	31 Oct 2002 06:42:48 -0000	1.58
***************
*** 948,951 ****
--- 948,967 ----
      return ''.join(new_text), clues
  
+ # Scan HTML for constructs often seen in viruses and worms.
+ # <script  </script
+ # <iframe  </iframe
+ # src=cid:
+ # height=0  width=0
+ 
+ virus_re = re.compile(r"""
+     < /? \s* (?: script | iframe) \b
+ |   \b src= ['"]? cid:
+ |   \b (?: height | width) = ['"]? 0
+ """, re.VERBOSE)
+ 
+ def find_html_virus_clues(text):
+     for bingo in virus_re.findall(text):
+         yield bingo
+ 
  class Tokenizer:
  
***************
*** 1219,1222 ****
--- 1235,1241 ----
              for t in tokens:
                  yield t
+ 
+             for t in find_html_virus_clues(text):
+                 yield "virus:%s" % t
  
              # Remove HTML/XML tags.  Also &nbsp;.