[Spambayes-checkins] spambayes/spambayes tokenizer.py,1.10,1.11

Tim Peters tim_one at users.sourceforge.net
Tue May 20 09:07:38 EDT 2003


Update of /cvsroot/spambayes/spambayes/spambayes
In directory sc8-pr-cvs1:/tmp/cvs-serv16137/spambayes

Modified Files:
	tokenizer.py 
Log Message:
Digging into a pile of high-scoring Unsures showed this trick:

    yo&#117;r se<!XE>p&#116;ic sys&#116;em

as a way to disguise "your septic system".  Bite the bullet and decode
numeric character entities.  Also replace <p> and <br> tags with single
blanks, since browsers break text visually when they see one of these.
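
The effect of this change on the example above can be sketched as follows. This is a standalone illustration in Python 3, not the tokenizer itself: the two regexps and the replacer mirror the ones added in this check-in, while simple_tag_re and normalize() are hypothetical stand-ins (the real html_re is not shown in this diff and is assumed to be more careful).

```python
import re

# Mirrors the pattern added in this check-in: decimal entities like &#117;
numeric_entity_re = re.compile(r'&#(\d+);')

def numeric_entity_replacer(m):
    # Out-of-range code points become '?' instead of raising.
    try:
        return chr(int(m.group(1)))
    except (ValueError, OverflowError):
        return '?'

# Mirrors the pattern added in this check-in: &nbsp;, <p> and <br>
breaking_entity_re = re.compile(r"""
    &nbsp;
|   < (?: p
      |   br
      )
    >
""", re.VERBOSE)

# Hypothetical, simplified stand-in for the tokenizer's html_re.
simple_tag_re = re.compile(r'<[^>]*>')

def normalize(text):
    """Hypothetical helper applying the steps in the order the diff uses."""
    text = numeric_entity_re.sub(numeric_entity_replacer, text)
    text = text.lower()
    text = breaking_entity_re.sub(' ', text)   # visual breaks become blanks
    return simple_tag_re.sub('', text)         # all other tags vanish

print(normalize("yo&#117;r se<!XE>p&#116;ic sys&#116;em"))
```

With these assumptions the disguised phrase comes out as "your septic system", while something like "free<br>money" keeps its word boundary and becomes "free money".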


Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.10
retrieving revision 1.11
diff -C2 -d -r1.10 -r1.11
*** tokenizer.py	20 May 2003 01:49:23 -0000	1.10
--- tokenizer.py	20 May 2003 15:07:34 -0000	1.11
***************
*** 1027,1030 ****
--- 1027,1048 ----
          yield bingo
  
+ 
+ 
+ numeric_entity_re = re.compile(r'&#(\d+);')
+ def numeric_entity_replacer(m):
+     try:
+         return chr(int(m.group(1)))
+     except:
+         return '?'
+ 
+ 
+ breaking_entity_re = re.compile(r"""
+     &nbsp;
+ |   < (?: p
+       |   br
+       )
+     >
+ """, re.VERBOSE)
+ 
  class Tokenizer:
  
***************
*** 1354,1357 ****
--- 1372,1379 ----
                  continue
  
+             # Replace numeric character entities (like &#97; for the letter
+             # 'a').
+             text = numeric_entity_re.sub(numeric_entity_replacer, text)
+ 
              # Normalize case.
              text = text.lower()
***************
*** 1374,1384 ****
                      yield t
  
!             # Remove HTML/XML tags.  Also &nbsp;.
!             text = text.replace('&nbsp;', ' ')
              # It's important to eliminate HTML tags rather than, e.g.,
              # replace them with a blank (as this code used to do), else
              # simple tricks like
              #    Wr<!$FS|i|R3$s80sA >inkle Reduc<!$FS|i|R3$s80sA >tion
!             # can be used to disguise words.
              text = html_re.sub('', text)
  
--- 1396,1409 ----
                      yield t
  
!             # Remove HTML/XML tags.  Also &nbsp;.  <br> and <p> tags should
!             # create a space too.
!             text = breaking_entity_re.sub(' ', text)
              # It's important to eliminate HTML tags rather than, e.g.,
              # replace them with a blank (as this code used to do), else
              # simple tricks like
              #    Wr<!$FS|i|R3$s80sA >inkle Reduc<!$FS|i|R3$s80sA >tion
!             # can be used to disguise words.  <br> and <p> were special-
!             # cased just above (because browsers break text on those,
!             # they can't be used to hide words effectively).
              text = html_re.sub('', text)
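
The comment's point about eliminating tags rather than blanking them can be checked with a small sketch. It is written in Python 3 and uses a hypothetical, simplified tag pattern (simple_tag_re) in place of the real html_re, which this diff does not show.

```python
import re

# Hypothetical, simplified stand-in for the tokenizer's html_re.
simple_tag_re = re.compile(r'<[^>]*>')

disguised = "Wr<!$FS|i|R3$s80sA >inkle Reduc<!$FS|i|R3$s80sA >tion"

# Removing the tags rejoins the fragments into the words a reader sees.
removed = simple_tag_re.sub('', disguised).split()

# Replacing the tags with blanks leaves the words split into fragments.
blanked = simple_tag_re.sub(' ', disguised).split()

print(removed)
print(blanked)
```

Under these assumptions, removal yields the tokens "Wrinkle" and "Reduction", whereas blanking yields "Wr", "inkle", "Reduc" and "tion", which is the disguise the comment warns about.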
  




