[Spambayes-checkins] spambayes tokenizer.py,1.19,1.20

Tim Peters tim_one@users.sourceforge.net
Thu, 12 Sep 2002 16:59:08 -0700


Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv18626

Modified Files:
	tokenizer.py 
Log Message:
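crack_urls():  Simpler tagging of embedded URLs (http etc).  Every URL
chunk now gets a plain "url:" prefix instead of one built from the protocol
and the chunk's position in the path.  Test results show that the fine
distinctions being drawn were a waste of code: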

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.025  won    -50.00%
    0.000  0.000  tied
    0.025  0.025  tied
    0.000  0.000  tied
    0.075  0.075  tied
    0.025  0.025  tied
    0.025  0.025  tied
    0.000  0.000  tied
    0.050  0.025  won    -50.00%
    0.000  0.000  tied
    0.025  0.025  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.050  tied
    0.025  0.025  tied
    0.000  0.000  tied
    0.025  0.025  tied
    0.050  0.025  won    -50.00%

won   3 times
tied 17 times
lost  0 times

total unique fp went from 8 to 8 tied

false negative percentages
    0.218  0.218  tied
    0.364  0.364  tied
    0.291  0.327  lost   +12.37%
    0.509  0.545  lost    +7.07%
    0.400  0.400  tied
    0.218  0.218  tied
    0.218  0.218  tied
    0.582  0.545  won     -6.36%
    0.291  0.291  tied
    0.255  0.255  tied
    0.291  0.291  tied
    0.582  0.582  tied
    0.545  0.545  tied
    0.255  0.255  tied
    0.255  0.255  tied
    0.400  0.400  tied
    0.291  0.291  tied
    0.218  0.218  tied
    0.182  0.182  tied
    0.145  0.182  lost   +25.52%

won   1 times
tied 16 times
lost  3 times

total unique fn went from 86 to 87 lost    +1.16%
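
The percentage columns above are consistent with a plain relative change
computed from the printed (rounded) rates. The comparison script itself is
not shown here, so the following is only a sketch: the name compare() and the
handling of a zero baseline are assumptions made for illustration.

```python
def compare(before, after):
    """Return (verdict, percent_change) for a before/after pair of error rates.

    Hypothetical helper, not the project's comparison script.
    A lower error rate counts as a win.
    """
    if after == before:
        return "tied", 0.0
    verdict = "won" if after < before else "lost"
    if before == 0:
        # Relative change is undefined for a zero baseline; returning
        # infinity is an arbitrary choice made for this sketch.
        return verdict, float("inf")
    return verdict, round((after - before) / before * 100.0, 2)


if __name__ == "__main__":
    # Rows taken from the tables above.
    for before, after in [(0.050, 0.025), (0.291, 0.327), (0.582, 0.545), (86, 87)]:
        verdict, pct = compare(before, after)
        print("%8.3f %8.3f  %-5s %+7.2f%%" % (before, after, verdict, pct))
```

Read this way, the single extra unique false negative (86 to 87) is the only net cost against three runs with a lower false positive rate.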


Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.19
retrieving revision 1.20
diff -C2 -d -r1.19 -r1.20
*** tokenizer.py	12 Sep 2002 04:19:38 -0000	1.19
--- tokenizer.py	12 Sep 2002 23:59:06 -0000	1.20
***************
*** 802,809 ****
          while guts and guts[-1] in '.:?!/':
              guts = guts[:-1]
!         for i, piece in enumerate(guts.split('/')):
!             prefix = "%s%s:" % (proto, i < 2 and str(i) or '>1')
              for chunk in urlsep_re.split(piece):
!                 pushclue(prefix + chunk)
  
          i = end
--- 802,808 ----
          while guts and guts[-1] in '.:?!/':
              guts = guts[:-1]
!         for piece in guts.split('/'):
              for chunk in urlsep_re.split(piece):
!                 pushclue("url:" + chunk)
  
          i = end
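
To make the effect of the change concrete, here is a standalone sketch of the
two tagging schemes, written for modern Python. It is not the module's code:
URLSEP_RE is an assumed stand-in for urlsep_re (whose definition is not part
of this diff), the functions return a list instead of calling pushclue(), and
the names tag_old() and tag_new() are invented for illustration.

```python
import re

# Assumed stand-in for the module's urlsep_re; the real pattern is not
# shown in the diff above.
URLSEP_RE = re.compile(r"[;?:@&=+,$.]")


def strip_trailing(guts):
    """Drop trailing punctuation, as in the unchanged context lines."""
    while guts and guts[-1] in '.:?!/':
        guts = guts[:-1]
    return guts


def tag_old(proto, guts):
    """Revision 1.19 scheme: prefix encodes the protocol and path position."""
    clues = []
    for i, piece in enumerate(strip_trailing(guts).split('/')):
        # Positions 0 and 1 get their own tag; everything deeper shares '>1'.
        prefix = "%s%s:" % (proto, str(i) if i < 2 else '>1')
        for chunk in URLSEP_RE.split(piece):
            clues.append(prefix + chunk)
    return clues


def tag_new(guts):
    """Revision 1.20 scheme: every chunk gets the same "url:" prefix."""
    clues = []
    for piece in strip_trailing(guts).split('/'):
        for chunk in URLSEP_RE.split(piece):
            clues.append("url:" + chunk)
    return clues


if __name__ == "__main__":
    guts = "www.example.com/a/b.html."
    print(tag_old("http", guts))
    print(tag_new(guts))
```

With the assumed separator pattern, the example input yields tokens such as
"http0:www" and "http>1:html" under the old scheme and "url:www" and
"url:html" under the new one, so the same chunk seen at different depths or
under different protocols now maps to a single token.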