[Spambayes-checkins] spambayes tokenizer.py,1.19,1.20
Tim Peters
tim_one@users.sourceforge.net
Thu, 12 Sep 2002 16:59:08 -0700
Update of /cvsroot/spambayes/spambayes
In directory usw-pr-cvs1:/tmp/cvs-serv18626
Modified Files:
tokenizer.py
Log Message:
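crack_urls(): Simpler tagging of embedded URLs (http etc). Every chunk
now gets a plain "url:" prefix instead of a prefix built from the protocol
and the path depth. Test results show that the fine distinctions being
drawn were a waste of code: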
false positive percentages (before, then after this change)
0.000 0.000 tied
0.000 0.000 tied
0.050 0.025 won -50.00%
0.000 0.000 tied
0.025 0.025 tied
0.000 0.000 tied
0.075 0.075 tied
0.025 0.025 tied
0.025 0.025 tied
0.000 0.000 tied
0.050 0.025 won -50.00%
0.000 0.000 tied
0.025 0.025 tied
0.000 0.000 tied
0.000 0.000 tied
0.050 0.050 tied
0.025 0.025 tied
0.000 0.000 tied
0.025 0.025 tied
0.050 0.025 won -50.00%
won 3 times
tied 17 times
lost 0 times
total unique fp went from 8 to 8 tied
false negative percentages (before, then after this change)
0.218 0.218 tied
0.364 0.364 tied
0.291 0.327 lost +12.37%
0.509 0.545 lost +7.07%
0.400 0.400 tied
0.218 0.218 tied
0.218 0.218 tied
0.582 0.545 won -6.36%
0.291 0.291 tied
0.255 0.255 tied
0.291 0.291 tied
0.582 0.582 tied
0.545 0.545 tied
0.255 0.255 tied
0.255 0.255 tied
0.400 0.400 tied
0.291 0.291 tied
0.218 0.218 tied
0.182 0.182 tied
0.145 0.182 lost +25.52%
won 1 times
tied 16 times
lost 3 times
total unique fn went from 86 to 87 lost +1.16%
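
The won/lost/tied tags and percentages above appear to be the relative change of the second column against the first. A minimal sketch of that arithmetic follows; the function name `compare` is hypothetical, and the handling of a zero "before" value is an assumption, since that case does not occur in the output above except as a tie.

```python
def compare(before, after):
    """Tag one before/after pair the way the listing above appears to.

    Lower error rates are better, so a decrease counts as "won".
    """
    if before == after:
        return "tied"
    verdict = "won" if after < before else "lost"
    if before == 0:
        # Assumption: no relative change is defined when starting from zero.
        return verdict
    change = (after - before) / before * 100.0
    return "%s %+.2f%%" % (verdict, change)


if __name__ == "__main__":
    # Pairs taken from the listing above.
    for pair in [(0.050, 0.025), (0.291, 0.327), (0.582, 0.545), (86, 87)]:
        print(pair, compare(*pair))
```

For example, 0.291 to 0.327 is an increase of 0.036, and 0.036 / 0.291 is about 12.37%, which matches the "lost +12.37%" line.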
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.19
retrieving revision 1.20
diff -C2 -d -r1.19 -r1.20
*** tokenizer.py 12 Sep 2002 04:19:38 -0000 1.19
--- tokenizer.py 12 Sep 2002 23:59:06 -0000 1.20
***************
*** 802,809 ****
          while guts and guts[-1] in '.:?!/':
              guts = guts[:-1]
!         for i, piece in enumerate(guts.split('/')):
!             prefix = "%s%s:" % (proto, i < 2 and str(i) or '>1')
              for chunk in urlsep_re.split(piece):
!                 pushclue(prefix + chunk)

          i = end
--- 802,808 ----
          while guts and guts[-1] in '.:?!/':
              guts = guts[:-1]
!         for piece in guts.split('/'):
              for chunk in urlsep_re.split(piece):
!                 pushclue("url:" + chunk)

          i = end
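
To make the difference between the two revisions concrete, here is a standalone sketch of both tagging schemes. It is not the tokenizer's actual code: `URLSEP_RE` is an assumed stand-in for `urlsep_re` (its real pattern is not shown in this diff), and the helper names `strip_trailing_punct`, `old_clues` and `new_clues` are hypothetical.

```python
import re

# Assumed stand-in for the tokenizer's urlsep_re; the real pattern
# is not part of this diff.
URLSEP_RE = re.compile(r"[;?:@&=+,$.]")


def strip_trailing_punct(guts):
    # Same loop as the unchanged context lines of the diff.
    while guts and guts[-1] in '.:?!/':
        guts = guts[:-1]
    return guts


def old_clues(proto, guts):
    """Rev 1.19 scheme: prefix encodes protocol and path depth (0, 1, >1)."""
    clues = []
    for i, piece in enumerate(strip_trailing_punct(guts).split('/')):
        prefix = "%s%s:" % (proto, str(i) if i < 2 else '>1')
        for chunk in URLSEP_RE.split(piece):
            clues.append(prefix + chunk)
    return clues


def new_clues(guts):
    """Rev 1.20 scheme: every chunk gets the same "url:" prefix."""
    clues = []
    for piece in strip_trailing_punct(guts).split('/'):
        for chunk in URLSEP_RE.split(piece):
            clues.append("url:" + chunk)
    return clues


if __name__ == "__main__":
    guts = "www.example.com/a/b?x=1"
    print(old_clues("http", guts))
    print(new_clues(guts))
```

Under the old scheme the same word produces a different token depending on the protocol and on where it sits in the path; under the new scheme those all collapse into a single "url:" token.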