[Spambayes] Latest spammer trick stymied - QUESTION
T. Alexander Popiel
popiel at wolfskeep.com
Mon Mar 31 16:06:06 EST 2003
In message: <3E88CD74.4050405 at parducci.net>
bill parducci <bill at parducci.net> writes:
>currently, does spambayes treat a URL as a single token or is it parsed
URLs are parsed with the following code:
| urlsep_re = re.compile(r"[;?:@&=+,$.]")
|
| class URLStripper(Stripper):
|     def __init__(self):
|         # The empty regexp matches anything at once.
|         Stripper.__init__(self, url_re.search, re.compile("").search)
|
|     def tokenize(self, m):
|         proto, guts = m.groups()
|         tokens = ["proto:" + proto]
|         pushclue = tokens.append
|
|         # Lose the trailing punctuation for casual embedding, like:
|         #     The code is at http://mystuff.org/here?  Didn't resolve.
|         # or
|         #     I found it at http://mystuff.org/there/.  Thanks!
|         assert guts
|         while guts and guts[-1] in '.:?!/':
|             guts = guts[:-1]
|         for piece in guts.split('/'):
|             for chunk in urlsep_re.split(piece):
|                 pushclue("url:" + chunk)
|         return tokens
>take the example: http://check.myspam.com/ad/junk?random=fsldkjflksj
That example would yield the tokens:
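Worked through from the code quoted above, as a minimal standalone sketch rather than the real tokenizer. The `url_re` below is a simplified stand-in (an assumption, since the actual pattern is not quoted in this message), and `url_tokens` is a hypothetical helper name that is not part of spambayes.

```python
import re

# Separator pattern copied from the quoted code.
urlsep_re = re.compile(r"[;?:@&=+,$.]")

# Simplified stand-in for spambayes' url_re (assumption: captures the
# protocol and then everything up to whitespace, quotes or angle brackets).
url_re = re.compile(r"(https?|ftp)://([^\s<>\"']+)")


def url_tokens(text):
    """Return the tokens for the first URL found in text, or [] if none."""
    m = url_re.search(text)
    if m is None:
        return []
    proto, guts = m.groups()
    tokens = ["proto:" + proto]
    # Drop trailing punctuation, as the quoted tokenize() does.
    while guts and guts[-1] in '.:?!/':
        guts = guts[:-1]
    # Split on '/' first, then on the URL separator characters.
    for piece in guts.split('/'):
        for chunk in urlsep_re.split(piece):
            tokens.append("url:" + chunk)
    return tokens


if __name__ == "__main__":
    for tok in url_tokens("http://check.myspam.com/ad/junk?random=fsldkjflksj"):
        print(tok)
```

Under those assumptions the example splits into proto:http, url:check, url:myspam, url:com, url:ad, url:junk, url:random and url:fsldkjflksj, so the hostname parts, the directories and the query pieces all become separate "url:" tokens.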
>it would seem that the most accurate way to evaluate this would be to
>parse using '/' (starting after 'http://'). that would allow spambayes
>to evaluate the domain (check.myspam.com) while giving it the ability to
>differentiate between directories (which may map to users on ISP
>systems: http://user.aol.com/niceguy vs. http://user.aol.com/spammer).
This already happens to some extent, though I think there could
be better handling of the composite hostname and directory path...
to wit, I suspect that adding the following tokens would help:
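As a purely illustrative sketch of what composite hostname and directory path tokens might look like: the token shapes and the `composite_url_tokens` helper below are assumptions made for this example, not the tokens proposed in this message and not anything that exists in spambayes.

```python
def composite_url_tokens(guts):
    """Hypothetical: one token for the whole hostname, then one per path prefix.

    `guts` is the part of the URL after "proto://", as in the quoted code.
    """
    # Ignore the query string and any trailing punctuation (assumed choice).
    guts = guts.split('?', 1)[0].rstrip('.:?!/')
    host, *dirs = guts.split('/')
    tokens = ["url:" + host]
    prefix = host
    for d in dirs:
        if not d:
            continue  # skip empty segments produced by "//"
        prefix += '/' + d
        tokens.append("url:" + prefix)
    return tokens


if __name__ == "__main__":
    print(composite_url_tokens("user.aol.com/niceguy"))
    print(composite_url_tokens("user.aol.com/spammer"))
```

With tokens of this shape, user.aol.com/niceguy and user.aol.com/spammer would share the hostname token but differ in the path token, which is the distinction asked about above. Whether that helps in practice would need testing.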
I haven't tested this yet, but I further suspect that I will have
Tim Peters' problem: my results are already good enough that I won't
be able to say anything conclusive about it.