[spambayes-dev] problems locating messages with bigrams
Toby Dickenson
tdickenson at devmail.geminidataloggers.co.uk
Tue Jan 6 13:38:45 EST 2004
On Tuesday 06 January 2004 18:21, Skip Montanaro wrote:
> Another apparently strongly hammy token (prob 0.092) had me confused for a
> bit. When I ran extractmessages.py to identify the messages containing
> 'bi:skip:w 20 skip:w 10', only two hams and two spams turned up.
I can't help with any bigram problem, but I recognise those "skip:w 20" tokens.
The tokenizer only performs its special URL handling if a URL includes the
http: prefix. For a URL that omits that prefix and starts with www, all we
get is one skip token.
I have a patch that fixes this in the url-detecting regular expression:
http://sourceforge.net/tracker/?func=detail&aid=830290&group_id=61702&atid=498105
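
To illustrate the effect, here is a minimal sketch. It is not the real tokenizer code and not the patch itself: the two regular expressions, the 12-character word cutoff, the "url:" token prefix and the tokens() helper are all simplified assumptions made up for this example. It only shows how a bare www address can fall through to the long-word rule and come out as a single "skip:w 20" token.

```python
import re

# Hypothetical, simplified patterns (assumptions, not the SpamBayes regexes).
SCHEME_ONLY = re.compile(r"(?:https?|ftp)://\S+", re.IGNORECASE)
SCHEME_OR_WWW = re.compile(r"(?:(?:https?|ftp)://|\bwww\.)\S+", re.IGNORECASE)

MAX_WORD = 12  # assumed cutoff; longer words are replaced by a skip token


def tokens(text, url_re):
    """Return URL tokens first, then tokens for the remaining words."""
    out = []

    def handle_url(match):
        # Stand-in for the special URL handling: one token per host component.
        url = match.group(0)
        host = re.sub(r"^[a-z]+://", "", url, flags=re.IGNORECASE).split("/")[0]
        out.extend("url:" + part for part in host.lower().split("."))
        return " "  # remove the URL from the text that is split into words

    rest = url_re.sub(handle_url, text)
    for word in rest.split():
        n = len(word)
        if n <= MAX_WORD:
            out.append(word)
        else:
            # Long word: keep only its first character and its length
            # rounded down to a multiple of ten.
            out.append("skip:%c %d" % (word[0], n // 10 * 10))
    return out


text = "visit www.example.com/offers/today now"
without_www = tokens(text, SCHEME_ONLY)
with_www = tokens(text, SCHEME_OR_WWW)
```

With the scheme-only pattern the 28-character address is treated as an ordinary long word, so without_www contains "skip:w 20". With the pattern that also accepts a leading www, the same address goes through the URL branch instead and no skip token is produced.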
--
Toby Dickenson