[spambayes-dev] problems locating messages with bigrams

Toby Dickenson tdickenson at devmail.geminidataloggers.co.uk
Tue Jan 6 13:38:45 EST 2004


On Tuesday 06 January 2004 18:21, Skip Montanaro wrote:

> Another apparently strongly hammy token (prob 0.092) had me confused for a
> bit.  When I ran extractmessages.py to identify the messages containing
> 'bi:skip:w 20 skip:w 10', only two hams and two spams turned up.

I can't help with any bigram problem, but I recognise those "skip:w 20" tokens.

The tokenizer only performs its special URL handling if a URL includes the 
http: prefix. A URL that omits that prefix and starts with www is treated as 
an ordinary long word, so all we get is one skip token.

I have a patch that fixes this in the url-detecting regular expression:

http://sourceforge.net/tracker/?func=detail&aid=830290&group_id=61702&atid=498105
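
To illustrate the effect, here is a rough sketch in Python. It is not the
actual tokenizer code and not the patch linked above: the names SCHEME_ONLY,
SCHEME_OR_WWW, MAX_WORD_LEN and tokenize_word, the "url:" token prefix, and
the 12-character cutoff are all assumptions made for the example. It only
shows why a bare www URL collapses into a single "skip:w 20" style token
when the pattern insists on a scheme prefix.

```python
import re

# Hypothetical patterns for illustration; the real tokenizer's regex differs.
SCHEME_ONLY = re.compile(r"(?:https?|ftp)://\S+", re.IGNORECASE)
SCHEME_OR_WWW = re.compile(r"(?:(?:https?|ftp)://|www\.)\S+", re.IGNORECASE)

MAX_WORD_LEN = 12  # assumed cutoff above which a word becomes a skip token


def tokenize_word(word, url_re):
    """Return the tokens for one whitespace-delimited word."""
    if url_re.match(word):
        # Recognised as a URL: drop any scheme and split into pieces.
        rest = re.sub(r"^[a-z]+://", "", word, flags=re.IGNORECASE)
        pieces = re.split(r"[/.?&=:]+", rest.lower())
        return ["url:" + piece for piece in pieces if piece]
    if len(word) > MAX_WORD_LEN:
        # Not recognised: summarise as first character plus length
        # rounded down to a multiple of ten.
        return ["skip:%c %d" % (word[0], len(word) // 10 * 10)]
    return [word.lower()]


bare = "www.example.com/offers/now"
print(tokenize_word(bare, SCHEME_ONLY))
print(tokenize_word(bare, SCHEME_OR_WWW))
```

With the scheme-only pattern the 26-character bare URL is summarised as one
skip token, while the pattern that also accepts a leading "www." splits it
into its component pieces.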

-- 
Toby Dickenson



