[spambayes-dev] Chung-Kwei algorithm (from BBC news)

Firmicus at ankabut.net
Wed Aug 25 18:22:12 CEST 2004


Hello spambayes developers,

Heard of this yet?

Regards,

F

========

'DNA analysis' spots e-mail spam

By Jo Twist

BBC News Online science and technology staff


Few would have thought that when Crick and Watson discovered the structure of DNA, it would help in making a tool to fight spam.

But computational biologists at IBM's TJ Watson Research Center have devised an anti-spam filter based on the way scientists analyse genetic sequences.

Named after the Feng Shui character Chung-Kwei, the formula automatically learns patterns of spam vocabulary and has proved 96.5% effective at catching spam.

In tests, the filter misidentified only one message in 6,000 as spam.

Pillar of protection

Isidore Rigoutsos and Tien Huynh, at IBM's bioinformatics and pattern discovery research group, started to develop the formula - or algorithm - a little over a year ago.

They named the formula Chung-Kwei after a Feng Shui character who is usually shown carrying a bat and holding a sword behind him.

He is an important figure for those who are involved in business and have expensive goods that need protection.

Chung-Kwei grew out of another algorithm, called Teiresias, which the researchers were using for pattern discovery in computational biology, specifically in protein annotation.

Teiresias helped to determine the properties of a protein, such as function and structure, automatically and directly from its sequence string.

"Obviously algorithms that pertain to pattern discovery are applicable to a vast range of problems," Mr Rigoutsos explained to BBC News Online.

Instead of looking at strings of protein, Chung-Kwei uses Teiresias to identify strings of character sequences which appear in spam, but never in non-spam mail.

Their work, said Mr Rigoutsos, was helped by the large volume of spam which they received at their own workplace.

"We have lots of e-mails that we know are bona fide spam. If we run a pattern analysis on those, it can see letters that appear frequently.

"One of the properties of the algorithm is that it will spot two or more occurrences. It doesn't matter where it is in the message.

"If you do this, effectively you get small collections of letters so you can think of these as a vocabulary of sorts. If you have lots of data to work with, your vocabulary will be able to describe the data in a different form."

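The "vocabulary" idea in the quote above can be illustrated with a minimal Python sketch. It is not Teiresias: it simply counts fixed-length character sequences and keeps those seen at least twice anywhere in a set of spam messages. The function name `build_vocabulary`, the sequence length of 4 and the sample messages are all assumptions made for illustration.

```python
from collections import Counter

def build_vocabulary(messages, length=4, min_support=2):
    """Return character sequences seen at least min_support times in total.

    Assumption: fixed-length n-grams stand in for the variable patterns
    that a real pattern-discovery algorithm would find.
    """
    counts = Counter()
    for text in messages:
        # Position in the message does not matter, only the total count.
        for i in range(len(text) - length + 1):
            counts[text[i:i + length]] += 1
    return {gram for gram, n in counts.items() if n >= min_support}

# Hypothetical training data.
spam = ["buy cheap pills now", "cheap pills online"]
vocab = build_vocabulary(spam)
```

Here `vocab` holds sequences such as "chea" and "pill" that occur in both messages, while "buy ", which occurs only once, is left out.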
Spam training

The algorithm can be trained so that it will not be fooled by cunning replacements of "S" with "$", a common ploy used by spammers to bypass conventional e-mail filters.
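The article does not say how the patterns cope with such substitutions. One possibility, assumed here purely for illustration, is a pattern with a "don't care" position that matches any single character. The helper `pattern_to_regex` and the pattern "ca.h" are hypothetical.

```python
import re

def pattern_to_regex(pattern, wildcard="."):
    """Compile a pattern in which the wildcard matches any one character.

    Assumption: a single-character wildcard is enough for this sketch;
    every other character is matched literally.
    """
    parts = ["." if ch == wildcard else re.escape(ch) for ch in pattern]
    return re.compile("".join(parts), re.DOTALL)

rx = pattern_to_regex("ca.h")
matches = [bool(rx.search(s)) for s in ("free cash", "free ca$h", "free cat")]
```

Both "cash" and "ca$h" match the pattern, while "cat" does not, so swapping one letter for a symbol does not hide the sequence.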

The Chung-Kwei method builds up its database of known true-spam patterns and constantly adds new patterns it spots.
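A growing pattern database of this kind could be kept as running counts that are updated whenever new spam arrives, without reprocessing old messages. The sketch below assumes fixed-length character sequences rather than real Teiresias patterns, and the class name `IncrementalVocabulary` and its methods are hypothetical.

```python
from collections import Counter

class IncrementalVocabulary:
    """Running counts of character sequences seen in spam (illustrative only)."""

    def __init__(self, length=4, min_support=2):
        self.length = length
        self.min_support = min_support
        self.counts = Counter()

    def add_spam(self, messages):
        # Only the new messages are scanned; earlier counts are kept.
        for text in messages:
            for i in range(len(text) - self.length + 1):
                self.counts[text[i:i + self.length]] += 1

    def patterns(self):
        # A sequence becomes a pattern once it has been seen often enough.
        return {g for g, n in self.counts.items() if n >= self.min_support}

store = IncrementalVocabulary()
store.add_spam(["cheap pills now"])
before = store.patterns()
store.add_spam(["cheap pills online"])  # more spam arrives later
after = store.patterns()
```

After the first batch no sequence has been seen twice, so `before` is empty; the second batch pushes sequences such as "pill" over the threshold.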

It compares its vocabulary against e-mails which it knows are not spam, so that patterns also found in legitimate mail can be set aside. An incoming message is then rejected if a large proportion of it matches the remaining spam patterns.

If a received message had a lot of spam patterns in it, it was given a high score. Chung-Kwei succeeded in spotting almost 97% of junk mail.
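The scoring step described above can be sketched in Python as follows. This is a simplified illustration rather than IBM's method: patterns are fixed-length character sequences, the score is the fraction of a message's sequences that are spam-only patterns, and the names `spam_only_patterns`, `spam_score` and `THRESHOLD` as well as the sample messages are assumptions.

```python
def ngrams(text, length=4):
    """All distinct character sequences of the given length in text."""
    return {text[i:i + length] for i in range(len(text) - length + 1)}

def spam_only_patterns(spam_messages, ham_messages, length=4):
    """Sequences that appear in spam but never in known good mail."""
    spam_grams = set()
    for m in spam_messages:
        spam_grams |= ngrams(m, length)
    ham_grams = set()
    for m in ham_messages:
        ham_grams |= ngrams(m, length)
    return spam_grams - ham_grams

def spam_score(message, patterns, length=4):
    """Fraction of the message's sequences that are spam-only patterns."""
    grams = ngrams(message, length)
    if not grams:
        return 0.0
    return len(grams & patterns) / len(grams)

# Hypothetical training data and cut-off.
spam = ["cheap pills now", "cheap pills online"]
ham = ["meeting notes online"]
THRESHOLD = 0.5

patterns = spam_only_patterns(spam, ham)
junk_score = spam_score("get cheap pills", patterns)
mail_score = spam_score("notes from the meeting", patterns)
```

With this toy data `junk_score` is above the threshold and `mail_score` is zero. Sequences from "online" are not counted as spam patterns because they also occur in the good mail.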

"We experimented with large collections of e-mail. We have 66,000 training messages that are all spam and 22,000 training messages that are all 'white' [non-spam].

"To train 88,000 messages takes about 15 minutes on a normal single processor. If, an hour later, we have more spam, we can add to the collection so we keep on learning more and more."

Anti-spam tools use a variety of techniques to spot and kill junk mail, but IBM believes the Chung-Kwei algorithm to be the only anti-spam tool that uses pattern discovery in this way.

Some tools look at the route an e-mail has taken and its origins; others involve identity verification, or blacklisting and whitelisting of rejected and accepted addresses.

Others use Bayesian statistics, combining the probabilities of individual words that tend to appear in spam messages.

The system has to go through some more pilot studies and testing before it is let loose to protect inboxes.

The research was originally reported in the New Scientist magazine.

Story from BBC NEWS:
http://news.bbc.co.uk/go/pr/fr/-/2/hi/technology/3584534.stm

Published: 2004/08/25 09:38:12 GMT

© BBC MMIV


