[Gllug] DNA technique protects against 'evil' emails [New Scientist]
liamvictor at gmail.com
Thu Aug 19 14:17:36 UTC 2004
DNA technique protects against 'evil' emails
A technique originally designed to analyse DNA sequences is the latest
weapon in the war against spam. An algorithm named Chung-Kwei (after a
feng-shui talisman that protects the home against evil spirits) can
catch nearly 97 per cent of spam.
Chung-Kwei is based on the Teiresias algorithm, developed by the
bioinformatics research group at IBM's Thomas J Watson Research Center
in New York, US. Teiresias was designed to search different DNA and
amino acid sequences for recurring patterns, which often indicate
genetic structures that have an important role.
Instead of chains of characters representing DNA sequences, the
research group fed the algorithm 65,000 examples of known spam. Each
email was treated as a long, DNA-like chain of characters. Teiresias
identified six million recurring patterns in this collection.
Each pattern represented a common sequence of letters and numbers that
had appeared in more than one unsolicited message. The researchers
then ran a collection of known non-spam (dubbed "ham") through the
same process, and removed the patterns that occurred in both groups.
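The pattern-mining step described above can be sketched roughly in Python. This is not the Teiresias algorithm, which finds variable-length patterns; as a stand-in it uses fixed-length character n-grams. The function names, the n-gram length and the `min_support` threshold are all assumptions made for illustration.

```python
from collections import Counter

def ngrams(text, n):
    """Return the set of all length-n character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def spam_only_patterns(spam, ham, n=6, min_support=2):
    """Patterns seen in at least min_support spam messages and in no ham.

    Simplified stand-in for Teiresias: fixed-length n-grams only (assumption).
    """
    support = Counter()
    for message in spam:
        # A set per message, so each message counts once per pattern.
        support.update(ngrams(message, n))
    recurring = {p for p, count in support.items() if count >= min_support}

    ham_patterns = set()
    for message in ham:
        ham_patterns |= ngrams(message, n)

    # Drop anything that also shows up in legitimate mail.
    return recurring - ham_patterns

spam = ["buy cheap pills now", "cheap pills online today"]
ham = ["see you online today at noon"]
patterns = spam_only_patterns(spam, ham)
```

Here "p pill" survives because it recurs in both spam messages and never in the ham, while "online" is discarded: it appears in only one spam message and also in the ham.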
Incoming email was given a score based on how many spam patterns it
had. A long email that only had a few spammy sentences would get a
relatively low score; but one with many patterns spread across the
length of the message would score much higher. Chung-Kwei
correctly identified 64,665 of 66,697 test messages as spam, or
96.56 per cent.
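The article does not give the scoring formula. One way to get the behaviour described (a few spammy sentences in a long email score low, patterns spread across the whole message score high) is to measure what fraction of the message is covered by spam patterns. The sketch below assumes that approach; `coverage_score` and the example patterns are hypothetical.

```python
def coverage_score(message, patterns):
    """Fraction of characters in message covered by at least one pattern.

    Assumed scoring rule for illustration; not the published formula.
    """
    if not message:
        return 0.0
    covered = [False] * len(message)
    for pattern in patterns:
        if not pattern:
            continue  # an empty pattern would match everywhere
        start = message.find(pattern)
        while start != -1:
            for i in range(start, start + len(pattern)):
                covered[i] = True
            # Step by one so overlapping matches are found too.
            start = message.find(pattern, start + 1)
    return sum(covered) / len(message)

patterns = {"cheap pills", "act now"}
short_spam = "cheap pills, act now"
long_mail = "Minutes from Tuesday's meeting are attached. " * 5 + "cheap pills"
```

With these inputs the short message is almost entirely covered and scores far higher than the long one, where the same pattern makes up only a small part of the text.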
More importantly, its rate of misidentifying genuine email as spam was
just 1 in 6000 messages. Losing a single email in a torrent of spam is
a greater failing in a filter than letting the occasional spam email through.
Chung-Kwei deals with common spammer strategies to dodge
pattern-recognition schemes, such as replacing the s with a $, as in
"increa$e your $ex power", using its built-in tolerance for different,
but functionally equivalent, DNA sequences.
Just as, in genetic analysis, Teiresias could be taught that the CCC
and CCU codons both produce the same amino acid, proline, the anti-spam
system can be trained to treat $ and s as identical.
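The simplest way to picture this equivalence is to map every character to one canonical representative before matching, as sketched below. Only the $ and s pair comes from the article; the other entries in the table are assumed examples, and the article does not say how Chung-Kwei implements its equivalence classes.

```python
# Each key is treated as equivalent to its value.
# Only "$" -> "s" is from the article; the rest are assumed examples.
EQUIVALENT = str.maketrans({"$": "s", "0": "o", "@": "a"})

def canonical(text):
    """Lower-case text and collapse equivalent characters to one form."""
    return text.lower().translate(EQUIVALENT)

disguised = canonical("increa$e your $ex power")
plain = canonical("Increase your sex power")
```

Both strings reduce to the same canonical form, so a pattern learned from one would match the other.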
IBM intends to include Chung-Kwei in its commercial product, SpamGuru.
Justin Mason, who developed SpamAssassin, one of the most popular
open-source anti-spam filters, says that Chung-Kwei looks promising.
"I think there is still a lot of work to be done. But what is exciting
is not the particular algorithm, but the fact that IBM has shown there
is the entire field of bioinformatics techniques to explore in the
fight against spam."
Danny O'Brien, San Jose
Kind regards, Liam Delahunty
PS. Is it bad form to copy the article to this list? Personally I
prefer it, as I am sometimes reading offline, such as on a PDA.
Gllug mailing list - Gllug at gllug.org.uk