[Gllug] Whitelist-only spam filtering

Thu Sep 12 15:27:38 UTC 2002

Hi,

The only problems I can see with using Bayesian filters are quite
specific to my own spam-filtering requirements, and probably don't
apply to the average home user's mailbox.

First, I'm filtering shared IMAP mailboxes which receive e-mail from a
large number of unknown people on a wide range of subjects. Our main
good at positive-internet.com address, for example, appears on
letterheads, posters, magazine adverts, taxis etc., and the e-mails it
receives are often meant for any of a number of different people
within our company.

For this reason, a pattern for what we'd expect to receive would be
based upon an ever-changing mix of who we happen to be employing that
year, where and how we choose to advertise the e-mail address at any
given point, and perhaps even which products we are offering.

A similar thing would, I presume, apply if you wanted a single Bayesian
filter to handle several users' mail in a proxy-style fashion, because
it appears that for maximum effectiveness the system needs a separate
statistical database for each user/mailbox.
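
As a rough sketch of why one shared database is a problem (this is not
taken from the article's code; the MailboxStats class, the mailbox names
and the sample messages are all made up for illustration), the same
word can be innocent in one mailbox and incriminating in another:

```python
from collections import Counter, defaultdict


class MailboxStats:
    """Token counts for one mailbox (hypothetical structure)."""

    def __init__(self):
        self.spam = Counter()  # token -> number of spam messages containing it
        self.ham = Counter()   # token -> number of good messages containing it
        self.nspam = 0
        self.nham = 0

    def train(self, text, is_spam):
        # Naive whitespace tokenizer; each token counts once per message.
        tokens = set(text.lower().split())
        if is_spam:
            self.spam.update(tokens)
            self.nspam += 1
        else:
            self.ham.update(tokens)
            self.nham += 1

    def token_spam_prob(self, token):
        # Simplified per-token estimate, clamped to [0.01, 0.99].
        # Tokens never seen in this mailbox are treated as neutral (0.5).
        s = self.spam[token] / max(self.nspam, 1)
        h = self.ham[token] / max(self.nham, 1)
        if s + h == 0:
            return 0.5
        return min(0.99, max(0.01, s / (s + h)))


# One independent database per mailbox.
stats = defaultdict(MailboxStats)

stats["sales"].train("please send pricing for your hosting offer", is_spam=False)
stats["sales"].train("cheap pills offer", is_spam=True)
stats["personal"].train("lunch on friday?", is_spam=False)
stats["personal"].train("special pricing offer just for you", is_spam=True)

sales_p = stats["sales"].token_spam_prob("pricing")
personal_p = stats["personal"].token_spam_prob("pricing")
print(sales_p, personal_p)
```

Here "pricing" looks innocent in the sales mailbox and spammy in the
personal one, so merging the two sets of counts would blur both.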

Secondly, from my understanding of the article (which I did skim a
little), you have to be more proactive about actually deciding what is
spam and what is not (at first, at least) in order for the system to
begin to learn over time. With SpamAssassin (or similar software, if
there is any) the benefits of running it were seen instantly, with
virtually no configuration required.
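
To illustrate the training point, here is a minimal sketch of how I
understand the article to combine per-word probabilities: take the
fifteen most interesting ones (those furthest from 0.5) and combine
them. The function name and the sample probabilities are my own, not
from the article:

```python
import math


def combine(probs, n_interesting=15):
    """Combine per-token spam probabilities into one message score."""
    # "Interesting" means furthest from the neutral value 0.5.
    top = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n_interesting]
    prod_p = math.prod(top)                 # product of p_i
    prod_q = math.prod(1 - p for p in top)  # product of (1 - p_i)
    return prod_p / (prod_p + prod_q)


# Untrained filter: every token is unknown, so every probability is 0.5.
untrained = combine([0.5] * 20)

# After some training: two strongly spammy tokens and one mildly innocent one.
trained = combine([0.99, 0.99, 0.2])

print(untrained, trained)
```

Until someone has sorted a reasonable amount of mail by hand, every
word sits at the neutral 0.5 and the combined score stays at 0.5 too,
so the filter can't decide anything. That is the up-front effort I
mean.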

I haven't tried a Bayesian-based system though, and it does sound quite
good, so I might try it and see. As I've said before, I'm finding my
current setup 99.9% effective anyway.

The quickest way to fill your mailbox with spam, by the way, is to
register a lot of domains using your e-mail address in the various
contact fields. Wait a year or less and hey presto.

Jake.

On Thu, 2002-09-12 at 14:18, Tom Gilbert wrote:

> Then you failed to read or understand the article. Might I draw your
> attention to what Paul said in the article I pointed you to:
>   
>     To beat Bayesian filters, it would not be enough for spammers to
>     make their emails unique or to stop using individual naughty words.
>     They'd have to make their mails indistinguishable from your ordinary
>     mail. And this I think would severely constrain them. Spam is mostly
>     sales pitches, so unless your regular mail is all sales pitches,
>     spams will inevitably have a different character. And the spammers
>     would also, of course, have to change (and keep changing) their
>     whole infrastructure, because otherwise the headers would look as
>     bad to the Bayesian filters as ever, no matter what they did to the
>     message body. I don't know enough about the infrastructure that
>     spammers use to know how hard it would be to make the headers look
>     innocent, but my guess is that it would be even harder than making
>     the message look innocent.
> 
> Also, from the FAQ, which you might have missed,
> http://www.paulgraham.com/spamfaq.html:
> 
>     Once this software was available, couldn't spammers just tune their
>     spams to get through it?
> 
>     They couldn't necessarily tune their emails and still say what they
>     wanted to say. If they wanted to send you to a url that is known to
>     the filters, for example, they would find it hard to tune their way
>     around that.
> 
>     Second, tune using what? Each user's filters will be different, and
>     the innocent words will vary especially. At most, spammers will be
>     able to dilute their mails with merely neutral words, and those will
>     not tend to be much use because they won't be among the fifteen most
>     interesting.
> 
>     If the spammers did try to get most of the incriminating words out
>     of their messages, they would all have to use different euphemisms,
>     because if they all started saying "adolescents" instead of "teens",
>     then "adolescents" would start to have a high spam probability.
> 
>     Finally, even if spammers worked to get all the incriminating words
>     out of the message body, that wouldn't be enough, because in a
>     typical spam a lot of the incriminating words are in the headers.
> 
> 

-- 
Gllug mailing list  -  Gllug at linux.co.uk
http://list.ftech.net/mailman/listinfo/gllug