[sclug] Mozilla Mail junk filter not as clever as I expected

Jonathan H N Chin jc254 at newton.cam.ac.uk
Sat Oct 25 09:05:55 UTC 2003


Neil Haughton <n.a.haughton at bigfoot.com> wrote:
> Does anyone else have good experience with Mozilla Mail's Bayesian junk 
> filtering? I'm disappointed with my experiences so far.

I don't use that system, so I can't comment directly.

> The point is that there is an obvious and consistent theme to the 
> content that MM is failing to spot. I suppose that it must have sampled 
> about 500-600 so far, automatically transferred them to my trash 
> folder and subsequently deleted them. Those that are let through I mark
> as Junk and delete by hand.

I believe that it is recommended to have on the order of 10,000
messages in the training set before one expects high reliability.
Using a notspam corpus as well as a spam corpus is important.
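
To illustrate why both corpora matter, here is a minimal sketch of
token-based Bayesian scoring in Python. It is a toy, not how Mozilla Mail
or bogofilter is actually implemented; the function names, the add-one
smoothing and the equal priors are all assumptions made for the example.
Without the notspam counts there would be nothing to compare the spam
counts against.

```python
import math
from collections import Counter

def train(messages):
    """Count, for each token, how many messages it appears in."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))
    return counts

def spam_probability(text, spam_counts, n_spam, ham_counts, n_ham):
    """Naive Bayes estimate of P(spam | tokens), assuming equal priors."""
    log_odds = 0.0
    for token in set(text.lower().split()):
        # Add-one smoothing so unseen tokens never give a zero probability.
        p_spam = (spam_counts[token] + 1) / (n_spam + 2)
        p_ham = (ham_counts[token] + 1) / (n_ham + 2)
        log_odds += math.log(p_spam) - math.log(p_ham)
    return 1 / (1 + math.exp(-log_odds))

# Tiny made-up corpora; a real training set would be thousands of messages.
spam = ["cheap pills online", "cheap loans now"]
notspam = ["meeting agenda for monday", "linux user group meeting"]
spam_counts, ham_counts = train(spam), train(notspam)

p_junk = spam_probability("cheap pills", spam_counts, len(spam),
                          ham_counts, len(notspam))
p_good = spam_probability("group meeting", spam_counts, len(spam),
                          ham_counts, len(notspam))
```

With so few messages the estimates are crude, which is the point: the
per-token ratios only become trustworthy once both corpora are large.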

At the Institute, incoming mail is first filtered centrally by the
University Computing Service, which runs SpamAssassin and the MailScanner
virus-scanner and inserts tags. When a message arrives at the Institute,
I use bogofilter to classify it. I retrain about once a day using both
correctly classified and misclassified messages, plus an extra training
set made up of a selection of hand-classified messages from an assortment
of my users. My spam and notspam corpora each contain approximately
20,000 messages.
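
As an illustration only (the actual retraining scripts are not shown
here), a daily retraining job could choose the bogofilter registration
flag for each message as sketched below. The helper name and its
arguments are hypothetical. The flags are bogofilter's -s (register as
spam), -n (register as non-spam), and the combined -Ns / -Sn forms, which
first unregister the message from the wrong list; the combined forms only
make sense if the message had previously been registered under the wrong
label.

```python
def bogofilter_train_command(true_label, was_misregistered=False):
    """Return the argv for registering one message fed on stdin.

    true_label: "spam" or "notspam" (the hand-verified classification).
    was_misregistered: True if the message was earlier registered
    under the opposite label and must be moved across.
    """
    if true_label not in ("spam", "notspam"):
        raise ValueError("true_label must be 'spam' or 'notspam'")
    if true_label == "spam":
        flag = "-Ns" if was_misregistered else "-s"
    else:
        flag = "-Sn" if was_misregistered else "-n"
    # The command is only built here, not executed.
    return ["bogofilter", flag]

cmd_new_spam = bogofilter_train_command("spam")
cmd_fix_good = bogofilter_train_command("notspam", was_misregistered=True)
```

The resulting list could be passed to subprocess.run with the message
text supplied as input.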

I personally use procmail to filter my email into three inboxes (In,
Unsure, Spam), and additionally divert messages that would otherwise go
into In to other folders (e.g. mailing lists). Since setting up the
system a few months ago, I have personally seen zero false positives
(i.e. good mail in Spam) and zero false negatives (spam in In). Spam
receives more than ten times as many messages as Unsure. Over half the
messages in Unsure are spam.
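
The three-way split can be sketched as a pair of cutoffs on the
classifier's score. This is a Python illustration of the idea rather
than the procmail recipe itself, and the function name and both cutoff
values are made-up assumptions, not the settings in use here.

```python
def route(spamicity, ham_cutoff=0.1, spam_cutoff=0.9):
    """Map a score in [0, 1] to one of three inboxes.

    The cutoffs are illustrative assumptions. Anything the classifier
    is not confident about lands in Unsure for a human to check.
    """
    if not 0.0 <= spamicity <= 1.0:
        raise ValueError("spamicity must be between 0 and 1")
    if spamicity >= spam_cutoff:
        return "Spam"
    if spamicity <= ham_cutoff:
        return "In"
    return "Unsure"

boxes = [route(score) for score in (0.01, 0.5, 0.95)]
```

Widening the gap between the two cutoffs sends more mail to Unsure but
makes mistakes in In and Spam less likely.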

My users typically only use In and Spam boxes. So far I have had
reported to me only a single false positive (which did in fact look
very much like spam).

In about 1% of cases, SpamAssassin gives a positive rating (>0) to spam
that bogofilter doesn't score as 'Yes'. In another 1% of cases, bogofilter
flags as 'Yes' spam to which SpamAssassin has given a non-positive rating.
(These percentages assume I performed the calculations on my spam corpus
correctly.)
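
For anyone wanting to repeat that comparison on their own spam corpus,
the calculation amounts to the sketch below. The input format (one pair
per message: the SpamAssassin score and whether bogofilter said 'Yes')
and the function name are assumptions, and the sample data is invented
purely to exercise the code.

```python
def disagreement_rates(results):
    """Return (sa_only_pct, bogo_only_pct) for a corpus of known spam.

    results: list of (spamassassin_score, bogofilter_says_yes) pairs.
    sa_only_pct: SpamAssassin positive (>0) but bogofilter not 'Yes'.
    bogo_only_pct: bogofilter 'Yes' but SpamAssassin non-positive.
    """
    total = len(results)
    if total == 0:
        raise ValueError("corpus is empty")
    sa_only = sum(1 for sa, bogo in results if sa > 0 and not bogo)
    bogo_only = sum(1 for sa, bogo in results if sa <= 0 and bogo)
    return 100 * sa_only / total, 100 * bogo_only / total

# Invented sample: four spam messages, one disagreement each way.
sample = [(5.2, True), (3.1, False), (-0.5, True), (7.0, True)]
sa_only_pct, bogo_only_pct = disagreement_rates(sample)
```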


-jonathan

-- 
Jonathan H N Chin, 1 dan | deputy computer | Newton Institute, Cambridge, UK
<jc254 at newton.cam.ac.uk> | systems mangler | tel/fax: +44 1223 767091/330508

                "respondeo etsi mutabor" --Rosenstock-Huessy


