[Gllug] Re: www.spews.org - spamming blacklist

Nix nix at esperi.demon.co.uk
Sun Jun 8 23:18:46 UTC 2003


On 03 Jun 2003, Mike Brodbelt spake:
> SpamAssassin demonstrates quite well that netblock lookups are an
> unnecessary blunt instrument. Content based filtering and Bayesian
> analysis can remove spam more effectively

Actually it demonstrates the exact opposite. Using the statistics from
the last GA run in March (for SA 2.54, IIRC):

No net, no Bayes:

,----[ rules/STATISTICS.txt ]
| # SUMMARY for threshold 5.0:
| # Correctly non-spam: 130678  56.21%  (99.92% of non-spam corpus)
| # Correctly spam:      90057  38.74%  (88.55% of spam corpus)
| # False positives:       100  0.04%  (0.08% of nonspam,   4551 weighted)
| # False negatives:     11640  5.01%  (11.45% of spam,  39435 weighted)
| # Average score for spam:  16.4    nonspam: -1.3
| # Average for false-pos:   5.9  false-neg: 3.4
| # TOTAL:              232475  100.00%
`----

Net, no Bayes:

,----[ rules/STATISTICS-set2.txt ]
| # SUMMARY for threshold 5.0:
| # Correctly non-spam: 173760  72.63%  (99.91% of non-spam corpus)
| # Correctly spam:      61628  25.76%  (94.37% of spam corpus)
| # False positives:       160  0.07%  (0.09% of nonspam,   3545 weighted)
| # False negatives:      3680  1.54%  (5.63% of spam,  11930 weighted)
| # Average score for spam:  19.0    nonspam: -4.5
| # Average for false-pos:   5.9  false-neg: 3.2
| # TOTAL:              239228  100.00%
`----

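As a share of all mail in each run, that is 56/38% correctly classified
non-spam/spam without network tests versus 72/25% with them. The two
corpora differ in size and mix, so the per-class rates are the fairer
comparison: network tests raise the spam caught from 88.55% to 94.37%,
while false positives move only from 0.08% to 0.09% of non-spam.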

No net, Bayes:

,----[ rules/STATISTICS-set1.txt ]
| # SUMMARY for threshold 5.0:
| # Correctly non-spam:  80750  46.56%  (99.93% of non-spam corpus)
| # Correctly spam:      87718  50.58%  (94.72% of spam corpus)
| # False positives:        60  0.03%  (0.07% of nonspam,   3959 weighted)
| # False negatives:      4892  2.82%  (5.28% of spam,  15334 weighted)
| # Average score for spam:  19.7    nonspam: -1.5
| # Average for false-pos:   5.8  false-neg: 3.1
| # TOTAL:              173420  100.00%
`----

Net and Bayes:

,----[ rules/STATISTICS-set3.txt ]
| # SUMMARY for threshold 5.0:
| # Correctly non-spam:  75073  56.87%  (99.94% of non-spam corpus)
| # Correctly spam:      54359  41.18%  (95.56% of spam corpus)
| # False positives:        46  0.03%  (0.06% of nonspam,   2077 weighted)
| # False negatives:      2524  1.91%  (4.44% of spam,   8665 weighted)
| # Average score for spam:  20.2    nonspam: -4.8
| # Average for false-pos:   6.0  false-neg: 3.4
| # TOTAL:              132002  100.00%
`----

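46/50% versus 56/41% of all mail. Per class, adding network tests on top
of Bayes raises the spam caught from 94.72% to 95.56% and lowers false
positives from 0.07% to 0.06% of non-spam.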
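
For anyone who wants to check the per-class figures, here is a minimal
sketch that recomputes them from the raw counts in the four summaries
quoted above. The names RUNS and rates are made up for illustration and
are not part of SpamAssassin.

```python
# Counts copied from the four STATISTICS summaries quoted above:
# (correctly non-spam, correctly spam, false positives, false negatives)
RUNS = {
    "no net, no Bayes": (130678, 90057, 100, 11640),
    "net, no Bayes": (173760, 61628, 160, 3680),
    "no net, Bayes": (80750, 87718, 60, 4892),
    "net and Bayes": (75073, 54359, 46, 2524),
}


def rates(ham_ok, spam_ok, false_pos, false_neg):
    """Return (percent of spam caught, percent of non-spam misfiled)."""
    spam_caught = 100.0 * spam_ok / (spam_ok + false_neg)
    ham_misfiled = 100.0 * false_pos / (ham_ok + false_pos)
    return spam_caught, ham_misfiled


for name, counts in RUNS.items():
    caught, misfiled = rates(*counts)
    print(f"{name:18s} spam caught {caught:6.2f}%  "
          f"false positives {misfiled:5.2f}%")
```

Dividing by the size of each class rather than by the run total keeps
the comparison independent of how much spam each corpus happened to
contain.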

Network tests *are* worth it; the trick is to find blacklists with a
good FP ratio. SPEWS is the opposite: duncf's tests indicated that 98%
of mail caught by SPEWS was nonspam.

The SBL, for instance, is a good list: 98.9% spam as of my last
mass-check-with-network-tests. That's worth using. SPEWS is not --- at
least, not if your goal is identifying spam. (But then, the SPEWS people
admit to other goals...)
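
The "good FP ratio" test above boils down to one division per list. A
rough sketch, assuming you have per-list hit counts from your own
mass-check; the function names, the 95% cutoff and the example counts
are all hypothetical and are not the SBL or SPEWS results.

```python
def spam_share(spam_hits, ham_hits):
    """Percent of the messages a blacklist matched that were spam."""
    total = spam_hits + ham_hits
    if total == 0:
        raise ValueError("list matched no messages")
    return 100.0 * spam_hits / total


def worth_using(spam_hits, ham_hits, min_share=95.0):
    """True if the list's hits are mostly spam (cutoff is an assumption)."""
    return spam_share(spam_hits, ham_hits) >= min_share


# Hypothetical counts, for illustration only.
print(worth_using(989, 11))  # a list whose hits are nearly all spam
print(worth_using(2, 98))    # a list whose hits are nearly all non-spam
```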

-- 
`It is an unfortunate coincidence that the date locarchive.h was
 written (in hex) matches Ritchie's birthday (in octal).'
               -- Roland McGrath on the libc-alpha list

-- 
Gllug mailing list  -  Gllug at linux.co.uk
http://list.ftech.net/mailman/listinfo/gllug



