[Gllug] Procmail

Tue Feb 26 01:16:56 UTC 2002

* Harry <nmeweb at yahoo.co.uk> spake thus:
> I have started to weed on the type of mail particularly
> multipart/alternative.
> 
> > :0 :
> > * ? formail -cx"From" -cx"From:" -cx"Sender:" -cx"X-Envelope-Sender:" \
> > 	| fgrep -is -f ~/Mail/blacklist
> > ~/Mail/spam
> 
> I do like this and I have tested it with my set up and it works a treat. I
> have tried to find a list of recognised domains that spam comes from to
> utilise this as I can name a few who send me crap on a regular basis. 

hotmail.com, yahoo.com, aol.com, msn.com, earthlink.net. Esp. the three
last ones. Oh, and don't forget the co.uk clones...

> I would definitely have a look at your file especially if it is
> commented for a numpty like me.

I have tried to keep it well commented precisely because I was trying to
make something that would be simple to use for people that find the
procmail syntax a bit terse. I started it because I didn't like
spamassassin (I couldn't make it work, not the way I wanted anyway ;).

I was thinking of releasing it on my webpage or something, but I am
still hiding behind the "needs more testing" poster. I guess I'm just
afraid someone will point out some serious flaw in my recipes :)

Fact is it actually works *very* well for me. YMMV of course.

Stig

-------------- next part --------------
PATH=$HOME/bin:/usr/local/bin:/usr/bin:/bin
MAILDIR=$HOME/Mail
DEFAULT=in.mbox
LOGFILE=/dev/null
VERBOSE=off

# This is the limit at which we consider the message spam.
limit=28

# This is the version number 
version=0.0.6

# Files section.

# This is the folder that spam will go to.
spam_folder=spam.mbox

# The Message-Id: headers of messages that are considered spam are stored
# here, and used to automagically black-hole messages that reference them.
# This is useful on mailing lists where follow-ups to spam are a problem.
spam_id=spam_id

# This file is where spammer addresses (possibly partial addresses) are
# stored. Messages to/from any of these are most likely spam, so junk them.
# TODO: make the score for this option configurable as well.
blacklist=blacklist

# This file is the opposite of the one above; mail from these addresses is
# never filed as spam.
whitelist=whitelist

# Sometimes, esp. when you post to various mailing lists or use fetchmail to
# get your mail via a slow connection, you end up getting the same mail
# twice. Set this to 'yes' if you want to receive duplicated messages twice
# (or more!).
recive_duplicated=no

# This file will collect the duplicates when the above is set to 'no'.
duplicates=duplicates

# These variables are here to make the filter a bit easier to tweak; they are
# normal procmail-style regular expressions (CAVEAT: shellmetas must be
# escaped twice).

# A message that does not match any of my addresses in the header is more
# than likely spam.
my_addresses="((stigbrau@start|s-braut@online)\.no|@brautaset\.org)"

# If the message is not to me directly.
cc_to_me=5

# If a message is not addressed to me at all (this matches mailing lists such
# as debian-user, but I found out that spam on these lists is much more
# frequent and as such it does make sense to give them a head start in the
# point-gathering process).
not_to_me=10

# These words in the body of the mail are positive.
positive_words="(Stig|Brautaset|:-\))"
positive_words_score=3

# I really detest excessive punctuation. This is for the subject line.
crap_chars="(!|\\?|\\*|\\$|\||-|\"|\')"

# Same as above, but many people use '-' and '*' in signatures, so this
# regular expression is for the body.
crap_chars_body="(!|\\?|\\||\\$)"

# This is the regular expression that we should skip when looking at subject
# lines.
re_skip_string="(re:|fwd:|fw:|sv:|\[gllug\])"

# I absolutely hate html mail. Set to "0" if you don't want to strip html.
html_content_score=13

# If the subject line ends in "[aoe09]" or something similar, it had better
# not breach anything else; this is a *very* strong indication of spam.
spam_style_uniq_subject=`echo $(($limit - 1));`

# People who try to unsubscribe by mailing the list itself annoy me to
# bits.
plonk_unsubscribers=`echo $(($limit - $not_to_me + 1));`

# I don't like lots of quoted material. quote_ratio is the weight of each
# quoted line against each non-quoted line, so with 0.3 a post is penalised
# once it has more than about three quoted lines per non-quoted line. Set it
# to any value you wish. quote_score is what gets added to the message,
# regardless of how far over the ratio it is. This might change.
quote_ratio=0.3
quote_score=5
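
# Worked example for the two settings above, assuming the quote recipe
# further down adds quote_ratio for each line starting with '>' and
# subtracts 1 for each other line it counts:
#   30 quoted,  8 other lines:  0.3 * 30 -  8 =  1  (above zero, +quote_score)
#   30 quoted, 10 other lines:  0.3 * 30 - 10 = -1  (not above zero, no score)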

# More severity configure options. 
no_subject=17
all_caps_subject=13
not_plain_text=7
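
# The recipes further down also expand $crap_domains, $crap_domain_score,
# $ok_charsets, $bad_charset_score, $negative_words, $negative_words_score,
# $porn and $porn_score, none of which is set anywhere in this file as
# posted. The values below are hypothetical placeholders only, not the
# original settings; replace them with your own before relying on those
# recipes.
crap_domains="(spam-example\.invalid|bulk-example\.invalid)"
crap_domain_score=5
ok_charsets="(us-ascii|iso-8859-1)"
bad_charset_score=5
negative_words="(placeholder-negative-word)"
negative_words_score=1
porn="(placeholder-porn-word)"
porn_score=1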

# The actual rule set begins here. Little configuration should be needed from
# this point on. 

# If you want a different header to the "X-SPAM:" thingie that is default, it
# can be changed here.
xspam="X-SPAM:"

# Do not just remove this line. Because we add a newline before any of the
# extra X-SPAM: lines below, this variable needs to be set.
message="$xspam This is SpamTracker version $version by Stig Brautaset."
score=0

# Keep a backup copy of the latest 200 messages.
:0 c
backup
:0 ic
| cd backup && rm -f dummy `ls -t msg.* | sed -e 1,200d`

# We don't want to get duplicate mails.
:0 Whc: msgid.lock
* recive_duplicated ?? no
| formail -D 16384 msgid.cache
:0 a:
$duplicates

# This recipe is for adding addresses to the white/blacklists. 
:0
* $ ^To.*$my_addresses
* $ ^From.*$my_addresses
{
	# Add addresses from the subject line to the list of known good
	# senders.
	:0 i:
	* ^Subject:.*white[      ]*\/.+
	| echo $MATCH >> $whitelist

	# Add addresses from the subject line to the list of known spammers.
	:0 i:
	* ^Subject:.*black[      ]*\/.+
	| echo $MATCH >> $blacklist

	# Just plonk the thread, not the sender.
	:0 w:
	* ^Subject:.*thread
	| formail +1 -ds formail -x"Message-Id:" >> $spam_id

	# If I forward spam to myself, the sender(s) of the original mail
	# should be blacklisted, and the thread will be classified as spam.
	:0 
	* ^Subject: Fwd: 
	{
		# It should not be necessary to take a copy of the message,
		# but I have not found out how to deal with this otherwise.
		:0 c:
		| formail +1 -ds formail -cx"Message-Id:" >> $spam_id 

		:0 c:
		| formail +1 -ds formail -cx"From:" | sed -e 's/.*\( \|<\)\([[:alnum:]@._-]\+.[A-z]\+\).*/\2/' >> $blacklist

		:0 :
		| formail +1 -ds >> $spam_folder
	}
}

# First we take out mail that is an answer to a mail previously black holed,
# or mail from people on our shitlist. This saves a bit of processing, and is
# very useful on mailing lists where much of the spam is people commenting on
# or replying to spam.
:0
* 1^0 ? formail -cx"References:" -x"In-Reply-To:" | fgrep -is -f $spam_id
* 1^0 ? formail -cx"From" -cx"From:" -cx"Sender:" -cx"X-Envelope-Sender:" | fgrep -is -f $blacklist
{
	:0 i:
	* ^Message-Id:\/.+
	| echo $MATCH >> $spam_id

	:0 hf
	| formail -A"$xspam Follow-up to previously caught spam, or from a blacklisted address."

	:0 :
	$spam_folder
}

# Filter out HTML, and leave a message that it is filtered. I am forever
# grateful to Bart Schaefer for this one. 
:0
* ^Content-Type: text/html
* $ $html_content_score^0 
{
	message=`echo -e "$message\n\
$xspam HTML Content.					SpamScore: $="`
	score="$score + $="

        :0 bfW
	| (echo "[html stripped]"; lynx -dump -force_html -stdin)

	:0 ahfw
	| formail -i"Content-Type: text/plain" 
}

:0 
* ^Cc:\/.*
* $ MATCH ?? $my_addresses
* $ $cc_to_me^0 
{
	message=`echo -e "$message\n\
$xspam Message not directly to me.			SpamScore: $="`
	score="$score + $="
}
TMP=$MATCH

:0 E
* ^To:\/.*
* $ ! MATCH ?? $my_addresses
* $ $not_to_me^0 
{
	message=`echo -e "$message \n\
$xspam Message not addressed to me.			SpamScore: $="`
	score="$score + $="
}

:0 i
* ^To:\/.*
TMP=| echo "$TMP $MATCH"

# Messages with many recipients are penalised.
:0
* -5^0 
* 5^0.8 TMP ?? @
{
	message=`echo -e "$message \n\
$xspam To:/Cc: contain several addresses.		SpamScore: $="`
	score="$score + $="
}
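
# How the two weights above combine, following the w^x rule described in
# procmailsc(5) (the first match adds w, the second w*x, the third w*x*x,
# and so on), with one match per "@" found in TMP:
#   1 address:    -5 + 5            =  0    (not above zero, no penalty)
#   2 addresses:  -5 + 5 + 4        =  4
#   3 addresses:  -5 + 5 + 4 + 3.2  =  7.2  (before any rounding by procmail)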

:0 
* ^Content-Type:\/.*
* ! MATCH ?? (text/plain|multipart/signed)
* $ $not_plain_text^0  
{
	message=`echo -e "$message \n\
$xspam Format of message is not plain text.		SpamScore: $="`
	score="$score + $="
}

# If the message has no subject, or it consists entirely of spaces/tabs, it's
# likely spam, and the SpamScore is written to the header. Otherwise extract
# the subject but drop any occurrence of Re:/fwd:/SV: at the beginning of the
# line and send it to the second part of this recipe, which checks whether
# the extracted part contains any lower-case characters. If not, the SpamScore
# is written to the header. Thanks to David W. Tamkin for this technique.
:0 
* $ ! ^Subject:[	 ]*($re_skip_string[	 ]*)+\/.+
* ! ^Subject: \/.+
* $ $no_subject^0 
{
	message=`echo -e "$message \n\
$xspam No subject.					SpamScore: $="`
	score="$score + $="
}

:0 ED
* ! MATCH ?? [a-z]
* $ $all_caps_subject^0 
{
	message=`echo -e "$message \n\
$xspam No lower case characters in subject.		SpamScore: $="`
	score="$score + $="
}

# Give bad marks for more than one consecutive $crap_chars in Subject (still
# in "MATCH" from last recipe) and for multiple forwards or includes.
:0 
* $ 3^0.7 MATCH ?? $crap_chars[	 ]*$crap_chars
* $ 4^0.6 MATCH ?? $re_skip_string[	 ]*$re_skip_string
{
	message=`echo -e "$message \n\
$xspam Crap characters in subject. 			SpamScore: $="`
	score="$score + $="
}

:0
* $ $plonk_unsubscribers^0 MATCH ?? ^[	 ]*unsubscribe[	 ]*$
{
	message=`echo -e "$message \n\
$xspam \"unsubscribe\" in subject. 			SpamScore: $="`
	score="$score + $="
}

# Spammers nowadays seem to like to make their subject lines unique by
# appending a number/string in square brackets at the end of their subject
# fields. Oh well, a strong indicator of spam, and easy to spot to boot ;)
:0
* $ $spam_style_uniq_subject^0 MATCH ?? ( )\[[A-z0-9]+\]$
{
	message=`echo -e "$message \n\
$xspam Subject has spam-style unique header.	 	SpamScore: $="`
	score="$score + $="
}

:0 
* $ $crap_domain_score^0 $crap_domains
{
	message=`echo -e "$message \n\
$xspam Header has crap domain in it.			SpamScore: $="`
	score="$score + $="
}

:0 HB
* charset=()\/.+
* $ ! MATCH ?? $ok_charsets
* $ $bad_charset_score^0 
{
	message=`echo -e "$message \n\
$xspam Reference to deprecated charset.		SpamScore: $="`
	score="$score + $="
}

# Do some checks in the body as well, but only if it is in an understandable
# format. 
:0 B
* ! ^Content-Disposition: attachment
* 2^1 $ $crap_chars_body[	 ]*$crap_chars_body[	 ]*$crap_chars_body
{
	message=`echo -e "$message \n\
$xspam Adjacent crap characters in body.		SpamScore: $="`
	score="$score + $="
}

:0 B
* $ $quote_ratio^1 ^>
* -1^1 ^[^>]
{
	message=`echo -e "$message \n\
$xspam Quoted lines exceeds 1:$quote_ratio limit.		SpamScore: $quote_score"`
	score="$score + $quote_score"
}

# Give bad marks for mentioning negative words.
:0 B
* $ $negative_words_score^1 $negative_words
* 5^0
* -5^1 > 2000
{
	message=`echo -e "$message \n\
$xspam Hits for negative words.			SpamScore: $="`
	score="$score + $="
}

# I know where to get porn. Don't bother me with your crap.
:0 B
* $ $porn_score^1 $porn
* 5^0
* -5^1 > 2000
{
	message=`echo -e "$message \n\
$xspam Hits for porn.					SpamScore: $="`
	score="$score + $="
}

:0 B
* $ $positive_words_score^1 $positive_words
* -5^1 > 1000
{
	message=`echo -e "$message \n\
$xspam BODY: Positive words hit.			Score:    -$="`
	score="$score - $="
}

:0 f
| formail -A"$message"

score=`echo $(( $score ));`
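# For illustration (the numbers are made up): if the recipes above left
# score as "0 + 13 + 5 - 3", the arithmetic expansion turns it into 15.
# That is below the default limit of 28, so 15 - 28 is not above zero and
# the block below is skipped.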
# Check whether we went over the limit, and put a report in if we did.
:0 
* $ $score^0
* $ -$limit^0
{
	:0 f
	| formail -A "$xspam Total SpamScore ($score) exceeded limit ($limit) by $= points."

	# If the message is spam, then record its message id so we can plonk
	# follow-ups to it.
	:0 i:
	* ^Message-Id:\/.+
	DUMMY=| echo $MATCH >> $spam_id

	# Finally, deliver the message in the spam folder if sender is not on
	# whitelist.
	:0 a:
	* ! ? formail -cx"From:" | fgrep -is -f $whitelist
	$spam_folder

	:0 Ef
	| formail -A "$xspam Message indicates spam, but sender is on whitelist."
}