[Gllug] Procmail
Stig Brautaset
stigbrau at start.no
Tue Feb 26 01:16:56 UTC 2002
* Harry <nmeweb at yahoo.co.uk> spake thus:
> I have started to weed on the type of mail particularly
> multipart/alternative.
>
> > :0 :
> > * ? formail -cx"From" -cx"From:" -cx"Sender:" -cx"X-Envelope-Sender:" \
> > | fgrep -is -f ~/Mail/blacklist
> > ~/Mail/spam
>
> I do like this and I have tested it with my set up and it works a treat. I
> have tried to find a list of recognised domains that spam comes from to
> utilise this as I can name a few who send me crap on a regular basis.
hotmail.com, yahoo.com, aol.com, msn.com, earthlink.net. Esp. the three
last ones. Oh, and don't forget the co.uk clones...
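
For anyone who wants to try the blacklist lookup outside procmail first, the quoted recipe boils down to roughly the following. This is only a sketch: `is_blacklisted` is a name made up for the illustration, and it uses plain sed and grep instead of formail, so the header extraction is cruder than what formail does (folded header lines are not handled, for instance).

```shell
#!/bin/sh
# Sketch only: succeed if one of the sender headers of the message on stdin
# contains any of the fixed strings listed in the blacklist file ($1).
is_blacklisted() {
    blacklist=$1
    # Keep the header only (everything up to the first empty line), pick
    # out the sender-ish lines, and look for any blacklist entry in them.
    sed '/^$/q' |
        grep -i -E '^(From |From:|Sender:|X-Envelope-Sender:)' |
        grep -F -i -q -f "$blacklist"
}
```

Something like `is_blacklisted ~/Mail/blacklist < some-message` then returns success for a listed sender. Keep blank lines out of the blacklist file: an empty pattern makes grep match every line, which would send everything to the spam folder.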
> I would definitely have a look at your file especially if it is
> commented for a numpty like me.
I have tried to keep it well commented precisely because I was trying to
make something that would be simple to use for people that find the
procmail syntax a bit terse. I started it because I didn't like
spamassassin (I couldn't make it work (the way I wanted anyway ;)).
I was thinking of releasing it on my webpage or something, but I am
still hiding behind the "needs more testing" poster. I guess I'm just
afraid someone will point out some serious flaw in my recipes :)
Fact is it actually works *very* well for me. YMMV of course.
Stig
-------------- next part --------------
PATH=$HOME/bin:/usr/local/bin:/usr/bin:/bin
MAILDIR=$HOME/Mail
DEFAULT=in.mbox
LOGFILE=/dev/null
VERBOSE=off
# This is the limit at which we consider the message spam.
limit=28
# This is the version number
version=0.0.6
# Files section.
# This is the folder that spam will go to.
spam_folder=spam.mbox
# The Message-Id: headers of messages that are considered spam are stored
# here, and used to automagically black-hole messages that reference them. This
# is useful on mailing lists where follow-ups to spam are a problem.
spam_id=spam_id
# This file is where spammer addresses (possibly partial addresses) are
# stored. Messages to/from any of these are most likely spam, so junk them.
# TODO: make the score for this option configurable as well.
blacklist=blacklist
# This file is the opposite of the one above; mail from these addresses is
# never filed as spam.
whitelist=whitelist
# Sometimes, esp. when you post to various mailing lists or use fetchmail to
# get your mail via a slow connection, you end up getting the same mail
# twice. Set this to 'yes' if you want to receive duplicated messages twice
# (or more!).
recive_duplicated=no
# This file will collect the duplicates when the above is set to 'no'.
duplicates=duplicates
# These variables are here to make the filter a bit easier to tweak; they are
# normal procmail-style regular expressions (CAVEAT: shellmetas must be
# escaped twice).
# A message that does not match any of my addresses in the header is more
# than likely spam.
my_addresses="((stigbrau@start|s-braut@online)\.no|@brautaset\.org)"
# If the message is not to me directly.
cc_to_me=5
# If a message is not addressed to me at all (matches mailing lists such as
# debian-user, but I found out that spam on these lists is much more
# frequent and as such it does make sense to give them a head start in the
# point-gathering process).
not_to_me=10
# Messages from these domains are often spam.
crap_domains="(aol|msn|hotmail|earthlink)\.(co(m|\.uk)|net)"
crap_domain_score=9
# These words in the body of the mail count in its favour.
positive_words="(Stig|Brautaset|:-\))"
positive_words_score=3
# Char sets different than these are not appreciated.
ok_charsets="(latin|iso-?8859|Windows-1252|us-ascii|utf-8|gb2312)"
bad_charset_score=13
# These words in the body count against it being a legitimate mail. The
# score is per word, but long mails that mention the words sparsely are
# punished less severely than short mails.
negative_words="(money|dollar|loose\ weight|click\ (here|to)|save|privacy|free\ |remove\ yourself|test|please\ ignore|ignore\ this|unsubscribe\ |e-?mail|microsoft|windows|be\ removed)"
negative_words_score=4
# I know where to get my porn, please don't bother me with yours.
porn="(sex|porn|tits|penis|cock|erection|hardon|horny|slut\ |uncensored|privacy|celeb|hotties|viagra|whore|toonz|adult|teen|pictures|video|membership|nude)"
porn_score=8
# I really detest excessive punctuation. This is for the subject line.
crap_chars="(!|?|\\*|\\$|\||-|\"|\')"
# Same as above, but many people use '-' and '*' in signatures, so this
# regular expression is for the body.
crap_chars_body="(!|?|\\||\\$)"
# This is the regular expression that we should skip when looking at subject
# lines.
re_skip_string="(re:|fwd:|fw:|sv:|\[gllug\])"
# I absolutely hate html mail. Set to "0" if you don't want to strip html.
html_content_score=13
# If the subject line ends in "[aoe09]" or something similar, it better not
# breach anything else; this is a *very* strong indication of spam.
spam_style_uniq_subject=`echo $(($limit - 1));`
# People that try to unsubscribe from mailing lists annoy me to bits.
plonk_unsubscribers=`echo $(($limit - $not_to_me + 1));`
# I don't like lots of quoted material. In the recipe further down each quoted
# line counts quote_ratio and each other line counts -1, so with 0.3 a post
# gets the score once it has more than about three quoted lines per
# non-quoted line. Set it to any value you wish. The score is the one that
# will be added to the message, regardless of severity. This might change.
quote_ratio=0.3
quote_score=5
# More severity configure options.
no_subject=17
all_caps_subject=13
not_plain_text=7
# The actual rule set begins here. Little configuration should be needed from
# this point on.
# If you want a different header to the "X-SPAM:" thingie that is default, it
# can be changed here.
xspam="X-SPAM:"
# Do not just remove this line. Because we add a newline before any of the
# extra X-SPAM: lines below, this variable needs to be set.
message="$xspam This is SpamTracker version $version by Stig Brautaset."
score=0
# Keep a backup copy of the latest 200 messages.
:0 c
backup
:0 ic
| cd backup && rm -f dummy `ls -t msg.* | sed -e 1,200d`
# We don't want to get duplicate mails.
:0 Whc: msgid.lock
* recive_duplicated ?? no
| formail -D 16384 msgid.cache
:0 a:
$duplicates
# This recipe is for adding addresses to the white/blacklists.
:0
* $ ^To.*$my_addresses
* $ ^From.*$my_addresses
{
# Add addresses from the subject line to the list of known good
# senders.
:0 i:
* ^Subject:.*white[ ]*\/.+
| echo $MATCH >> $whitelist
# Add addresses from the subject line to the list of known spammers.
:0 i:
* ^Subject:.*black[ ]*\/.+
| echo $MATCH >> $blacklist
# Just plonk the thread, not the sender.
:0 w:
* ^Subject:.*thread
| formail +1 -ds formail -x"Message-Id:" >> $spam_id
# If I forward spam to myself, the sender(s) of the original mail
# should be blacklisted, and the thread will be classified as spam.
:0
* ^Subject: Fwd:
{
# It should not be necessary to take a copy of the message,
# but I have not found out how to deal with this otherwise.
:0 c:
| formail +1 -ds formail -cx"Message-Id:" >> $spam_id
:0 c:
| formail +1 -ds formail -cx"From:" | sed -e 's/.*\( \|<\)\([[:alnum:]@._-]\+.[A-z]\+\).*/\2/' >> $blacklist
:0 :
| formail +1 -ds >> $spam_folder
}
}
# First we take out mail that is an answer to a mail previously black holed,
# or mail from people on our shitlist. This saves a bit of processing, and is
# very useful on mailing lists where much of the spam is people commenting on
# or replying to spam.
:0
* 1^0 ? formail -cx"References:" -x"In-Reply-To:" | fgrep -is -f $spam_id
* 1^0 ? formail -cx"From" -cx"From:" -cx"Sender:" -cx"X-Envelope-Sender:" | fgrep -is -f $blacklist
{
:0 i:
* ^Message-Id:\/.+
| echo $MATCH >> $spam_id
:0 hf
| formail -A"$xspam Follow-up to previously caught spam, or from a blacklisted address."
:0 :
$spam_folder
}
# Filter out HTML, and leave a message that it is filtered. I am forever
# grateful to Bart Schaefer for this one.
:0
* ^Content-Type: text/html
* $ $html_content_score^0
{
message=`echo -e "$message\n\
$xspam HTML Content. SpamScore: $="`
score="$score + $="
:0 bfW
| (echo "[html stripped]"; lynx -dump -force_html -stdin)
:0 ahfw
| formail -i"Content-Type: text/plain"
}
:0
* ^Cc:\/.*
* $ MATCH ?? $my_addresses
* $ $cc_to_me^0
{
message=`echo -e "$message\n\
$xspam Message not directly to me. SpamScore: $="`
score="$score + $="
}
TMP=$MATCH
:0 E
* ^To:\/.*
* $ ! MATCH ?? $my_addresses
* $ $not_to_me^0
{
message=`echo -e "$message \n\
$xspam Message not addressed to me. SpamScore: $="`
score="$score + $="
}
:0 i
* ^To:\/.*
TMP=| echo "$TMP $MATCH"
# Messages with many recipients are marked down.
:0
* -5^0
* 5^0.8 TMP ?? @
{
message=`echo -e "$message \n\
$xspam To:/Cc: contain several addresses. SpamScore: $="`
score="$score + $="
}
:0
* ^Content-Type:\/.*
* ! MATCH ?? (text/plain|multipart/signed)
* $ $not_plain_text^0
{
message=`echo -e "$message \n\
$xspam Format of message is not plain text. SpamScore: $="`
score="$score + $="
}
# If the message has no subject, or it consists entirely of spaces/tabs, it's
# likely spam, and SpamScore is written to the header. Otherwise extract the
# subject but drop any occurrence of Re:/fwd:/SV: at the beginning of the line
# and send it to the second part of this recipe, which checks whether the
# extracted part contains any lower-case characters. If not, SpamScore is
# written to the header. Thanks to David W. Tamkin for this technique.
:0
* $ ! ^Subject:[ ]*($re_skip_string[ ]*)+\/.+
* ! ^Subject: \/.+
* $ $no_subject^0
{
message=`echo -e "$message \n\
$xspam No subject. SpamScore: $="`
score="$score + $="
}
:0 ED
* ! MATCH ?? [a-z]
* $ $all_caps_subject^0
{
message=`echo -e "$message \n\
$xspam No lower case characters in subject. SpamScore: $="`
score="$score + $="
}
# Give bad marks for more than one consecutive $crap_chars in Subject (still
# in "MATCH" from last recipe) and for multiple forwards or includes.
:0
* $ 3^0.7 MATCH ?? $crap_chars[ ]*$crap_chars
* $ 4^0.6 MATCH ?? $re_skip_string[ ]*$re_skip_string
{
message=`echo -e "$message \n\
$xspam Crap characters in subject. SpamScore: $="`
score="$score + $="
}
:0
* $ $plonk_unsubscribers^0 MATCH ?? ^[ ]*unsubscribe[ ]*$
{
message=`echo -e "$message \n\
$xspam \"unsubscribe\" in subject. SpamScore: $="`
score="$score + $="
}
# Spammers nowadays seem to like to make their subject lines unique by
# appending a number/string in square brackets at the end of their subject
# fields. Oh well, a strong indicator of spam, and easy to spot to boot ;)
:0
* $ $spam_style_uniq_subject^0 MATCH ?? ( )\[[A-z0-9]+\]$
{
message=`echo -e "$message \n\
$xspam Subject has spam-style unique header. SpamScore: $="`
score="$score + $="
}
:0
* $ $crap_domain_score^0 $crap_domains
{
message=`echo -e "$message \n\
$xspam Header has crap domain in it. SpamScore: $="`
score="$score + $="
}
:0 HB
* charset=()\/.+
* $ ! MATCH ?? $ok_charsets
* $ $bad_charset_score^0
{
message=`echo -e "$message \n\
$xspam Reference to depreciated charset. SpamScore: $="`
score="$score + $="
}
# Do some checks in the body as well, but only if it is in an understandable
# format.
:0 B
* ! ^Content-Disposition: attachment
* 2^1 $ $crap_chars_body[ ]*$crap_chars_body[ ]*$crap_chars_body
{
message=`echo -e "$message \n\
$xspam Adjacent crap characters in body. SpamScore: $="`
score="$score + $="
}
:0 B
* $ $quote_ratio^1 ^>
* -1^1 ^[^>]
{
message=`echo -e "$message \n\
$xspam Quoted lines exceed 1:$quote_ratio limit. SpamScore: $quote_score"`
score="$score + $quote_score"
}
# Give bad marks for mentioning negative words.
:0 B
* $ $negative_words_score^1 $negative_words
* 5^0
* -5^1 > 2000
{
message=`echo -e "$message \n\
$xspam Hits for negative words. SpamScore: $="`
score="$score + $="
}
# I know where to get porn. Don't bother me with your crap.
:0 B
* $ $porn_score^1 $porn
* 5^0
* -5^1 > 2000
{
message=`echo -e "$message \n\
$xspam Hits for porn. SpamScore: $="`
score="$score + $="
}
:0 B
* $ $positive_words_score^1 $positive_words
* -5^1 > 1000
{
message=`echo -e "$message \n\
$xspam BODY: Positive words hit. Score: -$="`
score="$score - $="
}
:0 f
| formail -A"$message"
score=`echo $(( $score ));`
# Check whether we went over the limit, and put a report in if we did.
:0
* $ $score^0
* $ -$limit^0
{
:0 f
| formail -A "$xspam Total SpamScore ($score) exceeded limit ($limit) by $= points."
# If the message is spam, then record its message id so we can plonk
# follow-ups to it.
:0 i:
* ^Message-Id:\/.+
DUMMY=| echo $MATCH >> $spam_id
# Finally, deliver the message in the spam folder if sender is not on
# whitelist.
:0 a:
* ! ? formail -cx"From:" | fgrep -is -f $whitelist
$spam_folder
:0 Ef
| formail -A "$xspam Message indicates spam, but sender is on whitelist."
}
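
A note on the arithmetic, since the weights in the recipes can look cryptic. As I read procmailsc(5), a condition `w^x` adds w for the first match, w*x for the second, w*x*x for the third and so on, and a recipe only fires if the total is positive. The file then strings the per-recipe scores together in $score and evaluates the sum once at the end. The sketch below mimics both steps in plain shell; `weighted_score` and `over_limit` are names invented for the illustration and are not part of the rc file.

```shell
#!/bin/sh
# Sketch only: sum of the first n terms of w, w*x, w*x*x, ...
# This mirrors how a "w^x" condition is assumed to score n matches.
weighted_score() {
    LC_ALL=C awk -v w="$1" -v x="$2" -v n="$3" 'BEGIN {
        total = 0; term = w
        for (i = 0; i < n; i++) { total += term; term *= x }
        printf "%.1f\n", total
    }'
}

# Evaluate an accumulated expression such as "0 + 13 + 10 + 9" (the kind of
# string the rc file builds in $score) and succeed if it exceeds the limit.
over_limit() {
    total=$(( $1 ))
    [ "$total" -gt "$2" ]
}
```

For the many-recipients recipe (`-5^0` plus `5^0.8` per `@`), three addresses give `weighted_score 5 0.8 3`, which prints 12.2 before the offset of 5 is taken off; a single address nets exactly 0, so the recipe stays quiet. `over_limit "0 + 13 + 10 + 9" 28` succeeds because 32 is greater than 28.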