[Lancaster] Re. Spam email from China

Mark Williams llug at lodestar.icom43.net
Wed Jul 3 02:42:46 UTC 2013


On Sun, 30 Jun 2013 16:00:13 +0100, Ken Walker <keneawalker at gmail.com> wrote:

> Hi Ken,
>
> I've just checked google for this. Apparently spam assassin can remove
> emails based on language.

[and prev:]
On Sun, 30 Jun 2013 09:35:26 +0100, Ken Hough <kenhough at btinternet.com> wrote:

> What is the best way to filter spam email that contains Chinese characters?

SpamAssassin can filter mail based on language by explicitly allowing some
languages (but cf ** below) - see ok_locales in:

http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Conf.html

and (re. v3.3) ok_languages in:

http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Plugin_TextCat.html

and for an example of how to use (tho' it assumes you're feeding into amavis,
but no matter) see:
http://blog.le-vert.net/?p=67

**
Allowing languages or locales/character-sets and scoring the rest as likely to
be spam isn't really satisfactory. To score mail in specific language(s) as
likely to be spam you need a different approach; for an example see:

http://www.void.gr/kargig/blog/2011/05/01/block-spam-with-russian-encoding-using-spamassassin/

where KOI8-R is a Cyrillic character encoding. Note that the score does not
block the mail, but causes it to be classed as spam if the total score is high
enough.
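
To make the "score, not block" point concrete, here is a rough Python sketch of
the idea. It is not SpamAssassin's code: the rule names and scores are made up
for illustration, and the 5.0 threshold is only SpamAssassin's default
required_score, which many sites change.

```python
from typing import Dict

# Assumption: 5.0 is SpamAssassin's default required_score; sites often tune it.
REQUIRED_SCORE = 5.0


def is_spam(rule_scores: Dict[str, float], required: float = REQUIRED_SCORE) -> bool:
    """Class a message as spam only if the summed rule scores reach the threshold."""
    return sum(rule_scores.values()) >= required


# Hypothetical rule names and scores, for illustration only.
charset_only = {"CHARSET_RULE": 3.0}
charset_plus_other = {"CHARSET_RULE": 3.0, "OTHER_RULE": 2.5}

print(is_spam(charset_only))        # charset rule alone stays under the threshold
print(is_spam(charset_plus_other))  # combined with another hit, it goes over
```

So a charset rule on its own only nudges a message towards spam; how hard it
nudges depends on the score you give it.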

Most Chinese spam is character set GB2312, but a small minority is UTF-8. If you
look at the message source you will see charset="gb2312" for the content and
=?GB2312 or =?gb2312 usually prefixing (or within) the Subject line and often
the From line. I've not (yet) seen any spam in Big5 Traditional Chinese.
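
If you want to check a saved message for those markers yourself, something
like the following Python sketch would do it. It is only an illustration
using the standard email module, not a SpamAssassin rule; the function name
and the sample message are my own inventions.

```python
import re
from email import message_from_string

# An RFC 2047 encoded-word that declares GB2312, e.g. =?GB2312?B?...?=
ENCODED_WORD_GB2312 = re.compile(r"=\?gb2312\?", re.IGNORECASE)


def declares_gb2312(raw_message: str) -> bool:
    """True if any body part, or the Subject/From header, declares GB2312."""
    msg = message_from_string(raw_message)
    # Check the charset parameter on each part's Content-Type.
    for part in msg.walk():
        if part.get_content_charset() == "gb2312":
            return True
    # Check the raw Subject and From headers for GB2312 encoded-words.
    for header in ("Subject", "From"):
        if ENCODED_WORD_GB2312.search(str(msg.get(header, ""))):
            return True
    return False


# Made-up sample message for illustration.
SAMPLE = (
    "From: =?GB2312?B?1tDOxA==?= <sender@example.com>\n"
    "Subject: =?gb2312?B?1tDOxA==?=\n"
    'Content-Type: text/plain; charset="gb2312"\n'
    "\n"
    "body\n"
)

print(declares_gb2312(SAMPLE))
```

Note this only looks at what the message declares; a message can claim one
charset and contain another.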

I don't know what to do about Chinese UTF-8 spam unless you can filter on the
characters for job, property, good fortune, etc :-(  I guess it's down to block
lists and improvements to Bayesian capabilities.
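
For what it's worth, once the text is decoded you can at least measure how much
of it is Chinese-looking. The Python sketch below is an assumption-laden
illustration, not a tested filter: it counts characters in the CJK Unified
Ideographs block (U+4E00 to U+9FFF), which also covers Japanese kanji, and the
function name is hypothetical.

```python
def cjk_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the CJK Unified Ideographs block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # U+4E00..U+9FFF is the main CJK block; it does not tell Chinese from Japanese.
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


print(cjk_ratio("hello world"))
print(cjk_ratio("ab\u5de5\u4f5c"))  # two Latin letters plus two ideographs
```

A ratio like this could feed a score in the same way as the charset rules, but
it would also catch legitimate Chinese or Japanese mail, so it is no use to
anyone who expects any.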

--
Mark Williams




